Covid-19 in the USA | What Do the Numbers & Twitter Say?

May 26, 2020

Passengers are seen wearing protective masks and gloves at Miami International Airport in Miami, Florida, United States on March 29, 2020. *Eva Marie Uzcategui | Anadolu Agency | Getty Images*

The United States currently has the most coronavirus cases of any country, with more than 1.7 million. It is extremely important to look at the public health data and learn how each state is doing according to the numbers.

For this article, we will analyze public health data on Covid-19 to compute and visualize the infection rate of each state. The data will become outdated as confirmed cases continue to rise. However, this is still great practice for learners. We will also analyze tweets about Covid-19 in the US to learn what people are saying about it.

Warning: The R scripts used here are polished and short. You need to know the basics of R and have a basic understanding of the Twitter API to follow and work with them.

Note to Readers with NO Programming Knowledge: Please, feel free to ignore the R scripts and other programming aspects. Plots with descriptions and my reports are for everyone.

Resources

To learn or refresh your knowledge of R, DataCamp is extremely helpful. You just need to create a free account.
The free online textbook Introduction to Data Analysis and Prediction Algorithms with R is also really helpful.

R tips

Use Microsoft R Open if possible. It comes with the multi-threaded math library (MKL) and checkpoint package.
But don’t hesitate to switch to regular R if you run into trouble. In my experience, Microsoft R Open can have some bizarre problems. Don’t waste too much time on fixing them. Switching to regular R often solves the problem.
Install regular R via CRAN (Instructions for Ubuntu). Install checkpoint to ensure reproducibility (it is not exclusive to Microsoft R Open).
Use RStudio and make good use of its Console window. Some people hold strong feelings against R because of how it differs from other major programming languages. In fact, a lot of the confusion can be resolved with simple queries in the Console. Not sure whether the vector index starts from zero or one? c(1,2,3)[1] tells you it’s one. Querying 1:10 tells you the result includes 10 (unlike a range in Python).
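The two Console queries mentioned in the last tip can be tried directly. Here is a minimal sketch in base R; the variable names are only illustrative.

```r
# R vectors are indexed from 1, so [1] returns the first element
v <- c(1, 2, 3)
first <- v[1]

# The colon operator includes both endpoints
one_to_ten <- 1:10
last <- one_to_ten[length(one_to_ten)]

print(first)               # 1
print(last)                # 10
print(length(one_to_ten))  # 10
```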

Accessing the Data

The data comes from WorldOMeter. Please click here for its link on GitHub. The Excel file was imported into a data frame in RStudio (called uscovid19infection in the scripts below).

Data Visualization

First, we load the libraries that provide the functions we want to use. These are some of the most powerful packages in R. Please click here to read about them at your convenience.

# load libraries
library(tidyverse)
library(ggrepel)
library(ggthemes)
library(dslabs)

Now, we define the intercept of the reference line in our plot, which represents the national average infection rate. We define an object r as the total number of cases across all states divided by the total population of those states, multiplied by 1,000,000. This gives the average number of cases per 1 million people.

# define the intercept
r <- uscovid19infection %>%
  summarize(rate = sum(cases) / sum(population) * 10^6) %>%
  .$rate
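To see what r means and why log10(r) can serve as the intercept of the line, here is a small base R sketch. The data frame toy and all of its numbers are made up purely for illustration; they are not the WorldOMeter figures.

```r
# Made-up numbers for illustration only (not real Covid-19 data)
toy <- data.frame(
  state      = c("A", "B", "C"),
  population = c(2e6, 5e6, 3e6),
  cases      = c(4000, 5000, 1000)
)

# Same formula as for r: total cases per one million people
# Here: 10000 cases / 10 million people = 1000 cases per million
r_toy <- sum(toy$cases) / sum(toy$population) * 10^6

# A state exactly at the average rate has cases = r * (population / 10^6).
# Taking log10 of both sides gives a straight line with slope 1:
#   log10(cases) = log10(r) + log10(population / 10^6)
pop_millions   <- 4
expected_cases <- r_toy * pop_millions
lhs <- log10(expected_cases)
rhs <- log10(r_toy) + log10(pop_millions)

print(r_toy)
print(all.equal(lhs, rhs))
```

On log-log axes, states above this line have a higher case rate than the national average, and states below it have a lower one.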

Now, we make our plot. We use the ggplot function to plot the number of cases against the population (in millions), with both axes on a log scale.

# make the plot, combining all elements
uscovid19infection %>%
  ggplot(aes(population/10^6, cases, label = state)) +
  geom_abline(intercept = log10(r), lty = 1, color = "red") +
  geom_point(aes(col = region), size = 2) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of cases (log scale)") +
  ggtitle("US Covid-19 Cases") +
  scale_color_discrete(name = "Region") +
  theme_economist()

It seems that Northeastern states like New Jersey, Massachusetts, New York, Pennsylvania and Connecticut have high infection rates with New York and New Jersey being in the top 2. On the other hand, some Western states have low infection rates with Alaska, Montana, Wyoming and Hawaii being in the bottom 4. North Central and Southern states are distributed across the plot.

Related reading: The Coronavirus Is Deadliest Where Democrats Live (www.nytimes.com)

What Are Twitter Users Saying?

Now, let’s look at what Twitter users are saying. We will analyze the 3,000 most recent public tweets tagged #covid19us.

Getting the data and distribution of tweets

First of all, follow the instructions in the article Accessing Data from Twitter API using R (medium.com) to obtain your own API key and access token, and install the twitteR package.

You need these four variables:

consumer_key <- "FILL HERE"
consumer_secret <- "FILL HERE"
access_token <- "FILL HERE"
access_secret <- "FILL HERE"

The main function used in this post is searchTwitter. Here it searches for up to 3,000 recent public tweets in English that contain #covid19us.

library(twitteR)

# authenticate, then fetch the tweets
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
covid19us.tweets <- searchTwitter("#covid19us", n = 3000, lang = "en")

Now, we clean the data: we remove punctuation, convert all text to lower case, and drop stop words, a few other uninformative words, and URLs. Then we count how often each remaining word occurs.

library(tm)

# extract the text of each tweet and build a corpus
covid_text <- sapply(covid19us.tweets, function(x) x$getText())
covid_text_corpus <- Corpus(VectorSource(covid_text))

# strip punctuation, lower-case, and drop stop words plus a few extra terms
covid_text_corpus <- tm_map(covid_text_corpus, removePunctuation)
covid_text_corpus <- tm_map(covid_text_corpus, content_transformer(tolower))
covid_text_corpus <- tm_map(covid_text_corpus, removeWords, stopwords())
covid_text_corpus <- tm_map(covid_text_corpus, removeWords, c("rt", "are", "that"))

# remove URLs (punctuation is already gone, so links look like "httpstco...")
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
covid_text_corpus <- tm_map(covid_text_corpus, content_transformer(removeURL))

# count how often each term occurs across all tweets, most frequent first
covid_2 <- TermDocumentMatrix(covid_text_corpus)
covid_2 <- as.matrix(covid_2)
covid_2 <- sort(rowSums(covid_2), decreasing = TRUE)
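The order of the cleaning steps matters for the URL pattern. Because punctuation is stripped first, a link such as https://t.co/abc123 has already collapsed into the single token httpstcoabc123 by the time removeURL runs, which is why the pattern only needs to match letters and digits. The sketch below imitates the steps on one made-up tweet using base R only. The gsub call on [[:punct:]] is an approximation of what removePunctuation does, not its exact implementation, and the final whitespace cleanup is an extra step that the script above does not perform.

```r
# A made-up tweet for illustration
tweet <- "RT Stay safe! https://t.co/abc123 #covid19us"

# Approximate tm's removePunctuation with a base R regular expression
no_punct <- gsub("[[:punct:]]+", "", tweet)

# Lower-case the text, as content_transformer(tolower) does
lower <- tolower(no_punct)

# Same pattern as removeURL in the script
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
no_url <- removeURL(lower)

# Extra step (not in the script): collapse the gap left by the URL
cleaned <- gsub("\\s+", " ", trimws(no_url))
print(cleaned)
```

Note that after lower-casing, the retweet marker reads rt, so a removal list has to contain it in lower case to match.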

Then, we create a data frame of the words and their frequencies.

covid_2 <- data.frame(word = names(covid_2),freq=covid_2)
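What the sorted counts and the resulting data frame contain is easier to see on a tiny example. The sketch below counts words in three made-up documents with base R's table function instead of a TermDocumentMatrix, so it is an analogy for the tm pipeline rather than the same code path.

```r
# Three made-up documents for illustration
docs <- c("stay safe", "stay home", "stay safe please")

# Split into words and count occurrences across all documents
words  <- unlist(strsplit(docs, " "))
counts <- sort(table(words), decreasing = TRUE)

# Same shape as covid_2: one row per word, most frequent first
word_freq <- data.frame(
  word = names(counts),
  freq = as.vector(counts),
  stringsAsFactors = FALSE
)
print(word_freq)
```

Taking the first rows of such a data frame, for example with word_freq[1:2, ], gives the most frequent words, which is how the top 25 are selected for the plot.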

Now, we adjust the margins of our plot and plot the 25 most frequent words.

par(mar = c(12, 5, 5, 5)) # Set the margin on all sides
barplot(covid_2[1:25,]$freq, las = 2, names.arg = covid_2[1:25,]$word,
        col ="blue", main ="Most frequent words in Tweets",
        ylab = "Word frequencies")

Some interesting things stand out in the plot above. Words like “kag2020” and “memaga” appear a lot. “kag2020” stands for “Keep America Great 2020”, the slogan for President Donald Trump’s re-election campaign, and “memaga” is used by supporters of “Make America Great Again”, President Trump’s current slogan. This suggests that some of the Americans tweeting under this hashtag support the current president and his efforts in the pandemic.

We also see hopeful words like “prayers”.

Related reading: Americans are crowding public places and officials fear possibility of spikes in coronavirus cases (www.cnn.com)

The end

Thanks for reading! I hope this will help Data Science learners and will provide some insight. Please, stay safe.