The United States currently has the most coronavirus cases, with more than 1.7 million. It is extremely important to look at the public health data and learn how each state is doing according to the numbers.
In this article, we will analyze public health data on Covid-19 to visualize the infection rate of each state. The numbers will become outdated as confirmed cases continue to rise, but the analysis itself is great practice for learners. We will also analyze tweets about Covid-19 in the US to learn what people are saying about it.
Warning: The R scripts used here are short and polished. You should know the basics of R and have some understanding of the Twitter API to follow along and work with them.
Note to Readers with NO Programming Knowledge: Please feel free to ignore the R scripts and other programming aspects. The plots, descriptions, and my commentary are for everyone.
- To learn or refresh your knowledge of R, DataCamp is extremely helpful. You just need to create a free account.
- The free online textbook Introduction to Data Analysis and Prediction Algorithms with R is also really helpful.
- Use Microsoft R Open if possible. It comes with the multi-threaded math library (MKL) and checkpoint package.
- But don’t hesitate to switch to regular R if you run into trouble. Microsoft R Open can have some bizarre problems in my personal experience. Don’t waste too much time fixing them; switching to regular R often solves the problem.
- Install regular R via CRAN (Instructions for Ubuntu). Install checkpoint to ensure reproducibility (it is not a Microsoft R Open exclusive).
- Use RStudio and make good use of its Console window. Some people hold strong feelings against R because of some of its quirks compared to other major programming languages. In fact, a lot of the confusion can be resolved with simple queries in the Console. Not sure whether vector indexing starts at zero or one? c(1,2,3)[1] tells you it starts at one. Querying 1:10 tells you the result includes 10 (unlike Python’s range).
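For instance, those two checks can be typed directly into the Console (the variable name v is just for illustration):

```r
# quick Console checks of R's conventions
v <- c(1, 2, 3)
v[1]          # returns 1: indexing starts at one, not zero
length(1:10)  # returns 10: the sequence 1:10 includes its endpoint
```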
Accessing the Data
First, we load the libraries we need to access the functions we want to use. These are some of the most powerful packages in R; please read about them at your convenience.
# load libraries (the exact set is inferred from the functions used below)
library(tidyverse)  # dplyr and ggplot2 for wrangling and plotting
library(twitteR)    # Twitter API access (setup_twitter_oauth, searchTwitter)
library(tm)         # text mining (Corpus, tm_map, TermDocumentMatrix)
Now, we define the intercept of the reference line in our plot, which will run through all points sharing the average infection rate. We define an object r in which the total cases across states are divided by the total population and multiplied by 1,000,000, giving the overall US infection rate per 1 million people.
# define the intercept
r <- uscovid19infection %>%
  summarize(rate = sum(cases) / sum(population) * 10^6) %>%
  pull(rate)
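To see what this per-million calculation does, here is a minimal sketch with two hypothetical states; the case counts and populations are made up for illustration only:

```r
# toy version of the rate calculation: total cases per 1 million people
cases <- c(1000, 5000)     # hypothetical case counts
population <- c(2e6, 8e6)  # hypothetical state populations
rate <- sum(cases) / sum(population) * 10^6
rate  # 600 cases per million across the two states combined
```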
Now, we make our plot. We use the ggplot function to plot the cases against the population (in millions), with both axes on log scales.
# make the plot, combining all elements
uscovid19infection %>%
  ggplot(aes(population/10^6, cases, label = state)) +
  geom_abline(intercept = log10(r), lty = 1, color = "red") +
  geom_point(aes(col = region), size = 2) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of cases (log scale)") +
  ggtitle("US Covid-19 Cases") +
  scale_color_discrete(name = "Region")
It seems that Northeastern states like New Jersey, Massachusetts, New York, Pennsylvania and Connecticut have high infection rates with New York and New Jersey being in the top 2. On the other hand, some Western states have low infection rates with Alaska, Montana, Wyoming and Hawaii being in the bottom 4. North Central and Southern states are distributed across the plot.
What Are Twitter Users Saying?
Now, let’s look at what Twitter users are saying. We will analyze the 3,000 most recent public tweets with #covid19us.
Getting the data and distribution of tweets
First of all, follow the instructions in the article “Accessing Data from Twitter API using R” to obtain your own API key and access token and to install the required packages.
You need these four variables:
consumer_key <- "FILL HERE"
consumer_secret <- "FILL HERE"
access_token <- "FILL HERE"
access_secret <- "FILL HERE"
The main entry point for this post is searchTwitter. It will search for the 3,000 most recent public tweets in English with the hashtag #covid19us.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
covid19us.tweets <- searchTwitter("#covid19us", n = 3000, lang = "en")
Now, we clean the data: we remove URLs, punctuation, and uninformative words, and make all text lower case.
covid_text <- sapply(covid19us.tweets, function(x) x$getText())
covid_text_corpus <- Corpus(VectorSource(covid_text))
covid_text_corpus <- tm_map(covid_text_corpus, removePunctuation)
covid_text_corpus <- tm_map(covid_text_corpus, content_transformer(tolower))
covid_text_corpus <- tm_map(covid_text_corpus, function(x) removeWords(x, stopwords()))
covid_text_corpus <- tm_map(covid_text_corpus, removeWords, c("RT", "are", "that"))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
covid_text_corpus <- tm_map(covid_text_corpus, content_transformer(removeURL))
covid_2 <- TermDocumentMatrix(covid_text_corpus)
covid_2 <- as.matrix(covid_2)
covid_2 <- sort(rowSums(covid_2), decreasing = TRUE)
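The rowSums step is what turns the term-document matrix into word counts. Here is a minimal sketch using a tiny hand-made matrix (the words and counts are invented for illustration):

```r
# each row is a word, each column a tweet; rowSums gives total frequency
m <- matrix(c(1, 0, 2,
              3, 1, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("covid", "prayers"), NULL))
freqs <- sort(rowSums(m), decreasing = TRUE)
freqs  # prayers = 4, covid = 3
```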
Then, we create a data frame of the words and their frequencies.
covid_2 <- data.frame(word = names(covid_2),freq=covid_2)
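The resulting data frame has one row per word. A tiny hand-made version with invented counts shows its shape:

```r
# same construction as above, on a named vector of invented counts
word_freqs <- c(covid19us = 120, prayers = 45)
df <- data.frame(word = names(word_freqs), freq = word_freqs)
df$word[1]  # "covid19us"
df$freq[1]  # 120
```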
Now, we adjust the margins of our plot and plot the 25 most frequent words.
par(mar = c(12, 5, 5, 5))  # set the margins on all sides
barplot(covid_2[1:25, ]$freq, las = 2, names.arg = covid_2[1:25, ]$word,
        col = "blue", main = "Most frequent words in Tweets",
        ylab = "Word frequencies")
Some interesting things are happening as we take a look at the plot above. We see words like “kag2020” and “memaga” appear a lot. “kag2020” stands for “Keep America Great 2020”, the slogan of President Donald Trump’s re-election campaign, and “memaga” refers to supporters of “Make America Great Again”, President Trump’s slogan. It seems that some Americans support the current president and his efforts in the current pandemic.
We also see hopeful words like “prayers” appear too.
Thanks for reading! I hope this helps Data Science learners and provides some insight. Please stay safe.