Get your dataset ready for the analysis
Stata has been my main analysis tool for the last year, mainly because it is what everybody uses around here, recently R is becoming increasingly accesible with the arrival of the dplyr
and tidyr
packages.
Inspired by this Data wrangling webinar offered by Rstudio I thought of trying my new abilities with a real dataset.
I found a great Ebola dataset compiled by Caitlin Rivers that provided sufficient challenge for me.
Get the the data
The first challenge was getting the data. Even stored as a csv
file this was not a straight forward task since
read.csv("https://raw.githubusercontent.com/cmrivers/ebola/master/country_timeseries.csv")
resulted in Error in file(file, “rt”) : cannot open the connection. After some research I found that the proper way of importing a csv
from an URL is using getURL
from the RCurl
package. My final code for importing the data looks like this:
library(RCurl)
library(dplyr)
x <- getURL("https://raw.githubusercontent.com/cmrivers/ebola/master
/country_timeseries.csv")
#import as data table
ts <- tbl_df(read.csv(text = x))
Now lets get a look at what we have using glimpse(ts)
Variables:
$ Date (fctr) 1/5/2015, 1/4/2015, 1/3/...
$ Day (int) 289, 288, 287, 286, 284, ...
$ Cases_Guinea (int) 2776, 2775, 2769, NA, 273...
$ Cases_Liberia (int) NA, NA, 8166, 8157, 8115,...
$ Cases_SierraLeone (int) 10030, 9780, 9722, NA, 96...
$ Cases_Nigeria (int) NA, NA, NA, NA, NA, NA, N...
$ Cases_Senegal (int) NA, NA, NA, NA, NA, NA, N...
$ Cases_UnitedStates (int) NA, NA, NA, NA, NA, NA, N...
$ Cases_Spain (int) NA, NA, NA, NA, NA, NA, N...
$ Cases_Mali (int) NA, NA, NA, NA, NA, NA, N...
$ Deaths_Guinea (int) 1786, 1781, 1767, NA, 173...
$ Deaths_Liberia (int) NA, NA, 3496, 3496, 3471,...
$ Deaths_SierraLeone (int) 2977, 2943, 2915, NA, 282...
$ Deaths_Nigeria (int) NA, NA, NA, NA, NA, NA, N...
$ Deaths_Senegal (int) NA, NA, NA, NA, NA, NA, N...
$ Deaths_UnitedStates (int) NA, NA, NA, NA, NA, NA, N...
$ Deaths_Spain (int) NA, NA, NA, NA, NA, NA, N...
$ Deaths_Mali (int) NA, NA, NA, NA, NA, NA, N...
Well, it worked, but the table is in wide format!. I know that most of the analyses I would like to perform need that the data is in long format, let’s do that.
Prepare your data
What I want to do here can be described in the following steps:
- reshape the data from wide to long format
- extract the name of the country and the type of observation (Case or Death)
- get rid of missing values (flagged as NA)
Let’s do it!
library(tidyr)
# assign everything to a new table
cases <- ts %>%
# from columns 3 to 18 put the name of the variable in column "C" and the value of the cell in "N"
gather(., "C", "N", 3:18) %>%
# split the string in "C" at the "_" sign and put the results in "Type" and "Country"
separate(., C, c("Type", "Country"), sep = "_") %>%
# get rid of the NA's
filter(N != "NA")
# let's peek
glimpse(cases)
Variables:
$ Date (fctr) 1/5/2015, 1/4/2015, 1/3/2015, 12/31/2014,...
$ Day (int) 289, 288, 287, 284, 281, 280, 277, 273, 27...
$ Type (chr) "Cases", "Cases", "Cases", "Cases", "Cases...
$ Country (chr) "Guinea", "Guinea", "Guinea", "Guinea", "G...
$ N (int) 2776, 2775, 2769, 2730, 2706, 2695, 2630, ...
I think I’ve managed to do what I wanted in a few lines of code so let’s wrap up this post with a quick plot.
library(ggplot2)
qplot(data = cases,
x = Day,
y = N,
color = Country,
group = Country,
facets = . ~ Type,
geom = "line")
Thanks for reading, Cheers!