Here’s a page that I prepared for an R training course that I ran at my company. The idea was to show that, just like Apple, R has an app for that. OK, a package for that.
I’ll show a range of tricks made possible by R, including web scraping, text analysis and mapping. And I’ll use data from the recent Brexit referendum to keep it topical.
Let’s begin with the packages:
library(tidyverse)    # Does most of what you need
library(rvest)        # Scrapes web data (html_nodes, html_text)
library(xml2)         # Parses the scraped HTML (read_html)
library(htmlwidgets)  # For responsive exhibits
library(widgetframe)  # For responsive exhibits
library(DT)           # For responsive data tables
library(leaflet)      # For interactive map plots
library(maptools)     # For maps
library(sp)           # For map-like data
To understand the voting in the Brexit referendum, we need some data about it, which we can scrape from the FT’s page on the subject.
(Note that I’ve commented out much of the code below, so that I can show it on this static webpage. I ran the commented-out code separately and saved the results as an RDS file, which I then simply load for use on this page.)
# webResults <- read_html("https://ig.ft.com/sites/elections/2016/uk/eu-referendum/")
# webData <- webResults %>%
#   html_nodes("td:nth-child(1) , .area-state-3 .hideable") %>%
#   html_text()
# saveRDS(webData, "webData.rds")
webData <- readRDS("data/webData.rds")
head(webData)
## [1] "Boston"        "7,430"         "22,974"        "South Holland"
## [5] "13,074"        "36,423"
That gives us some data, but in an unhelpful format: everything sits in one long character vector, and the vote counts are strings with a comma as the thousands separator, so R cannot treat them as numbers. Fortunately, R can manipulate strings so that they make sense, such as by removing those commas:
# The scraped vector repeats in triples: area name, Remain votes, Leave votes
lWebData <- length(webData)
areaName    <- webData[seq(from = 1, to = lWebData - 2, by = 3)]
remainVotes <- webData[seq(from = 2, to = lWebData - 1, by = 3)]
leaveVotes  <- webData[seq(from = 3, to = lWebData, by = 3)]

# Remove the thousands separators so the counts can be converted to numbers
remainVotes <- gsub(",([0-9])", "\\1", remainVotes)
leaveVotes  <- gsub(",([0-9])", "\\1", leaveVotes)

resultsData <- as.data.frame(t(rbind(areaName, remainVotes, leaveVotes)),
                             stringsAsFactors = FALSE)
resultsData$remainVotes <- as.numeric(resultsData$remainVotes)
resultsData$leaveVotes  <- as.numeric(resultsData$leaveVotes)
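As a cross-check on the reshaping step, here is a minimal sketch of the same idea using `matrix(..., byrow = TRUE)` instead of three `seq()` calls. It only uses the six values printed above; the names `toyData`, `toyMatrix` and `toyResults` are hypothetical, and it assumes the vector length is an exact multiple of three.

```r
# Stand-in for the scraped vector, using only the six values printed above
toyData <- c("Boston", "7,430", "22,974",
             "South Holland", "13,074", "36,423")

# Fill a 3-column character matrix row by row: one row per area
toyMatrix <- matrix(toyData, ncol = 3, byrow = TRUE)

# Strip the commas and convert the vote counts to numbers
toyResults <- data.frame(
  areaName    = toyMatrix[, 1],
  remainVotes = as.numeric(gsub(",", "", toyMatrix[, 2], fixed = TRUE)),
  leaveVotes  = as.numeric(gsub(",", "", toyMatrix[, 3], fixed = TRUE)),
  stringsAsFactors = FALSE
)

print(toyResults)
```

If the scraped vector were ever a value short, `matrix()` would warn about the length mismatch, which is a handy early signal that the scrape has drifted.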
To understand this data better, it would help to map it, and the number of boundary files available for UK maps is extraordinary. R helps us to use these files, so that we can map each local authority (the areas over which the referendum votes were totalled). (As before, I’ve run the commented-out code beforehand and saved the RDS file for use on this page.)
# localAuthorityRaw <- readShapeSpatial(
#   "Local_Authority_District_(GB)_2015_boundaries_(generalised_clipped)/LAD_DEC_2015_GB_BGC.shp",
#   proj4string = CRS("+init=epsg:27700"))
# # Transform the data to use with ggmap
# localAuthorityClean <- spTransform(localAuthorityRaw, CRS("+init=epsg:4326"))
# # Turns the data into a dataframe
# localAuthorityCleanDF <- fortify(localAuthorityClean, region = "LAD15NM")
# saveRDS(localAuthorityCleanDF, "localAuthorityCleanDF.rds")
localAuthorityCleanDF <- readRDS("data/localAuthorityCleanDF.rds")
Being real-world data, though, the Local Authority names that we scraped from the web do not all match those in the boundary files. Fortunately, R can help us to find the mismatched labels on both sides.
l1 <- as.data.frame(unique(localAuthorityCleanDF$id), stringsAsFactors = FALSE)
colnames(l1) <- "locAuthID"
areaNameDF <- as.data.frame(areaName, stringsAsFactors = FALSE)

# Flag boundary names missing from the scraped results, and vice versa
l2 <- l1 %>% mutate(check = locAuthID %in% areaNameDF$areaName)
l3 <- areaNameDF %>% mutate(check = areaName %in% l1$locAuthID)

print(l2[l2$check == FALSE, ])
print(l3[l3$check == FALSE, ])
##                       locAuthID check
## 1                 Aberdeen City FALSE
## 40             Bristol, City of FALSE
## 71            City of Edinburgh FALSE
## 80                County Durham FALSE
## 96                  Dundee City FALSE
## 130                Glasgow City FALSE
## 152    Herefordshire, County of FALSE
## 171 Kingston upon Hull, City of FALSE
## 172        Kingston upon Thames FALSE
## 208         Newcastle upon Tyne FALSE
## 253        Richmond upon Thames FALSE
## 303                  St. Helens FALSE
##                 areaName check
## 24                  Hull FALSE
## 109        Herefordshire FALSE
## 134            St Helens FALSE
## 139               Durham FALSE
## 270  Newcastle-upon-Tyne FALSE
## 308     Northern Ireland FALSE
## 339               Dundee FALSE
## 345             Aberdeen FALSE
## 347 Kingston-upon-Thames FALSE
## 348              Bristol FALSE
## 360              Glasgow FALSE
## 366 Richmond-upon-Thames FALSE
## 374            Edinburgh FALSE
## 382            Gibraltar FALSE
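The same two-way comparison can also be sketched with base R's `setdiff()`, which returns the labels present in one set but not the other. The vectors below are hypothetical stand-ins that borrow a few names from the output above; they are not the full data.

```r
# Hypothetical stand-ins for the two sets of labels
boundaryNames <- c("Aberdeen City", "St. Helens", "Boston")
scrapedNames  <- c("Aberdeen", "St Helens", "Boston", "Gibraltar")

# Names in the boundary file with no match in the scraped results
missingFromResults <- setdiff(boundaryNames, scrapedNames)

# Names in the scraped results with no match in the boundary file
missingFromBoundary <- setdiff(scrapedNames, boundaryNames)

print(missingFromResults)
print(missingFromBoundary)
```

This loses the row numbers that the data frame approach keeps, so it suits a quick check rather than index-based clean-up.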
Knowing which labels are mismatched, we can rename them to match the boundary files (and drop the two areas, Northern Ireland and Gibraltar, that have no boundary data).
# Given these mislabelled regions, alter the names
resultsData$areaName <- recode(
  resultsData$areaName,
  Hull = "Kingston upon Hull, City of",
  Herefordshire = "Herefordshire, County of",
  `St Helens` = "St. Helens",
  Durham = "County Durham",
  `Newcastle-upon-Tyne` = "Newcastle upon Tyne",
  Dundee = "Dundee City",
  Aberdeen = "Aberdeen City",
  `Kingston-upon-Thames` = "Kingston upon Thames",
  Bristol = "Bristol, City of",
  Glasgow = "Glasgow City",
  `Richmond-upon-Thames` = "Richmond upon Thames",
  Edinburgh = "City of Edinburgh")

# Drop NI and Gibraltar (rows 308 and 382 in the check above). The trailing
# comma matters: without it the negative indices would apply to columns.
mapDataSummary <- resultsData[c(-308, -382), ]
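The join in the next step relies on a table called `mapDataLngLat`, which is never built on this page. One plausible way to derive it, offered purely as an assumption, is to average the boundary vertices of each authority into a rough centre point. The sketch below uses an invented toy data frame; the column names `long`, `lat` and `id` are assumed to match what `fortify()` produces, and the coordinates are made up.

```r
# Toy stand-in for localAuthorityCleanDF: one row per boundary vertex.
# Column names (id, long, lat) are an assumption; coordinates are invented.
toyBoundaryDF <- data.frame(
  id   = c("Boston", "Boston", "Boston", "South Holland", "South Holland"),
  long = c(-0.10, 0.00, 0.10, -0.20, 0.00),
  lat  = c(52.9, 53.0, 53.1, 52.7, 52.9),
  stringsAsFactors = FALSE
)

# Average the vertex coordinates within each authority
toyLngLat <- aggregate(
  toyBoundaryDF[, c("long", "lat")],
  by  = list(areaName = toyBoundaryDF$id),
  FUN = mean
)

# Use the column names that the join and the map code expect
names(toyLngLat) <- c("areaName", "avLng", "avLat")

print(toyLngLat)
```

A vertex average is only a crude centre: it is pulled towards stretches of boundary with many points, so a proper centroid calculation would be more accurate if placement matters.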
We can then join the map data to the voting data and determine the proportion of leave votes, along with the total number of votes cast in each Local Authority. The following interactive table provides the details.
mapDataFinal <- mapDataSummary %>%
  left_join(mapDataLngLat, by = "areaName") %>%
  mutate(leaveShare = round(leaveVotes / (leaveVotes + remainVotes), 2)) %>%
  mutate(size = leaveVotes + remainVotes)

mapDataFinal <- mapDataFinal[complete.cases(mapDataFinal), ]

dt <- datatable(
  mapDataFinal,
  rownames = FALSE,
  options = list(
    dom = 'tip',
    autoWidth = TRUE,
    order = list(5, 'desc'),
    columnDefs = list(
      list(className = 'dt-left', targets = 0)
    ),
    pageLength = 10,
    fillContainer = TRUE
  )
)

frameWidget(dt, width = 750, height = 500)
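To make the join and filter logic concrete, here is a small base R sketch of the same steps, using `merge(..., all.x = TRUE)` in place of `left_join()`. The vote counts are the two areas printed earlier; the coordinates are invented, and South Holland is deliberately left out of the coordinate table to show how `complete.cases()` removes unmatched rows.

```r
# Vote counts for the two areas shown earlier on this page
toyVotes <- data.frame(
  areaName    = c("Boston", "South Holland"),
  remainVotes = c(7430, 13074),
  leaveVotes  = c(22974, 36423),
  stringsAsFactors = FALSE
)

# Invented coordinates; South Holland is omitted on purpose
toyCoords <- data.frame(areaName = "Boston", avLng = 0, avLat = 53,
                        stringsAsFactors = FALSE)

# Left join: keep every row of toyVotes, fill missing coordinates with NA
toyJoined <- merge(toyVotes, toyCoords, by = "areaName", all.x = TRUE)

toyJoined$size       <- toyJoined$leaveVotes + toyJoined$remainVotes
toyJoined$leaveShare <- round(toyJoined$leaveVotes / toyJoined$size, 2)

# Rows with any NA (here, the area without coordinates) are dropped
toyFinal <- toyJoined[complete.cases(toyJoined), ]

print(toyFinal)
```

Note that this filter silently discards any area whose name failed to match, which is why the label corrections need to come first.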
We’re now ready to plot the data. When we do so, some regional trends become immediately apparent.
pal <- colorNumeric(palette = "YlOrRd", domain = mapDataFinal$leaveShare)

map <- leaflet(mapDataFinal) %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(lng = -3, lat = 53.5, zoom = 6) %>%
  addCircles(
    lng = ~avLng, lat = ~avLat,
    color = ~pal(leaveShare),
    radius = ~20 * sqrt(size),
    stroke = FALSE, fillOpacity = 0.9,
    popup = ~paste0(areaName, " had ", round(100 * leaveShare, 1),
                    "% voting for Leave and ", size, " total voters")) %>%
  addLegend("topright", pal = pal, values = ~leaveShare,
            title = "% of Leave voters",
            # leaveShare is a 0-1 proportion, so show it as a percentage
            labFormat = labelFormat(suffix = "%",
                                    transform = function(x) 100 * x),
            opacity = 1)

frameWidget(map, height = 400)
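A note on the `sqrt()` in the circle radius: a circle's area grows with the square of its radius, so taking the square root of the vote count makes each circle's area, rather than its width, proportional to the number of votes. The short sketch below checks this with two invented vote totals.

```r
# Invented vote totals: the second area has four times as many votes
size <- c(10000, 40000)

# Same scaling as the map code: radius in metres
radius <- 20 * sqrt(size)

# Area of each circle
area <- pi * radius^2

print(radius)             # the radius only doubles
print(area[2] / area[1])  # but the area is four times larger
```

Scaling the radius directly by `size` would instead make the larger authority's circle sixteen times the area, visually overstating the difference.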
And that’s it! Hopefully, this page has given you a quick appreciation of the variety of techniques that you can fruitfully employ in R.