Package-tastic

Here’s a page that I prepared for an R training course that I ran at my company. The idea was to show that, just like Apple, R has an app for that. OK, a package for that.

I’ll show a range of tricks made possible by R, including web scraping, string manipulation and interactive mapping. And I’ll use data from the recent Brexit referendum to keep it topical.

 

Loading the packages

Let’s begin with the packages:

library(tidyverse)        # Does most of what you need
library(rvest)            # Scrapes web data (html_nodes, html_text)
library(xml2)             # Parses HTML and XML
library(htmlwidgets)      # For responsive exhibits
library(widgetframe)      # For responsive exhibits
library(DT)               # For responsive data tables
library(leaflet)          # For interactive map plots
library(maptools)         # For reading shapefiles
library(sp)               # For spatial data classes

 

Web scraping

To understand the voting in the Brexit referendum, we need some data about it, which we can scrape from the FT’s page on the subject.

(Note that I’ve commented out much of the code below so that it can be shown on this static page. I ran the commented-out code separately and saved the results as an RDS file, which I simply load here.)

# webResults <- read_html("https://ig.ft.com/sites/elections/2016/uk/eu-referendum/")
  
# webData <- webResults %>%
#   html_nodes("td:nth-child(1) , .area-state-3 .hideable") %>%
#   html_text()
  
# saveRDS(webData, "webData.rds")
  
webData <- readRDS("data/webData.rds")
head(webData)
## [1] "Boston"        "7,430"         "22,974"        "South Holland"
## [5] "13,074"        "36,423"

 

Dealing with strings

That gives us some data, but it’s in an unhelpful format: the vote counts arrive as character strings with thousands-separator commas, which R cannot convert directly to numbers. Fortunately, R makes string manipulation easy, so we can strip out those commas:

lWebData <- length(webData)
  
areaName <- webData[seq(from = 1, to = lWebData-2, by = 3)]
remainVotes <- webData[seq(from = 2, to = lWebData-1, by = 3)]
leaveVotes <- webData[seq(from = 3, to = lWebData, by = 3)]
  
remainVotes <- gsub(",([0-9])", "\\1", remainVotes)
leaveVotes <- gsub(",([0-9])", "\\1", leaveVotes)
  
resultsData <- data.frame(areaName, remainVotes, leaveVotes, stringsAsFactors = FALSE)
  
resultsData$remainVotes <- as.numeric(resultsData$remainVotes)
  
resultsData$leaveVotes <- as.numeric(resultsData$leaveVotes)
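As a quick illustration of why the comma-stripping step matters: converting a comma-formatted string straight to numeric fails, while the cleaned version parses fine. A minimal base-R sketch:

```r
# Thousands separators defeat as.numeric(): it returns NA (with a warning)
raw <- "22,974"
suppressWarnings(as.numeric(raw))
## [1] NA

# Stripping the comma first (same pattern as above) gives a clean conversion
clean <- gsub(",([0-9])", "\\1", raw)
as.numeric(clean)
## [1] 22974
```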

 

Map and boundary data

To understand this data even more, it would help to map it. The number of boundary files available for UK maps is extraordinary, and R lets us use them to map each local authority (the areas over which the referendum votes were aggregated). (As before, I ran the commented-out code beforehand and saved the RDS for use on this page.)

# localAuthorityRaw <- readShapeSpatial("Local_Authority_District_(GB)_2015_boundaries_(generalised_clipped)/LAD_DEC_2015_GB_BGC.shp", proj4string=CRS("+init=epsg:27700"))
# # (readShapeSpatial is deprecated; sf::st_read is the modern alternative)
  
# # Transform the coordinates from the British National Grid to longitude/latitude
# localAuthorityClean <- spTransform(localAuthorityRaw, CRS("+init=epsg:4326"))
  
# # Turn the spatial object into a data frame (fortify comes from ggplot2)
# localAuthorityCleanDF <- fortify(localAuthorityClean, region = "LAD15NM")
  
# saveRDS(localAuthorityCleanDF, "localAuthorityCleanDF.rds")
  
localAuthorityCleanDF <- readRDS("data/localAuthorityCleanDF.rds")

 

Understand any mislabelling

Being real-world data, though, the Local Authority names that we scraped from the web do not all match those in the boundary file. Fortunately, R can help us identify the mismatched labels:

l1 <- as.data.frame(unique(localAuthorityCleanDF$id), stringsAsFactors = F)
colnames(l1)[1] <- "locAuthID"
areaNameDF <- as.data.frame(areaName, stringsAsFactors = F)
  
l2 <- l1 %>%
  mutate(check = locAuthID %in% areaNameDF$areaName)
  
l3 <- areaNameDF %>%
  mutate(check = areaName %in% l1$locAuthID)
  
print(l2[l2$check==F,])
##                       locAuthID check
## 1                 Aberdeen City FALSE
## 40             Bristol, City of FALSE
## 71            City of Edinburgh FALSE
## 80                County Durham FALSE
## 96                  Dundee City FALSE
## 130                Glasgow City FALSE
## 152    Herefordshire, County of FALSE
## 171 Kingston upon Hull, City of FALSE
## 172        Kingston upon Thames FALSE
## 208         Newcastle upon Tyne FALSE
## 253        Richmond upon Thames FALSE
## 303                  St. Helens FALSE
print(l3[l3$check==F,])
##                 areaName check
## 24                  Hull FALSE
## 109        Herefordshire FALSE
## 134            St Helens FALSE
## 139               Durham FALSE
## 270  Newcastle-upon-Tyne FALSE
## 308     Northern Ireland FALSE
## 339               Dundee FALSE
## 345             Aberdeen FALSE
## 347 Kingston-upon-Thames FALSE
## 348              Bristol FALSE
## 360              Glasgow FALSE
## 366 Richmond-upon-Thames FALSE
## 374            Edinburgh FALSE
## 382            Gibraltar FALSE
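Incidentally, the same two-way comparison can be expressed with base R’s setdiff(), which returns the elements of its first argument that are missing from its second. A small sketch on toy name vectors (just a few illustrative areas, not the full dataset):

```r
boundaryNames <- c("Aberdeen City", "Boston", "Glasgow City")
scrapedNames  <- c("Aberdeen", "Boston", "Glasgow")

# In the boundary file but not in the scraped data
setdiff(boundaryNames, scrapedNames)
## [1] "Aberdeen City" "Glasgow City"

# Scraped but not in the boundary file
setdiff(scrapedNames, boundaryNames)
## [1] "Aberdeen" "Glasgow"
```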

Given these mismatches, we can correct the labels (and drop Northern Ireland and Gibraltar from our analysis, as the GB boundary file does not cover them).

# Given these mislabelled regions, alter the names
resultsData$areaName <- recode(
  resultsData$areaName,
  Hull = "Kingston upon Hull, City of",
  Herefordshire = "Herefordshire, County of",
  `St Helens` = "St. Helens",
  Durham = "County Durham",
  `Newcastle-upon-Tyne` = "Newcastle upon Tyne",
  Dundee = "Dundee City",
  Aberdeen = "Aberdeen City",
  `Kingston-upon-Thames` = "Kingston upon Thames",
  Bristol = "Bristol, City of",
  Glasgow = "Glasgow City",
  `Richmond-upon-Thames` = "Richmond upon Thames",
  Edinburgh = "City of Edinburgh")

# Drop NI and Gibraltar (rows 308 and 382; note the comma, so we drop rows, not columns)
mapDataSummary <- resultsData[-c(308, 382), ]

 

Calculate the centrepoint of the local authority

We could map our data using the polygon boundaries themselves. However, circles on a map let us scale the symbol size and therefore show the data in more detail. As such, we’ll use R to calculate an approximate centrepoint for each Local Authority (the median of its boundary vertices):

mapDataLngLat <- localAuthorityCleanDF %>% 
  group_by(id) %>% 
  summarise(avLat = round(median(lat), 4),
            avLng = round(median(long), 4)) %>% 
  rename(areaName = id)
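For reference, the same per-group medians can be computed in base R with aggregate(). A minimal sketch on a toy data frame of boundary vertices (the areas and coordinates are invented for illustration):

```r
toy <- data.frame(id   = c("A", "A", "A", "B", "B"),
                  long = c(-3.0, -3.2, -3.1, 0.1, 0.3),
                  lat  = c(53.0, 53.4, 53.2, 51.5, 51.7))

# Median vertex per area, as a rough centrepoint:
# area A -> (-3.1, 53.2); area B -> (0.2, 51.6)
aggregate(cbind(long, lat) ~ id, data = toy, FUN = median)
```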

 

Join and amend the data

We can then join the map data to the voting data and determine the proportion of leave votes, along with the total number of votes cast in each Local Authority. The following interactive table provides the details.

mapDataFinal <- mapDataSummary %>% 
  left_join(mapDataLngLat, by = "areaName") %>% 
  mutate(leaveShare = round(leaveVotes/(leaveVotes + remainVotes),2)) %>% 
  mutate(size = leaveVotes + remainVotes)
  
mapDataFinal <- mapDataFinal[complete.cases(mapDataFinal),]
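The left_join() above keeps every row of the voting data and attaches coordinates wherever the area names match; base R’s merge() with all.x = TRUE behaves the same way. A toy sketch (the coordinates are invented for illustration):

```r
votes  <- data.frame(areaName   = c("Boston", "South Holland"),
                     leaveVotes = c(22974, 36423))
coords <- data.frame(areaName = "Boston",
                     avLng = -0.03, avLat = 52.98)

# all.x = TRUE keeps every row of votes, like left_join()
merged <- merge(votes, coords, by = "areaName", all.x = TRUE)

# "South Holland" has no match, so its coordinates are NA;
# complete.cases() then flags that row for removal
merged[complete.cases(merged), ]
```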
  
dt <- datatable(
  mapDataFinal, 
  rownames = FALSE, 
  options = list(
    dom = 'tip', 
    autoWidth = TRUE,
    order = list(5, 'desc'), 
    columnDefs = list(
      list(
        className = 'dt-left', 
        targets = 0)
      ), 
    pageLength = 10, 
    fillContainer = T
    )
  )
  
frameWidget(dt, width = 750, height = 500)

 

Plotting the votes

We’re now ready to plot the data. When we do so, some regional trends become immediately apparent.

pal <- colorNumeric(palette = "YlOrRd", domain = mapDataFinal$leaveShare)
  
map <- leaflet(mapDataFinal) %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(lng = -3, lat = 53.5, zoom = 6) %>%
  addCircles(lng = ~avLng, 
             lat = ~avLat, 
             color = ~pal(leaveShare), 
             radius = ~20*sqrt(size), 
             stroke = FALSE, 
             fillOpacity = 0.9,
             popup = ~paste0(areaName, " had ", round(100*leaveShare, 1), "% voting for Leave and ", size, " total voters")) %>% 
  addLegend("topright", pal = pal,
            values = ~leaveShare,
            title = "% of Leave voters",
            labFormat = labelFormat(),
            opacity = 1)
  
frameWidget(map, height = 400)

 

There’s a package for that

And that’s it! Hopefully, this page has given you a quick appreciation of the variety of techniques that you can fruitfully employ in R.