Poynter did a nice interactive piece on world population by income (i.e. “How Many Live on How Much, and Where”). I’m always on the lookout for optimized shapefiles and clean data (I’m teaching a data science certificate program starting this Fall) and the speed of the site load and the easy availability of the data set made this one a “must acquire”. Rather than just repeat Poynter’s D3-goodness, here’s a way to look at the income data in series of small multiple choropleths—using R & ggplot2—that involves:
- downloading data & shapefiles from a web site
- using dplyr & tidyr for data munging
- applying custom fill color scale mapping in ggplot
- ordering plots with a custom facet order (using factors)
- tweaking the theme and aesthetics for a nicely finished result
By using D3, Poynter inherently made the data available. Pop open the “Developer Tools” in any browser, reload the page and look at the “Network” tab and you’ll see a list of files (you can sometimes see things in the source code, but this technique is often faster). The income data is a well-formed CSV file http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv and their highly optimized world map was also easy to discern http://www.pewglobal.org/wp-content/lib/js/world-geo.json. We’ll start by grabbing the map and using the same map projection that Poynter did (Robinson). Don’t be put off by all the library calls since one of the best parts of R is the ever-increasing repository of great packages to help you get things done.
| library(httr) # getting datalibrary(rgdal) # working with shapefilelibrary(dplyr) # awesome data manipulationlibrary(readr) # faster reading of CSV datalibrary(stringi) # string manipulationlibrary(stringr) # string manipulationlibrary(tidyr) # reshaping datalibrary(grid) # for 'unit'library(scales) # for 'percent'library(ggplot2) # plottinglibrary(ggthemes) # theme_map # this ensures you only download the shapefile once and hides# errors and warnings. remove `try` and `invisible` to see messagestry(invisible(GET("http://www.pewglobal.org/wp-content/lib/js/world-geo.json", write_disk("world-geo.json"))), silent=TRUE) # use ogrListLayers("world-geo.json") to see file type & # layer info to use in the call to readOGR world <- readOGR("world-geo.json", "OGRGeoJSON")world_wt <- spTransform(world, CRS("+proj=robin"))world_map <- fortify(world_wt) |
I would have liked to do fortify(world_wt, region="name") (since that makes working with filling in countries by name much easier in the choropleth part of the code) but that generated TopologyException errors (I’ve seen this happen quite a bit with simplified/optimized shapefiles and some non-D3 geo-packages). One can sometimes fix those with a strategic rgeos::gBuffer call, but that didn’t work well in this case. We can still use country names with a slight rejiggering of the fortified data frame using dplyr:
| world_map %>% left_join(data_frame(id=rownames(world@data), name=world@data$name)) %>% select(-id) %>% rename(id=name) -> world_map |
Now it’s time to get the data. The CSV file has annoying spaces in it that causes R to interpret all the columns as strings, so we can use dplyr again to get them into the format we want them in. Note that I’m also making the percentages decimals so we can usepercent later on to easily format them.
| # a good exercise would be to repeat the download code above # rather and make repeated calls to an external resourceread_csv("http://www.pewglobal.org/wp-content/themes/pew-global/interactive-global-class.csv") %>% mutate_each(funs(str_trim)) %>% filter(id != "None") %>% mutate_each(funs(as.numeric(.)/100), -name, -id) -> dat |
For this post, we’ll only be working with the actual share percentages, so let’s:
- ignore the “change” columns
- convert the data frame from wide to long
- extract out the income levels (e.g. “Poor”, “Low Income”…)
- set a factor order for them so our plots will be in the correct sequence
| dat %>% gather(share, value, starts_with("Share"), -name, -id) %>% select(-starts_with("Change")) %>% mutate(label=factor(stri_trans_totitle(str_match(share, "Share ([[:alpha:]- ]+),")[,2]), c("Poor", "Low Income", "Middle Income", "Upper-Middle Income", "High Income"), ordered=TRUE)) -> share_dat |