Step 5: The Data Analysis Workflow
Once you have an understanding of R’s syntax, the package ecosystem, and how to get help, it’s time to focus on how R can be useful for the most common tasks in the data analysis workflow
5.1 Importing Data
Before you can start performing analysis, you first need to get your data into R. The good thing is that you can import into R all sorts of data formats, the hard part this is that different types often need a different approach:
If you want to learn more on how to import data into R check an online Importing Data into R tutorial or this post on data importing.
5.2 Data Manipulation
Performing data manipulation with R is a broad topic as you can see in for example this Data Wrangling with R video by RStudio or the book Data Manipulation with R. This is a list of packages in R that you should master when performing data manipulations:
- The tidyr package for tidying your data.
- The stringr package for string manipulation.
- When working with data frame like objects it is best to make yourself familiar with the dplyr package (try this course). However. in case of heavy data wrangling tasks, it makes more sense to check out the blazingly fast data.table package (see this syntax cheatsheet for help).
- When working with times and dates install the lubridate package which makes it a bit easier to work with these.
- Packages likezoo, xts and quantmod offer great support for time series analysis in R.
5.3 Data Visualization
One of the main reasons R is the favorite tool of data analysts and scientists is because of its data visualization capabilities. Tons of beautiful plots are created with R as shown by all the posts on FlowingData, such as this famous facebook visualization:
If you want to get started with visualizations in R, take some time to study the ggplot2 package. One of the (if not the) most famous packages in R for creating graphs and plots. ggplot2 is makes intensive use of the grammar of graphics, and as a result is very intuitive in usage (you’re continuously building part of your graphs so it’s a bit like playing with lego). There are tons of resources to get your started such as this interactive coding tutorial, a cheatsheet and an upcoming book by Hadley Wickham.
Besides ggplot2 there are multiple other packages that allow you to create highly engaging graphics and that have good learning resources to get you up to speed. Some of our favourites are:
If you want to see more packages for visualizations see the CRAN task view. In case you run into issues plotting your data this post might help as well.
Next to the “traditional” graphs, R is able to handle and visualize spatial data as well. You can easily visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with a package such as ggmap. Another great package is choroplethr developed by Ari Lamstein of Trulia or the tmap package. Take this tutorial on Introduction to visualising spatial data in R if you want to learn more.
5.4 The stats part
In case you are new to statistics, there are some very solid sources that explain the basic concepts while making use of R:
Note that these resources are aimed at beginners. If you want to go more advanced you can look at the multiple resources there are for machine learning with R. Books such as Mastering Machine Learning with R and Machine Learning with R explain the different concepts very well, and online resources like the Kaggle Machine Learning course help you practice the different concepts. Furthermore there are some very interesting blogs to kickstart your ML knowledge like Machine Learning Mastery orthis post.
5.5 Reporting your results
One of the best way to share your models, visualizations, etc is through dynamic documents. R Markdown (based on knitr and pandoc) is a great tool for reporting your data analysis in a reproducible manner though html, word, pdf, ioslides, etc. This 4 hour tutorial on Reporting with R Markdown explains the basics of R markdown. Once you are creating your own markdown documents, make sure this cheat sheet is on your desk.