In this recipe, we discuss two ways to subset data. The first approach uses the row and column indices/names, and the other uses the subset() function.
Getting ready
Download the files for this chapter and store the auto-mpg.csv file in your R working directory. Read the data using the following command:
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
The same subsetting principles apply for vectors, lists, arrays, matrices, and data frames. We illustrate with data frames.
How to do it...
The following steps extract a subset of a dataset:
Index by position. Get model_year and car_name for the first three cars:
> auto[1:3, 8:9]
> auto[1:3, c(8,9)]
Index by name. Get model_year and car_name for the first three cars:
> auto[1:3,c("model_year", "car_name")]
Retrieve all details for cars with the highest or lowest mpg, using the following code:
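A minimal sketch of these steps, including the subset() approach mentioned in the introduction. The small cars data frame is a made-up stand-in for auto-mpg.csv (its values and column order are assumptions), so the column positions differ from those used above. It shows one plausible way to select the extreme-mpg rows, not necessarily the recipe's own code.

```r
# Hypothetical stand-in for the auto data; values and column order are made up.
cars <- data.frame(
  mpg        = c(18, 9, 36, 27, 46.6),
  cylinders  = c(8, 8, 4, 4, 4),
  model_year = c(70, 70, 82, 82, 80),
  car_name   = c("car a", "car b", "car c", "car d", "car e"),
  stringsAsFactors = FALSE
)

# Index by position and by name: both select the same rows and columns.
by.position <- cars[1:3, 3:4]
by.name     <- cars[1:3, c("model_year", "car_name")]
all.equal(by.position, by.name)

# Logical indexing: keep rows whose mpg equals the maximum or the minimum.
extremes <- cars[cars$mpg == max(cars$mpg) | cars$mpg == min(cars$mpg), ]

# The same selection with subset(); columns are referred to by bare name.
extremes.sub <- subset(cars, mpg == max(mpg) | mpg == min(mpg))

# subset() can also restrict the columns that are returned.
name.mpg <- subset(cars, mpg == max(mpg) | mpg == min(mpg),
                   select = c(car_name, mpg))
```

Logical indexing works on any vector or data frame, while subset() is a convenience intended mainly for interactive use because it evaluates column names inside the data frame.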
When we have categorical variables, we often want to create a group for each level and analyze the groups separately to reveal similarities and differences between them.
The split() function divides data into groups based on a factor or vector. The unsplit() function reverses the effect of split().
Getting ready
Download the files for this chapter and store the auto-mpg.csv file in your R working directory. Read the file using the read.csv command and save in the auto variable:
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
How to do it...
Split the data by the number of cylinders using the following command:
> carslist <- split(auto, auto$cylinders)
How it works...
The split(auto, auto$cylinders) call returns a list of data frames, with each data frame containing the cases for one level of cylinders. To extract a data frame from the list, use the [[ notation rather than [. Here, carslist[1] is a list of length 1 whose single element is the data frame for three-cylinder cars, whereas carslist[[1]] is that data frame itself.
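The difference between the two bracket forms, and the round trip through unsplit(), can be checked on a small example. This is a sketch under assumptions: the cars data frame below is made up and stands in for auto-mpg.csv.

```r
# Hypothetical stand-in for the auto data; values are made up.
cars <- data.frame(
  mpg       = c(18, 15, 36, 27, 19),
  cylinders = c(8, 8, 4, 4, 3)
)

# One data frame per distinct value of cylinders.
carslist <- split(cars, cars$cylinders)

names(carslist)       # list names are the group values, stored as strings
class(carslist[1])    # single brackets return a list of length 1
class(carslist[[1]])  # double brackets return the data frame itself
carslist[["4"]]       # elements can also be selected by group name

# unsplit() puts the rows back in their original order,
# given the same grouping vector that was used for split().
restored <- unsplit(carslist, cars$cylinders)
```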
Analysts need an unbiased evaluation of the quality of their machine learning models. To get this, they partition the available data into two parts. They use one part to build the machine learning model and retain the remaining data as "hold-out" data. After building the model, they evaluate the model's performance on the hold-out data. This recipe shows you how to partition data. It separately addresses the situation when the target variable is numeric and when it is categorical. It also covers the process of creating two partitions or three.
Getting ready
If you have not already done so, make sure that the BostonHousing.csv and boston-housing-classification.csv files from the code files of this chapter are in your R working directory. Then install and load the caret package and read the data using the following commands:
> install.packages("caret")
> library(caret)
> bh <- read.csv("BostonHousing.csv")
How to do it...
You may want to develop a model using some machine learning technique (such as linear regression or KNN) to predict the median value of homes in Boston neighborhoods using the data in the BostonHousing.csv file. The MEDV variable will serve as the target variable.
Case 1 – numerical target variable and two partitions
To create a training partition with 80 percent of the cases and a validation partition with the rest, use the following code:
> trg.idx <- createDataPartition(bh$MEDV, p = 0.8, list = FALSE)
> trg.part <- bh[trg.idx, ]
> val.part <- bh[-trg.idx, ]
After this, the trg.part and val.part variables contain the training and validation partitions, respectively.
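The indexing pattern used here (keep the sampled rows, then drop them with a negative index) can be tried without caret. The sketch below is a stand-in under stated assumptions: it simulates a small data frame instead of reading BostonHousing.csv, and it uses base R's sample(), which, unlike createDataPartition(), makes no attempt to balance the distribution of MEDV across the partitions.

```r
set.seed(42)  # make the random split reproducible

# Simulated stand-in for the Boston housing data; values are made up.
bh <- data.frame(
  MEDV = rnorm(100, mean = 22, sd = 9),
  RM   = rnorm(100, mean = 6, sd = 0.7)
)

# Simple random sample of 80 percent of the row numbers (no stratification).
trg.idx <- sample(nrow(bh), size = round(0.8 * nrow(bh)))

trg.part <- bh[trg.idx, ]   # rows selected for training
val.part <- bh[-trg.idx, ]  # every row that was not selected

# The two partitions should not share rows and should cover the data.
nrow(trg.part)
nrow(val.part)
length(intersect(rownames(trg.part), rownames(val.part)))
```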
Case 2 – numerical target variable and three partitions
Some machine learning techniques require three partitions because they use two partitions just for building the model. Therefore, a third (test) partition contains the "hold-out" data for model evaluation.
To create a training partition with 70 percent of the cases and divide the rest equally between validation and test partitions, use the following commands:
> trg.idx <- createDataPartition(bh$MEDV, p = 0.7, list = FALSE)
> trg.part <- bh[trg.idx, ]
> temp <- bh[-trg.idx, ]
> val.idx <- createDataPartition(temp$MEDV, p = 0.5, list = FALSE)
> val.part <- temp[val.idx, ]
> test.part <- temp[-val.idx, ]
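As an alternative sketch, and not the recipe's method, the three partitions can also be produced in one pass by giving every row a randomly shuffled partition label and passing the labels to split(). The data frame is simulated, the label names trg, val, and test are arbitrary choices, and the split is a simple random one with no stratification on MEDV.

```r
set.seed(7)  # make the random assignment reproducible

# Simulated stand-in for the Boston housing data; values are made up.
bh <- data.frame(
  MEDV = rnorm(200, mean = 22, sd = 9),
  RM   = rnorm(200, mean = 6, sd = 0.7)
)

# Partition sizes: 70 percent training, the rest shared between val and test.
n      <- nrow(bh)
n.trg  <- round(0.70 * n)
n.val  <- round(0.15 * n)
n.test <- n - n.trg - n.val

# One label per row, in random order.
part.label <- sample(rep(c("trg", "val", "test"),
                         times = c(n.trg, n.val, n.test)))

# split() returns a named list with one data frame per label.
parts <- split(bh, part.label)
sapply(parts, nrow)
```

Each partition is then available by name, for example parts$trg or parts[["test"]].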
Case 3 – categorical target variable and two partitions
Instead of a model to predict a numerical value like MEDV, you may need to create partitions for a classification application. The boston-housing-classification.csv file has a MEDV_CAT variable that categorizes the median values into HIGH or LOW and is suitable for a classification algorithm.
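With a categorical target, the usual goal is for each partition to keep roughly the same class proportions as the full data, which is achieved by sampling within each class. A base R sketch of that idea follows. It uses simulated stand-in data rather than boston-housing-classification.csv, and it illustrates the principle only; it is not caret's implementation.

```r
set.seed(123)  # make the random split reproducible

# Simulated stand-in with an unbalanced two-level target; values are made up.
bh <- data.frame(
  MEDV_CAT = rep(c("HIGH", "LOW"), times = c(30, 70)),
  RM       = rnorm(100, mean = 6, sd = 0.7),
  stringsAsFactors = FALSE
)

# Row numbers grouped by class.
idx.by.class <- split(seq_len(nrow(bh)), bh$MEDV_CAT)

# Sample 80 percent of the row numbers within each class.
trg.idx <- unlist(
  lapply(idx.by.class, function(i) {
    i[sample.int(length(i), size = round(0.8 * length(i)))]
  }),
  use.names = FALSE
)

trg.part <- bh[trg.idx, ]
val.part <- bh[-trg.idx, ]

# Class proportions should match across the two partitions.
prop.table(table(trg.part$MEDV_CAT))
prop.table(table(val.part$MEDV_CAT))
```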
Generating standard plots such as histograms, boxplots, and scatterplots
Before even embarking on any numerical analyses, you may want to get a good idea about the data through a few quick plots. Although the base R system supports powerful graphics, we will generally turn to other plotting options like lattice and ggplot for more advanced plots. Therefore, we cover only the simplest forms of basic graphs.
Getting ready
If you have not already done so, download the data files for this chapter and ensure that they are available in your R environment's working directory and run the following commands:
We often want to see plots side by side for comparisons. This recipe shows how we can achieve this.
Getting ready
If you have not already done so, download the data files for this chapter and ensure that they are available in your R environment's working directory. Once this is done, run the following commands:
R can send its output to several different graphic devices to display graphics in different formats. By default, R prints to the screen. However, we can save graphs in the following file formats as well: PostScript, PDF, PNG, JPEG, Windows metafile, Windows BMP, and so on.
Getting ready
If you have not already done so, download the data files for this chapter and ensure that the auto-mpg.csv file is available in your R environment's working directory and run the following commands:
To send the graphic output to the computer screen, you have to do nothing special. For other devices, you first open the device, send your graphical output to it, and then close the device to close the corresponding file.
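The open, plot, close sequence can be sketched with the PDF device. The output path and the mpg values below are assumptions made for illustration; other file devices such as png(), jpeg(), and postscript() follow the same pattern.

```r
# Hypothetical output location inside the session's temporary directory.
out.file <- file.path(tempdir(), "mpg-hist.pdf")

# Made-up values standing in for the mpg column of the auto data.
mpg <- c(18, 15, 36, 27, 19, 24, 31, 22)

pdf(out.file)            # 1. open the PDF device
hist(mpg, main = "mpg")  # 2. send graphical output to it
dev.off()                # 3. close the device, which finalizes the file

file.exists(out.file)
```

Until dev.off() is called, the file may be incomplete, so closing the device is part of the workflow rather than an optional cleanup step.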