The Split-Apply-Combine Strategy using R
Jaynal Abedin
Often, we need to perform the same type of operation on different subgroups of a dataset, such as group-wise summarization, standardization, or statistical modeling. This type of task requires us to break up a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single result. To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply, and the reduce step corresponds to combine. The map-reduce approach is primarily designed for highly parallel environments, where the work is done independently by hundreds or thousands of computers.
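The analogy can be made concrete with a small sketch in R. The vectors `scores` and `group` below are made-up illustration data, not part of the original text, and base R's `Map()` and `Reduce()` only stand in for the map and reduce steps on a single machine; this is not a distributed implementation.

```r
# Hypothetical toy data: six measurements in three groups
scores <- c(4, 8, 6, 10, 2, 12)
group  <- c("a", "a", "b", "b", "c", "c")

# Split: one numeric vector per group
pieces <- split(scores, group)

# Map: apply mean() to each piece independently
mapped <- Map(mean, pieces)

# Reduce: fold the per-group results into a single vector
combined <- Reduce(c, mapped)
names(combined) <- names(mapped)
combined
```

Because each call to `mean()` sees only its own piece, the map step could in principle be run on separate machines, which is the property map-reduce exploits.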
The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were previously unconnected. The strategy already appears in many existing tools, such as the BY statement in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator.
To explain the split-apply-combine strategy, we will use Fisher's iris data. This dataset contains measurements, in centimeters, of four variables: sepal length, sepal width, petal length, and petal width, for 50 flowers from each of three species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica. We want to calculate the mean of each variable separately for each species. This can be done in different ways, with or without a loop.
Split-apply-combine without a loop
In this section, we will see an example of the split-apply-combine strategy without using a loop. The steps are as follows:
- Split the iris dataset into three parts.
- Remove the species name variable from the data.
- Calculate the mean of each variable for the three different parts separately.
- Combine the output into a single matrix.
The code for this is as follows:
    # Note the negative index (-5) in the split step: it discards the
    # fifth column of the iris data, which contains the "Species"
    # information and is not needed to calculate the means.
    iris.set <- iris[iris$Species == "setosa", -5]
    iris.versi <- iris[iris$Species == "versicolor", -5]
    iris.virg <- iris[iris$Species == "virginica", -5]
    # calculating the mean for each piece (the apply step)
    mean.set <- colMeans(iris.set)
    mean.versi <- colMeans(iris.versi)
    mean.virg <- colMeans(iris.virg)
    # combining the output (the combine step)
    mean.iris <- rbind(mean.set, mean.versi, mean.virg)
    # giving row names so that the output can be easily understood
    rownames(mean.iris) <- c("setosa", "versicolor", "virginica")
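The same three steps can also be written with base R helpers, so that the species names are never typed by hand. This is a minimal sketch rather than the author's method; the object names `pieces`, `piece_means`, and `mean.iris.split` are chosen here for illustration.

```r
# Split: a list of three data frames, one per species (Species column dropped)
pieces <- split(iris[, -5], iris$Species)

# Apply: column means of each piece
piece_means <- lapply(pieces, colMeans)

# Combine: stack the named vectors into a 3 x 4 numeric matrix;
# the row names are taken from the names of the list
mean.iris.split <- do.call(rbind, piece_means)
mean.iris.split
```

The result holds the same values as the step-by-step code above, with one row per species and one column per measurement.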
Split-apply-combine with a loop
The following example calculates the same statistics as in the previous section, but this time using a loop. The steps are similar, but the code is different. In each iteration, we split off the data for one species, calculate the mean of each variable, and combine the output into a single matrix, as shown in the following code:
    # split-apply-combine using a loop
    # subsetting one species in each iteration is the split step
    # calculating the column means within the iteration is the apply step
    # the rbind call in each iteration is the combine step
    mean.iris.loop <- NULL
    for (species in unique(iris$Species)) {
      iris_sub <- iris[iris$Species == species, ]
      column_means <- colMeans(iris_sub[, -5])
      mean.iris.loop <- rbind(mean.iris.loop, column_means)
    }
    # giving row names so that the output can be easily understood
    rownames(mean.iris.loop) <- unique(iris$Species)
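Because `rbind()` on numeric vectors yields a matrix, a different tool is needed when the result should be a data frame that keeps the species as a column. One option, sketched here as an alternative and not as the approach of the original text, is base R's `aggregate()` with a formula; the name `mean.iris.agg` is illustrative.

```r
# aggregate() splits the rows by Species, applies mean() to every
# remaining column, and combines the results into a data frame
mean.iris.agg <- aggregate(. ~ Species, data = iris, FUN = mean)
mean.iris.agg
```

The returned object has one row per species, with `Species` as the first column followed by the four group means.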
An important point about the split-apply-combine strategy is that each piece must be independent of the others. If the calculation on one piece depends on another piece, the strategy will not work. It applies only when the big problem can be broken up into smaller, manageable pieces and the desired operation can be performed on each piece independently. A running average is a typical counterexample, because the current average depends on the previous one; for such calculations we can use a loop instead. If processing speed is a concern, we can write the code in a lower-level language such as C or Fortran.
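The dependency in a running average can be seen in a short loop. This is a minimal sketch with a made-up input vector `x`; the incremental update used inside the loop is the standard identity for a cumulative mean and is not taken from the source.

```r
# Hypothetical input series
x <- c(2, 4, 6, 8)

running_mean <- numeric(length(x))
running_mean[1] <- x[1]
for (i in seq_along(x)[-1]) {
  # the i-th average is built from the (i-1)-th average,
  # so the iterations cannot be treated as independent pieces
  running_mean[i] <- running_mean[i - 1] + (x[i] - running_mean[i - 1]) / i
}
running_mean
```

For this particular calculation, `cumsum(x) / seq_along(x)` gives the same values and can serve as a check on the loop.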
Reference
Data Manipulation with R
By: Jaynal Abedin
Publisher: Packt Publishing
Pub. Date: January 15, 2014

