Original poster: ReneeBK

The Split-Apply-Combine Strategy using R

ReneeBK posted on 2015-2-6 04:34:03

Jaynal Abedin



We often need to perform the same operation on different subgroups of a dataset, such as group-wise summarization, standardization, or statistical modeling. Such tasks require us to break a big problem into manageable pieces, operate on each piece separately, and finally combine the per-piece results into a single output. For intuition, we can compare the split-apply-combine strategy with the map-reduce strategy for processing large amounts of data, popularized by Google. In map-reduce, the map step corresponds to split and apply, and the reduce step corresponds to combine. The map-reduce approach is designed primarily for highly parallel environments, where the work is done independently by hundreds or thousands of computers.
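The analogy can be sketched in base R with the functional helpers Map() and Reduce(). This is only an illustration on a single machine, not a distributed map-reduce system; the object names pieces, mapped, and reduced are hypothetical, and the built-in iris data is used as the example dataset.

```r
# Split: one data frame per species (the Species column itself is dropped).
pieces <- split(iris[, -5], iris$Species)

# "Map" step: compute the column means of every piece independently.
mapped <- Map(colMeans, pieces)

# "Reduce" step: fold the per-piece results into a single matrix.
reduced <- Reduce(rbind, mapped)
rownames(reduced) <- names(mapped)

print(reduced)
```

Because each call to colMeans() touches only its own piece, the map step could in principle run on separate workers.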

The split-apply-combine strategy makes it easier to see the similarities between problems across subgroups that previously seemed unconnected. The same strategy appears in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator.
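For readers who think in terms of SQL's GROUP BY, one rough base R counterpart is aggregate() with a formula. The sketch below assumes the built-in iris data; the name species_means is hypothetical.

```r
# Roughly: SELECT Species, AVG(...) FROM iris GROUP BY Species
# The "." on the left-hand side stands for every column other than Species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

print(species_means)
```

Unlike a manual split, aggregate() returns a data frame that keeps the grouping variable as its first column.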

To explain the split-apply-combine strategy, we will use Fisher's iris data. This dataset contains measurements, in centimeters, of four variables (sepal length, sepal width, petal length, and petal width) for 50 flowers from each of three species of iris: Iris setosa, Iris versicolor, and Iris virginica. We want to calculate the mean of each variable for each species separately. This can be done with or without a loop.
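Before splitting, it can help to confirm the layout described above. A quick check on the iris data frame that ships with R (the name species_counts is hypothetical):

```r
# Four numeric measurement columns followed by the Species factor.
str(iris)

# Number of flowers recorded for each species.
species_counts <- table(iris$Species)
print(species_counts)
```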


Split-apply-combine without a loop

In this section, we will see an example of the split-apply-combine strategy without using a loop. The steps are as follows:

  • Split the iris dataset into three parts.
  • Remove the species name variable from the data.
  • Calculate the mean of each variable for the three different parts separately.
  • Combine the output into a single object (here, a matrix with one row per species).

The code for this is as follows:

  # Split step: keep the rows of one species at a time. The -5 discards
  # the fifth column of the iris data, which holds the species name and
  # is not needed to calculate the means.
  iris.set <- iris[iris$Species == "setosa", -5]
  iris.versi <- iris[iris$Species == "versicolor", -5]
  iris.virg <- iris[iris$Species == "virginica", -5]

  # Apply step: calculate the column means for each piece
  mean.set <- colMeans(iris.set)
  mean.versi <- colMeans(iris.versi)
  mean.virg <- colMeans(iris.virg)

  # Combine step: stack the three results into one matrix
  mean.iris <- rbind(mean.set, mean.versi, mean.virg)

  # Give row names so that the output can be easily understood
  rownames(mean.iris) <- c("setosa", "versicolor", "virginica")
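The same three steps can be written more compactly with split() and sapply(), so that the code does not need one line per species. This is an alternative sketch rather than the author's code; the names pieces and mean.iris.split are hypothetical.

```r
# Split: a named list with one data frame per species, Species column dropped.
pieces <- split(iris[, -5], iris$Species)

# Apply and combine: sapply() runs colMeans() on every piece and binds the
# results column-wise; t() transposes so that each row is one species.
mean.iris.split <- t(sapply(pieces, colMeans))

print(mean.iris.split)
```

The row names come from the factor levels automatically, so this version keeps working if the data contains more groups.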

Split-apply-combine with a loop

The following example calculates the same statistics as in the previous section, but this time using a loop. The steps are similar but the code is different. In each iteration, we split off the data for one species, calculate the mean of each variable, and append the result to a single matrix, as shown in the following code:

  # Split-apply-combine using a loop:
  # each iteration selects one species (split),
  # the mean calculation within each iteration is the apply step,
  # and the rbind call in each iteration is the combine step

  mean.iris.loop <- NULL
  for (species in unique(iris$Species)) {
    iris_sub <- iris[iris$Species == species, ]
    column_means <- colMeans(iris_sub[, -5])
    mean.iris.loop <- rbind(mean.iris.loop, column_means)
  }

  # Give row names so that the output can be easily understood
  rownames(mean.iris.loop) <- as.character(unique(iris$Species))
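Growing a result with rbind() inside a loop copies the accumulated matrix on every iteration. A variant that allocates the result once and fills it row by row is sketched below; the names species_levels, measure_cols, and mean.iris.prealloc are hypothetical.

```r
species_levels <- levels(iris$Species)
measure_cols <- names(iris)[-5]

# Allocate the full result up front: one row per species, one column
# per measurement, labeled through dimnames.
mean.iris.prealloc <- matrix(
  NA_real_,
  nrow = length(species_levels),
  ncol = length(measure_cols),
  dimnames = list(species_levels, measure_cols)
)

for (sp in species_levels) {
  # Split and apply for one species, then write into its own row (combine).
  mean.iris.prealloc[sp, ] <- colMeans(iris[iris$Species == sp, measure_cols])
}

print(mean.iris.prealloc)
```

With only three groups the difference is negligible; the pattern matters mainly when the number of groups is large.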

An important point about the split-apply-combine strategy is that each piece must be independent of the others. If the calculation on one piece depends on another piece, the strategy will not work. It applies only when the big problem can be broken into smaller, manageable pieces and the desired operation can be performed on each piece independently. A running average is a typical counterexample, because the current average depends on the previous one; for such calculations we can use a loop instead. If processing speed is a concern, we can write the code in a lower-level language such as C or Fortran.
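To make the dependence concrete, here is a minimal sketch of a running average computed with a loop, in which each value is derived from the previous one. The vector x and the names running_mean and running_mean_vec are hypothetical; the sepal lengths from iris are used only as sample numbers.

```r
x <- iris$Sepal.Length

# Loop version: the i-th running mean is updated from the (i - 1)-th one,
# so the iterations cannot be processed as independent pieces.
running_mean <- numeric(length(x))
running_mean[1] <- x[1]
for (i in seq_along(x)[-1]) {
  running_mean[i] <- running_mean[i - 1] + (x[i] - running_mean[i - 1]) / i
}

# For this particular calculation, base R's cumsum() gives the same
# result without an explicit R-level loop.
running_mean_vec <- cumsum(x) / seq_along(x)

print(all.equal(running_mean, running_mean_vec))
```

The last element of either vector equals the overall mean of x, which is a convenient sanity check.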


Reference

Data Manipulation with R

  • By: Jaynal Abedin

  • Publisher: Packt Publishing

  • Pub. Date: January 15, 2014



