Posted by ReneeBK on 2015-2-6 04:34:03

The Split-Apply-Combine Strategy using R

Jaynal Abedin

We often need to perform the same type of operation on different subgroups of a dataset, such as group-wise summarization, standardization, or statistical modeling. Such tasks require us to break a big problem into manageable pieces, operate on each piece separately, and finally combine the results from all pieces into a single output. To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, popularized by Google. In map-reduce, the map step corresponds to split and apply, and the reduce step corresponds to combine. The map-reduce approach is designed primarily for highly parallel environments, where the work is done independently by hundreds or thousands of computers.
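
The correspondence with map-reduce can be sketched on a single machine in base R. This is only an illustration: `Map()` and `Reduce()` stand in for the two phases, nothing runs in parallel, and the names `chunks`, `partial_sums`, and `total` are introduced here for the sketch.

```r
# Split: cut the numbers 1..100 into four independent chunks of 25
chunks <- split(1:100, rep(1:4, each = 25))

# "Map" phase: the same function is applied to every chunk separately
partial_sums <- Map(sum, chunks)

# "Reduce" phase: the per-chunk results are folded into one value
total <- Reduce(`+`, partial_sums)
total
```

Because no chunk needs information from any other chunk, each call to `sum()` could in principle run on a different machine.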

The split-apply-combine strategy lets us see the similarities between problems across subgroups that previously seemed unconnected. The strategy is already built into many existing tools, such as the BY statement in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator.
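
Base R has a comparable built-in tool in `aggregate()`. The sketch below uses R's built-in iris data (introduced in the next paragraph); the name `species_means` is an assumption made for this example.

```r
# Group-wise means, roughly analogous to
#   SELECT Species, AVG(...) FROM iris GROUP BY Species
# The formula ". ~ Species" means: summarize every other column by Species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

The result is a data frame with one row per species and one column per measurement, much like the output of a GROUP BY query.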

To explain the split-apply-combine strategy, we will use Fisher's iris data. This dataset contains measurements, in centimeters, of four variables (sepal length, sepal width, petal length, and petal width) for 50 flowers from each of three species of iris: Iris setosa, Iris versicolor, and Iris virginica. We want to calculate the mean of each variable for each species separately. This can be done in different ways, with or without a loop.
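
Before splitting, it can help to confirm the layout of the data. A quick check, assuming the `iris` data frame that ships with R (the name `species_counts` is introduced here):

```r
# Four numeric measurement columns plus the Species factor
str(iris)

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
species_counts
```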

Split-apply-combine without a loop

In this section, we will see an example of the split-apply-combine strategy without using a loop. The steps are as follows:

1. Split the iris dataset into three parts.
2. Remove the species name variable from the data.
3. Calculate the mean of each variable for the three different parts separately.
4. Combine the output into a single table (a matrix with one row per species).

The code for this is as follows:

# Note that a negative 5 is used in the split step. It discards the
# fifth column of the iris data, which holds the "Species" information;
# we do not need that column to calculate the means.
iris.set <- iris[iris$Species == "setosa", -5]
iris.versi <- iris[iris$Species == "versicolor", -5]
iris.virg <- iris[iris$Species == "virginica", -5]
# calculating the mean for each piece (the apply step)
mean.set <- colMeans(iris.set)
mean.versi <- colMeans(iris.versi)
mean.virg <- colMeans(iris.virg)
# combining the output (the combine step)
mean.iris <- rbind(mean.set, mean.versi, mean.virg)
# giving row names so that the output can be easily understood
rownames(mean.iris) <- c("setosa", "versicolor", "virginica")
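
The same three steps can also be written without naming each species by hand. The following is a sketch using the base R functions `split()`, `lapply()`, and `do.call()`; the object names `pieces`, `piece_means`, and `mean.iris.split` are assumptions made for this example and do not come from the original text.

```r
# Split: one data frame per species, with the Species column dropped
pieces <- split(iris[, -5], iris$Species)

# Apply: column means computed separately for each piece
piece_means <- lapply(pieces, colMeans)

# Combine: stack the per-species vectors into one matrix;
# the row names are taken from the names of the list
mean.iris.split <- do.call(rbind, piece_means)
mean.iris.split
```

This version would keep working unchanged if the data contained a different number of species.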


Split-apply-combine with a loop

The following example calculates the same statistics as in the previous section, but this time using a loop. The steps are similar but the code is different. In each iteration, we split off the data for one species, calculate the mean of each variable, and append the result to the combined output, as shown in the following code:

# split-apply-combine using loop
# each iteration represents split
# mean calculation within each iteration represents apply step
# rbind command in each iteration represents combine step
mean.iris.loop <- NULL
for (species in unique(iris$Species)) {
  iris_sub <- iris[iris$Species == species, ]
  column_means <- colMeans(iris_sub[, -5])
  mean.iris.loop <- rbind(mean.iris.loop, column_means)
}
# giving row names so that the output can be easily understood
rownames(mean.iris.loop) <- as.character(unique(iris$Species))
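
Growing an object with `rbind()` inside a loop copies it on every iteration. For larger problems, a common alternative is to allocate the result first and fill it row by row. The sketch below is one way to do that; the names `species_levels`, `measure_cols`, and `mean.iris.prealloc` are hypothetical and introduced only for this example.

```r
species_levels <- levels(iris$Species)
measure_cols <- names(iris)[-5]   # the four numeric columns

# Allocate the full result up front, with row and column names in place
mean.iris.prealloc <- matrix(
  NA_real_,
  nrow = length(species_levels),
  ncol = length(measure_cols),
  dimnames = list(species_levels, measure_cols)
)

for (sp in species_levels) {
  # split and apply for one species, then write into the matching row
  mean.iris.prealloc[sp, ] <- colMeans(iris[iris$Species == sp, measure_cols])
}
mean.iris.prealloc
```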


An important point about the split-apply-combine strategy is that each piece must be independent of the others. If the calculation on one piece depends on another piece, the strategy will not work. It applies only when the big problem can be broken into smaller, manageable pieces and the desired operation can be performed on each piece independently. A running average, for example, is not suitable, because the current average depends on the previous one; for such calculations we can use a loop instead. If processing speed is a concern, we can write that code in a lower-level language such as C or Fortran.
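
To make the dependence concrete, here is a minimal sketch of a running average computed with a loop. The input vector `x` and the name `running_mean` are made up for illustration.

```r
x <- c(2, 4, 6, 8)

running_mean <- numeric(length(x))
running_mean[1] <- x[1]

for (i in seq_along(x)[-1]) {
  # Each value is built from the previous one, so the iterations
  # cannot be handled as independent pieces:
  #   new mean = old mean + (new value - old mean) / i
  running_mean[i] <- running_mean[i - 1] + (x[i] - running_mean[i - 1]) / i
}
running_mean
```

The result can be checked against `cumsum(x) / seq_along(x)`, which gives the same values.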

Reference

Data Manipulation with R

By: Jaynal Abedin
Publisher: Packt Publishing
Pub. Date: January 15, 2014
