Original poster: oliyiyi

Don’t use stats::aggregate()

oliyiyi posted on 2015-11-2 08:08:08

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example.

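For our example we create a data frame. The situation is: I am working in the Pacific time zone on Saturday, October 31st, 2015, and I have some time data that I want to work with that is in an Asian time zone. (A caveat on the zone name used in the code: under the POSIX sign convention, "Etc/GMT+8" is 8 hours behind UTC, not ahead of it; a UTC+8 zone would be written "Etc/GMT-8". The point about losing the time zone is the same either way.)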

print(date())
## [1] "Sat Oct 31 08:14:38 2015"
d <- data.frame(group='x',
                time=as.POSIXct(strptime('2006/10/01 09:00:00',
                                         format='%Y/%m/%d %H:%M:%S',
                                         tz="Etc/GMT+8"),
                                tz="Etc/GMT+8"))  # I'd like to say UTC+8 or CST
print(d)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d$time)
## [1] "2006-10-01 09:00:00 GMT+8"
str(d$time)
##  POSIXct[1:1], format: "2006-10-01 09:00:00"
print(unclass(d$time))
## [1] 1159722000
## attr(,"tzone")
## [1] "Etc/GMT+8"

Suppose I try to aggregate the data to find the earliest time for each group. I have a problem: aggregate() loses the time zone and gives a bad answer.

d2 <- aggregate(time~group,data=d,FUN=min)
print(d2)
##   group                time
## 1     x 2006-10-01 10:00:00
print(d2$time)
## [1] "2006-10-01 10:00:00 PDT"

This is bad. Our time has lost its time zone and changed from 09:00:00 to 10:00:00. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.

Software for Data Analysis, John M. Chambers, Springer 2008, page 3.

The issue is that a POSIXct time is essentially a numeric array carrying its time zone around as an attribute. Most base R code has problems if there are extra attributes on a numeric array, so R stats code tends to have a habit of dropping attributes when it can. It is odd that the class() is kept (which is itself an attribute-style structure) while the time zone is lost, but R is full of hand-specified corner cases.
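To make the "number plus attributes" description concrete, here is a minimal sketch (not from the original post; the variable names t1 and t2 are illustrative). It shows that removing only the tzone attribute leaves the stored number untouched while changing how the value is displayed.

```r
# A POSIXct value is a double (seconds since 1970-01-01 UTC) plus attributes.
t1 <- as.POSIXct('2006-10-01 09:00:00', tz = 'Etc/GMT+8')
print(typeof(t1))       # storage type of the underlying vector
print(attributes(t1))   # the class and tzone attributes

# Strip only the time zone attribute: the instant (the number) is unchanged,
# but printing now falls back to the session's local time zone.
t2 <- t1
attr(t2, 'tzone') <- NULL
same_instant <- as.numeric(t1) == as.numeric(t2)
print(same_instant)

print(format(t1, usetz = TRUE))
print(format(t2, usetz = TRUE))   # result depends on the machine's time zone
```

The last two lines describe the same instant, which is why a dropped attribute shows up as a shifted clock time rather than as an error.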

dplyr gets the right answer.

library('dplyr')
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
##     filter
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
by_group = group_by(d,group)
d3 <- summarize(by_group,min(time))
print(d3)
## Source: local data frame [1 x 2]
##
##   group           min(time)
## 1     x 2006-10-01 09:00:00
print(d3[[2]])
## [1] "2006-10-01 09:00:00 GMT+8"

And plyr also works.

library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
d4 <- ddply(d,.(group),summarize,time=min(time))
print(d4)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d4$time)
## [1] "2006-10-01 09:00:00 GMT+8"
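
If staying in base R is a requirement, one possible workaround (not from the original post) is to copy the time zone attribute back after calling aggregate(). This is only a sketch, and it assumes every value in the time column shares a single time zone. How aggregate() treats the attribute may also differ between R versions; where the attribute is already preserved, the extra line simply has no effect.

```r
# Same one-row data frame as in the post.
d <- data.frame(group = 'x',
                time = as.POSIXct('2006-10-01 09:00:00', tz = 'Etc/GMT+8'))

d2 <- aggregate(time ~ group, data = d, FUN = min)

# Assumption: all values in d$time share one time zone, so copying the
# attribute back is safe. The stored number of seconds is not changed,
# only the zone used for display.
attr(d2$time, 'tzone') <- attr(d$time, 'tzone')

print(format(d2$time, usetz = TRUE))
```

This keeps the computation in base R but relies on remembering the extra step every time, which is the kind of corner case the post argues against.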

Reply 1
seahhj posted on 2015-11-2 08:23:29
Thanks for sharing

Reply 2
icyjunjin posted on 2015-11-2 08:30:17
......

Reply 3
rrjj101022 posted on 2015-11-2 21:32:46
Thanks for sharing~~~
