Original poster: oliyiyi

Piping in R and in Pandas

In the R community, there’s this one guy, Hadley Wickham, who by himself made R great again. One of the many, many things he came up with - so many they call it the hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by letting a user take a data frame and apply a pipeline of operations to it, resulting in a desired outcome (an example in just a minute). This approach is a good match for the mental model some data scientists have, and it turned out to be successful. People have since ported key pieces to Pandas.

This is a draft. Come back later for the final version.




There’s an interesting story about how Hadley invented all those things. It goes like this. An angel - some say it was a daemon, but don’t believe them, it was an angel - visited our hero in a dream and said: “I will give you some ideas that will make you rich and famous - well, rich intellectually and famous in the R community - but there’s a catch. For reasons I won’t disclose, for a mortal like you wouldn’t really understand, you must make the piping operator as bad as you possibly can, but without rendering it outright ridiculous. No more than three chars, and remember - as ugly and hard to type as you can.”

Hadley agreed and woke up. Being the smart guy that he is, in the morning he constructed an appropriate bundle of characters. Reportedly, the thought process unfolded as follows:

Okay, three characters. Let’s invert that old <-, and elongate it: -->. Meh, waaay too pretty and you type it all with one hand.

I know… ~~> is good. ~>~ even better. One needs to press tilde key twice, then delete one, go to >, repeat, all with Shift pressed. Sweet. But dang, too pretty. I need ugly.

Think, man, think! Let’s see. #>#. Yeah. Nah, that looks half-reasonable. Wait… wait… yes… %>%. That’s it!



Image credit: Dexter’s Laboratory

And the rest is history:

carriers_db2 %>% summarise(delay = mean(arr_delay)) %>% collect()

But seriously, Hadley says that’s all because user-defined infix operators in R must have the form %something%.

In Pandas

We know of three modules for piping in Pandas: pandas-ply, dplython and dfply. People porting dplyr to Python didn’t have Hadley’s obligations, so all three use reasonable operators.

pandas-ply, from Coursera, is the simplest of them and closest to the Pandas spirit. It uses a normal dot for chaining and just adds a few methods to the DataFrame. Here’s their motivating example, copied from the dplyr intro:

grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

# instead:

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

Less typing and no intermediate artifacts. Notice how you refer to the transformed dataframe inside the pipeline by X.
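To make the X idea concrete, here is a minimal sketch of how such a placeholder could work: it records attribute lookups, calls and comparisons, and only evaluates them once real data arrives. The names Deferred and resolve are made up for this illustration, and pandas-ply's real implementation may well differ. The demo uses a plain object instead of a DataFrame so it runs without pandas.

```python
import operator
from types import SimpleNamespace


class Deferred:
    """Records an expression now; evaluates it against real data later."""

    def __init__(self, evaluate):
        self._evaluate = evaluate

    def __getattr__(self, name):
        # X.carat -> "look up .carat on whatever data arrives later"
        return Deferred(lambda data: getattr(self._evaluate(data), name))

    def __call__(self, *args, **kwargs):
        # X.cut.upper() -> "call the deferred attribute later"
        return Deferred(lambda data: self._evaluate(data)(*args, **kwargs))

    def __gt__(self, other):
        # X.carat > 4 -> "compare later"
        return Deferred(lambda data: operator.gt(self._evaluate(data), other))


def resolve(expr, data):
    """Evaluate a deferred expression against concrete data (hypothetical helper)."""
    return expr._evaluate(data)


# The placeholder itself simply stands for "the data, whatever it will be".
X = Deferred(lambda data: data)

row = SimpleNamespace(carat=4.5, cut="Ideal")
is_big = X.carat > 4             # nothing is computed yet
print(resolve(is_big, row))      # True
print(resolve(X.cut.upper(), row))  # IDEAL
```

A method like ply_where would then only need to resolve each expression against the dataframe it was called on. Only the > comparison is sketched here; a fuller version would define the other operators the same way.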

dplython is closer to dplyr. The module provides verbs (functions) similar to the R counterpart, but the pipeline operator is a handsome >>:

(diamonds >>
  sample_n(10) >>
  arrange(X.carat) >>
  select(X.carat, X.cut, X.depth, X.price))

(diamonds >>
  mutate(carat_bin=X.carat.round()) >>
  group_by(X.cut, X.carat_bin) >>
  summarize(avg_price=X.price.mean()))

By the way, what’s with the outer parens? Is this Lisp or something? We hate parens in Lisp.

Let us mention that in R, all functions are pipable. In Python, you need to make them pipable. dplython has a special decorator for it, @DelayFunction.
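What "making a function pipable" could mean in Python can be sketched with operator overloading. The decorator below is called pipable and the wrapper class Pipable; both names are invented for this example and this is not dplython's code. The trick is that Python falls back to the right operand's __rrshift__ when the left operand does not handle >> itself. The demo pipes a plain list so it has no dependencies.

```python
import functools


class Pipable:
    """Holds a function plus its extra arguments until data is piped in."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Called for `data >> step` when `data` does not implement `>>`.
        return self.func(data, *self.args, **self.kwargs)


def pipable(func):
    """Hypothetical decorator: calling the verb stores its arguments for later."""

    @functools.wraps(func)
    def make_step(*args, **kwargs):
        return Pipable(func, *args, **kwargs)

    return make_step


@pipable
def keep_if(rows, predicate):
    # Keep only the rows for which the predicate holds.
    return [row for row in rows if predicate(row)]


@pipable
def take(rows, n):
    # Keep the first n rows.
    return rows[:n]


carats = [0.3, 4.5, 1.2, 5.0, 4.1]
result = carats >> keep_if(lambda c: c > 4) >> take(2)
print(result)  # [4.5, 5.0]
```

Each verb call returns a step object rather than a result, which is why an ordinary Python function has to be wrapped before it can sit in a pipeline.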

Finally, there is dfply, inspired by dplython, but with even more functions. It appears less mature than the other two - at the time of writing, pip install won’t work here.

diamonds >> drop_endswith('e','y','z') >> head(2)

What these modules provide is mostly syntactic sugar, and whether to use them is a matter of personal taste. For example, while the pandas-ply flights example above is convincing, is one of these lines better than the others?

diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)

diamonds.ply_where(X.carat > 4).ply_select('carat', 'cut', 'depth', 'price')

diamonds[diamonds.carat > 4][['carat', 'cut', 'depth', 'price']]
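For comparison, here are a few ways to write the same filter-and-select step in plain pandas, with no extra module. The small diamonds frame is made up for the illustration and its values are not from the real dataset. Note that selecting several columns takes a list of names, hence the double brackets.

```python
import pandas as pd

# Tiny made-up stand-in for the diamonds dataset (values are illustrative only).
diamonds = pd.DataFrame({
    "carat": [0.23, 4.50, 5.01, 1.10],
    "cut": ["Ideal", "Fair", "Fair", "Premium"],
    "depth": [61.5, 65.5, 65.5, 62.0],
    "price": [326, 18531, 18018, 4500],
    "table": [55.0, 58.0, 59.0, 57.0],
})

columns = ["carat", "cut", "depth", "price"]

# Boolean mask, then a list of columns.
by_mask = diamonds[diamonds.carat > 4][columns]

# One .loc call that selects rows and columns together.
by_loc = diamonds.loc[diamonds.carat > 4, columns]

# Method chaining with the built-in query().
by_query = diamonds.query("carat > 4")[columns]

print(by_loc)
```

All three produce the same two-row result, so the choice among them, as with the piping modules, comes down to taste.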
Reply #1
Kamize (verified student), posted 2016-10-20 23:21:30 from mobile
Quoting oliyiyi, posted 2016-10-20 19:27:
In the R community, there’s this one guy, Hadley Wickham, who by himself made R great again. One of the m ...
Thanks to the original poster for sharing, this is good material!

Reply #2
janyiyi, posted 2016-10-29 20:55:09
Thanks for sharing.
