Original poster: sunrun19840306


[Homework] Groupby - Pandas User Guide, translation of the original text, part 10

sunrun19840306 posted on 2020-6-17 14:09:18

Transformation

The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:

· Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).

· Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.

· Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).

· (Optionally) operate on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.
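The first requirement in the list above (same size, or broadcastable to the group size) can be illustrated with a minimal sketch. The toy frame and the names df, key, and val are assumptions made for illustration and do not come from the guide.

```python
import pandas as pd

# Hypothetical toy data: two groups of unequal size.
df = pd.DataFrame({"key": ["a", "a", "a", "b", "b"],
                   "val": [1.0, 2.0, 3.0, 10.0, 20.0]})
grouped = df.groupby("key")["val"]

# Same-size result: each group chunk maps to a chunk of equal length.
demeaned = grouped.transform(lambda x: x - x.mean())

# Scalar result: the last value of each group is broadcast to every row.
last = grouped.transform(lambda x: x.iloc[-1])

print(demeaned.tolist())  # [-1.0, 0.0, 1.0, -5.0, 5.0]
print(last.tolist())      # [3.0, 3.0, 3.0, 20.0, 20.0]
```

In both cases the result keeps the index of the original frame, which is what distinguishes transform from an aggregation.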

For example, suppose we wished to standardize the data within each group:

In [89]: index = pd.date_range('10/1/1999', periods=1100)

This generates a daily date index covering 1100 days.

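In [90]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)

This builds the series ts from 1100 draws of a normal distribution with mean 0.5 and standard deviation 2 (the second argument of np.random.normal is the standard deviation, not the variance), using index as the series index.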

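In [91]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()

This takes a 100-day rolling mean, requires a full 100 observations in each window, and drops the NaN values that result for the first 99 rows.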

In [92]: ts.head()

Because of the dropna call, the first five rows contain no NaN values.

Out[92]:

2000-01-08 0.779333

2000-01-09 0.778852

2000-01-10 0.786476

2000-01-11 0.782797

2000-01-12 0.798110

Freq: D, dtype: float64

In [93]: ts.tail()

Out[93]:

2002-09-30 0.660294

2002-10-01 0.631095

2002-10-02 0.673601

2002-10-03 0.709213

2002-10-04 0.719369

Freq: D, dtype: float64

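In [94]: transformed = (ts.groupby(lambda x: x.year)
   ....:                .transform(lambda x: (x - x.mean()) / x.std()))

The series is grouped by the year taken from its index, and the values within each group are standardized. Inside the transform lambda, x is the chunk of values belonging to one group.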

We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check:

# Original Data

In [95]: grouped = ts.groupby(lambda x: x.year)

Here ts itself is grouped by year, and the group means and standard deviations are computed next.

In [96]: grouped.mean()

Out[96]:

2000 0.442441

2001 0.526246

2002 0.459365

dtype: float64

In [97]: grouped.std()

Out[97]:

2000 0.131752

2001 0.210945

2002 0.128753

dtype: float64

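Grouping transformed in the same way and computing the mean and standard deviation shows that each group has a mean of 0 (up to floating-point error) and a standard deviation of 1.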

# Transformed Data

In [98]: grouped_trans = transformed.groupby(lambda x: x.year)

In [99]: grouped_trans.mean()

Out[99]:

2000 1.168208e-15

2001 1.454544e-15

2002 1.726657e-15

dtype: float64

In [100]: grouped_trans.std()

Out[100]:

2000 1.0

2001 1.0

2002 1.0

dtype: float64
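The check above can also be run as one self-contained script. This is only a sketch: the seeded generator (np.random.default_rng(0)) and the helper name by_year are assumptions added for reproducibility and readability, so the numbers will not match the guide's output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
index = pd.date_range("1999-10-01", periods=1100)
ts = pd.Series(rng.normal(0.5, 2, 1100), index)
ts = ts.rolling(window=100, min_periods=100).mean().dropna()


def by_year(label):
    """Group key: the calendar year of an index label (hypothetical helper)."""
    return label.year


# Standardize the values within each year.
transformed = ts.groupby(by_year).transform(lambda x: (x - x.mean()) / x.std())

# Regroup the result and confirm mean ~ 0 and standard deviation ~ 1 per group.
check = transformed.groupby(by_year)
means_ok = np.allclose(check.mean(), 0.0, atol=1e-9)
stds_ok = np.allclose(check.std(), 1.0)
print(len(ts), means_ok, stds_ok)
```

Comparing with np.allclose instead of exact equality accounts for group means that come out around 1e-15 rather than exactly 0.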

We can also visually compare the original and transformed data sets.

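In [101]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})

This builds a DataFrame, a common way to put two numeric columns side by side for comparison, and convenient for the plot call that follows.

Note that ts and transformed are both Series and share the same index.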

In [102]: compare.plot()

Out[102]: <matplotlib.axes._subplots.AxesSubplot at 0x7f6574fbfd10>

Transformation functions that have lower dimension outputs are broadcast to match the shape of the input array.

In [103]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Grouping by year again, every value is replaced by the range of its group (maximum minus minimum).

Out[103]:

2000-01-08 0.623893

2000-01-09 0.623893

2000-01-10 0.623893

2000-01-11 0.623893

2000-01-12 0.623893

...

2002-09-30 0.558275

2002-10-01 0.558275

2002-10-02 0.558275

2002-10-03 0.558275

2002-10-04 0.558275

Freq: D, Length: 1001, dtype: float64

Alternatively, the built-in methods could be used to produce the same outputs.

In [104]: max_ts = ts.groupby(lambda x: x.year).transform('max')

In [105]: min_ts = ts.groupby(lambda x: x.year).transform('min')

In [106]: max_ts - min_ts

Out[106]:

2000-01-08 0.623893

2000-01-09 0.623893

2000-01-10 0.623893

2000-01-11 0.623893

2000-01-12 0.623893

...

2002-09-30 0.558275

2002-10-01 0.558275

2002-10-02 0.558275

2002-10-03 0.558275

2002-10-04 0.558275

Freq: D, Length: 1001, dtype: float64
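To see on a small scale that the lambda form and the built-in method names agree, here is a minimal sketch. The five-point series s and the helper by_year are hypothetical and chosen so that the result can be checked by hand.

```python
import pandas as pd

# Hypothetical daily series that straddles two calendar years.
idx = pd.date_range("2000-12-30", periods=5)
s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0], index=idx)


def by_year(label):
    """Group key: the calendar year of an index label (hypothetical helper)."""
    return label.year


g = s.groupby(by_year)

# A scalar per group (the range), broadcast back to every row of the group.
via_lambda = g.transform(lambda x: x.max() - x.min())

# The same result from the built-in 'max' and 'min' transforms.
via_names = g.transform("max") - g.transform("min")

print(via_lambda.tolist())  # [3.0, 3.0, 6.0, 6.0, 6.0]
```

The two rows in 2000 get 4.0 - 1.0 = 3.0 and the three rows in 2001 get 8.0 - 2.0 = 6.0.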

Another common data transform is to replace missing data with the group mean.

In [107]: data_df

Out[107]:

A B C

0 1.539708 -1.166480 0.533026

1 1.302092 -0.505754 NaN

2 -0.371983 1.104803 -0.651520

3 -1.309622 1.118697 -1.161657

4 -1.924296 0.396437 0.812436

.. ... ... ...

995 -0.093110 0.683847 -0.774753

996 -0.185043 1.438572 NaN

997 -0.394469 -0.642343 0.011374

998 -1.174126 1.857148 NaN

999 0.234564 0.517098 0.393534

[1000 rows x 3 columns]

In [108]: countries = np.array(['US', 'UK', 'GR', 'JP'])

In [109]: key = countries[np.random.randint(0, 4, 1000)]

In [110]: grouped = data_df.groupby(key)

# Non-NA count in each group

In [111]: grouped.count()

Out[111]:

A B C

GR 209 217 189

JP 240 255 217

UK 216 231 193

US 239 250 217

In [112]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

We can verify that the group means have not changed in the transformed data and that the transformed data contains no NAs.

In [113]: grouped_trans = transformed.groupby(key)

In [114]: grouped.mean()  # original group means

Out[114]:

A B C

GR -0.098371 -0.015420 0.068053

JP 0.069025 0.023100 -0.077324

UK 0.034069 -0.052580 -0.116525

US 0.058664 -0.020399 0.028603

In [115]: grouped_trans.mean()  # transformation did not change group means

Out[115]:

A B C

GR -0.098371 -0.015420 0.068053

JP 0.069025 0.023100 -0.077324

UK 0.034069 -0.052580 -0.116525

US 0.058664 -0.020399 0.028603

In [116]: grouped.count()  # original has some missing data points

Out[116]:

A B C

GR 209 217 189

JP 240 255 217

UK 216 231 193

US 239 250 217

In [117]: grouped_trans.count()  # counts after transformation

Out[117]:

A B C

GR 228 228 228

JP 267 267 267

UK 247 247 247

US 258 258 258

In [118]: grouped_trans.size()  # Verify non-NA count equals group size

Out[118]:

GR 228

JP 267

UK 247

US 258

dtype: int64
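Since the construction of data_df is not shown in this section, the same imputation can be tried on a small frame. The sketch below uses made-up values and an assumed fixed key array so that the filled values are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the guide's data_df construction is not shown here.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                   "B": [np.nan, 2.0, 4.0, 5.0, 7.0, np.nan]})
key = np.array(["US", "US", "US", "JP", "JP", "JP"])

grouped = df.groupby(key)

# Replace each missing value with the mean of its own group and column.
filled = grouped.transform(lambda x: x.fillna(x.mean()))

print(filled["A"].tolist())  # [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
print(filled["B"].tolist())  # [3.0, 2.0, 4.0, 5.0, 7.0, 6.0]

# Group means are unchanged, and no missing values remain.
same_means = np.allclose(grouped.mean(), filled.groupby(key).mean())
no_missing = not filled.isna().any().any()
print(same_means, no_missing)  # True True
```

Filling with the group mean leaves that mean unchanged because the added values equal the average of the values already present.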

Some functions will automatically transform the input when applied to a GroupBy object, returning an object of the same shape as the original. Passing as_index=False will not affect these transformation methods.

For example: fillna, ffill, bfill, shift.

A forward fill on the grouped object, as below, fills the missing values directly within each group.

In [119]: grouped.ffill()

Out[119]:

A B C

0 1.539708 -1.166480 0.533026

1 1.302092 -0.505754 0.533026

2 -0.371983 1.104803 -0.651520

3 -1.309622 1.118697 -1.161657

4 -1.924296 0.396437 0.812436

.. ... ... ...

995 -0.093110 0.683847 -0.774753

996 -0.185043 1.438572 -0.774753

997 -0.394469 -0.642343 0.011374

998 -1.174126 1.857148 -0.774753

999 0.234564 0.517098 0.393534
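The group-wise behaviour of these methods is easier to see on a frame with interleaved groups. A minimal sketch follows; the frame and its column names key and val are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose two groups are interleaved row by row.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                   "val": [1.0, 10.0, np.nan, np.nan, 3.0]})
g = df.groupby("key")["val"]

# Forward fill stays inside each group: row 2 takes 1.0 from group "a",
# row 3 takes 10.0 from group "b", not the value of the row directly above.
filled = g.ffill()

# Shift also works per group: the first row of each group becomes NaN.
shifted = g.shift(1)

print(filled.tolist())  # [1.0, 10.0, 1.0, 10.0, 3.0]
print(shifted.isna().tolist())  # [True, True, False, False, True]
```

Both results keep the original row order and index, which is the "same shape as the original" property described above.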
