Transformation
The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must (see the sketch after this list):
· Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
· Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.
· Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))), never True.
· (Optionally) operate on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.
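The following minimal sketch (using a small hypothetical DataFrame df with columns key and value, not the data from this section) illustrates what a same-size result, a broadcast scalar result, and a non-in-place fill look like in practice:

import numpy as np
import pandas as pd

# Small hypothetical frame, only for illustrating the rules above.
df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'value': [1.0, np.nan, 2.0, 3.0, 4.0, 5.0]})
grouped = df.groupby('key')

# Same-size result: each value is replaced by its within-group z-score.
zscore = grouped['value'].transform(lambda x: (x - x.mean()) / x.std())

# Broadcastable (scalar) result: the last value of each group is repeated.
last = grouped['value'].transform(lambda x: x.iloc[-1])

# No in-place mutation: return a new, filled Series rather than modifying x.
filled = grouped['value'].transform(lambda x: x.fillna(x.mean()))

print(pd.DataFrame({'value': df['value'], 'zscore': zscore,
                    'last_in_group': last, 'filled': filled}))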
For example, suppose we wished to standardize the data within each group:
index = pd.date_range('10/1/1999', periods=1100)
This creates a daily DatetimeIndex covering 1100 days starting from 10/1/1999.
In [90]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
This creates a Series ts of 1100 values drawn from a normal distribution with mean 0.5 and standard deviation 2, indexed by index.
In [91]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()
Compute a 100-day rolling mean (requiring at least 100 observations per window) and drop the leading NaN values.
After the dropna, the head of the series contains no NaN:
In [92]: ts.head()
Out[92]:
2000-01-08 0.779333
2000-01-09 0.778852
2000-01-10 0.786476
2000-01-11 0.782797
2000-01-12 0.798110
Freq: D, dtype: float64
In [93]: ts.tail()
Out[93]:
2002-09-30 0.660294
2002-10-01 0.631095
2002-10-02 0.673601
2002-10-03 0.709213
2002-10-04 0.719369
Freq: D, dtype: float64
In [94]: transformed = (ts.groupby(lambda x: x.year)
   ....:                    .transform(lambda x: (x - x.mean()) / x.std()))
The series is grouped by the year of each date, and within each year group the values are standardized; inside the lambda, x is the chunk of values belonging to one group.
We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check:
# Original Data
In [95]: grouped = ts.groupby(lambda x: x.year)
Group ts by year again, then compute the per-group mean and standard deviation.
In [96]: grouped.mean()
Out[96]:
2000 0.442441
2001 0.526246
2002 0.459365
dtype: float64
In [97]: grouped.std()
Out[97]:
2000 0.131752
2001 0.210945
2002 0.128753
dtype: float64
Grouping the transformed data and computing its mean and standard deviation confirms mean 0 (up to floating-point error) and standard deviation 1 in each group.
# Transformed Data
In [98]: grouped_trans = transformed.groupby(lambda x: x.year)
In [99]: grouped_trans.mean()
Out[99]:
2000 1.168208e-15
2001 1.454544e-15
2002 1.726657e-15
dtype: float64
In [100]: grouped_trans.std()
Out[100]:
2000 1.0
2001 1.0
2002 1.0
dtype: float64
We can also visually compare the original and transformed data sets.
In [101]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})
Building a DataFrame like this is a convenient way to compare two columns of numbers with the plot call below.
Note that ts and transformed are both Series and share the same index.
In [102]: compare.plot()
Out[102]: <matplotlib.axes._subplots.AxesSubplot at 0x7f6574fbfd10>
Transformation functions that have lower dimension outputs are broadcast to match the shape of the input array.
In [103]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Again grouping by year, each value is replaced by its group's max minus min, a scalar that is broadcast across the whole group.
Out[103]:
2000-01-08 0.623893
2000-01-09 0.623893
2000-01-10 0.623893
2000-01-11 0.623893
2000-01-12 0.623893
...
2002-09-30 0.558275
2002-10-01 0.558275
2002-10-02 0.558275
2002-10-03 0.558275
2002-10-04 0.558275
Freq: D, Length: 1001, dtype: float64
Alternatively, the built-in methods could be used to produce the same outputs.
In [104]: max = ts.groupby(lambda x: x.year).transform('max')
In [105]: min = ts.groupby(lambda x: x.year).transform('min')
In [106]: max - min
Out[106]:
2000-01-08 0.623893
2000-01-09 0.623893
2000-01-10 0.623893
2000-01-11 0.623893
2000-01-12 0.623893
...
2002-09-30 0.558275
2002-10-01 0.558275
2002-10-02 0.558275
2002-10-03 0.558275
2002-10-04 0.558275
Freq: D, Length: 1001, dtype: float64
Another common data transform is to replace missing data with the group mean.
In [107]: data_df
Out[107]:
A B C
0 1.539708 -1.166480 0.533026
1 1.302092 -0.505754 NaN
2 -0.371983 1.104803 -0.651520
3 -1.309622 1.118697 -1.161657
4 -1.924296 0.396437 0.812436
.. ... ... ...
995 -0.093110 0.683847 -0.774753
996 -0.185043 1.438572 NaN
997 -0.394469 -0.642343 0.011374
998 -1.174126 1.857148 NaN
999 0.234564 0.517098 0.393534
[1000 rows x 3 columns]
In [108]: countries = np.array(['US', 'UK', 'GR', 'JP'])
In [109]: key = countries[np.random.randint(0, 4, 1000)]
In [110]: grouped = data_df.groupby(key)
# Non-NA count in each group
In [111]: grouped.count()
Out[111]:
A B C
GR 209 217 189
JP 240 255 217
UK 216 231 193
US 239 250 217
In [112]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
We can verify that the group means have not changed in the transformed data and that the transformed data contains no NAs.
In [113]: grouped_trans = transformed.groupby(key)
In [114]: grouped.mean() # original group means
Out[114]:
A B C
GR -0.098371 -0.015420 0.068053
JP 0.069025 0.023100 -0.077324
UK 0.034069 -0.052580 -0.116525
US 0.058664 -0.020399 0.028603
In [115]: grouped_trans.mean() # transformation did not change group means
Out[115]:
A B C
GR -0.098371 -0.015420 0.068053
JP 0.069025 0.023100 -0.077324
UK 0.034069 -0.052580 -0.116525
US 0.058664 -0.020399 0.028603
In [116]: grouped.count() # original has some missing data points
Out[116]:
A B C
GR 209 217 189
JP 240 255 217
UK 216 231 193
US 239 250 217
In [117]: grouped_trans.count() # counts after transformation
Out[117]:
A B C
GR 228 228 228
JP 267 267 267
UK 247 247 247
US 258 258 258
In [118]: grouped_trans.size() # Verify non-NA count equals group size
Out[118]:
GR 228
JP 267
UK 247
US 258
dtype: int64
Some functions will automatically transform the input when applied to a GroupBy object, returning an object of the same shape as the original. Passing as_index=False will not affect these transformation methods.
For example: fillna, ffill, bfill, shift.
Forward fill (ffill), shown below, fills missing values forward within each group.
In [119]: grouped.ffill()
Out[119]:
A B C
0 1.539708 -1.166480 0.533026
1 1.302092 -0.505754 0.533026
2 -0.371983 1.104803 -0.651520
3 -1.309622 1.118697 -1.161657
4 -1.924296 0.396437 0.812436
.. ... ... ...
995 -0.093110 0.683847 -0.774753
996 -0.185043 1.438572 -0.774753
997 -0.394469 -0.642343 0.011374
998 -1.174126 1.857148 -0.774753
999 0.234564 0.517098 0.393534
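As a quick check of the as_index=False remark above, the following minimal sketch (on a small hypothetical frame, not the data_df used in this section) shows that it does not change the result of a transformation-like method such as ffill:

import numpy as np
import pandas as pd

# Small hypothetical frame with one missing value in each group.
df = pd.DataFrame({'key': ['x', 'x', 'y', 'y'],
                   'val': [1.0, np.nan, 2.0, np.nan]})

filled_default = df.groupby('key').ffill()
filled_no_index = df.groupby('key', as_index=False).ffill()

# Both calls return the filled values with the original row index;
# as_index=False makes no difference for transformation methods like ffill.
print(filled_default.equals(filled_no_index))  # expected: True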