Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. These operations are similar to the aggregating API, the window functions API, and the resample API.
An obvious one is aggregation via the aggregate() method, or equivalently agg():
In [62]: grouped = df.groupby('A')  # group the rows by column A
In [63]: grouped.aggregate(np.sum)  # sum each column within each group of A
Out[63]:
            C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590
In [64]: grouped = df.groupby(['A', 'B'])
In [65]: grouped.aggregate(np.sum)
Out[65]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
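The DataFrame df used in this session is defined earlier in the guide, so the transcript cannot be run on its own. As a minimal, self-contained sketch, the snippet below builds a small hypothetical frame (the name toy and its values are made up for illustration and are not the data shown above) and aggregates it by one key and by two keys. The string "sum" is passed in place of np.sum; both name the same reduction here.

```python
import pandas as pd

# Hypothetical toy data; NOT the DataFrame used in the session above.
toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Single key. Column B is left out so that only the numeric columns are summed.
by_a = toy.groupby("A")[["C", "D"]].aggregate("sum")
by_a_short = toy.groupby("A")[["C", "D"]].agg("sum")  # agg() is the short spelling

# Two keys. The result is indexed by a MultiIndex with levels A and B.
by_ab = toy.groupby(["A", "B"]).agg("sum")

print(by_a)
print(by_ab)
```

In both results the group names end up in the index rather than in ordinary columns.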
As you can see, the result of the aggregation has the group names as the new index along the grouped axis. With multiple keys, the result has a MultiIndex by default, though this can be changed with the as_index option:
In [66]: grouped = df.groupby(['A', 'B'], as_index=False)
In [67]: grouped.aggregate(np.sum)
Out[67]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
In [68]: df.groupby('A', as_index=False).sum()
Out[68]:
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590
Note that you could use the DataFrame method reset_index to achieve the same result, because the grouping column names are stored in the resulting MultiIndex:
In [69]: df.groupby(['A', 'B']).sum().reset_index()  # reset_index() takes the place of as_index=False
Out[69]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
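To make the equivalence concrete, here is a small sketch that computes the grouped sum both ways and compares the rows. The frame toy is hypothetical illustration data, not the df from the session above.

```python
import pandas as pd

# Hypothetical toy data; NOT the DataFrame used in the session above.
toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Option 1: ask groupby to keep the keys as ordinary columns.
flat = toy.groupby(["A", "B"], as_index=False).sum()

# Option 2: aggregate with the default MultiIndex, then move it back into columns.
via_reset = toy.groupby(["A", "B"]).sum().reset_index()

# Compare the two results row by row.
rows_match = flat.to_dict("records") == via_reset.to_dict("records")
print(flat)
print("same rows:", rows_match)
```

Either form works; as_index=False avoids building the MultiIndex in the first place, while reset_index() is handy when the grouped result already exists.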
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index is the group names and whose values are the sizes of each group.
In [70]: grouped.size()
Out[70]:
A    B
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
In [71]: grouped.describe()
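As a runnable illustration of size() and describe(), the sketch below uses a small hypothetical frame (toy is made-up data, not the df from this session). It is meant only to show the shape of the results, not to reproduce the output above.

```python
import pandas as pd

# Hypothetical toy data; NOT the DataFrame used in the session above.
toy = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# size() returns a Series: the index holds the group names, the values the row counts.
sizes = toy.groupby(["A", "B"]).size()

# describe() returns one row of summary statistics per group.
summary = toy.groupby("A")["C"].describe()

print(sizes)
print(summary)
```

The group sizes always add up to the number of rows in the original frame, which makes size() a quick sanity check on a grouping.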
Note
Aggregation functions will not return the groups that you are aggregating over as columns if they are named columns and as_index=True (the default). The grouped columns will instead be the index of the returned object.
Passing as_index=False will return the groups that you are aggregating over as columns, if they are named columns.
Function | Description |
mean() | Compute mean of groups |
sum() | Compute sum of group values |
size() | Compute group sizes |
count() | Compute count of group |
std() | Standard deviation of groups |
var() | Compute variance of groups |
sem() | Standard error of the mean of groups |
describe() | Generates descriptive statistics |
first() | Compute first of group values |
last() | Compute last of group values |
nth() | Take nth value, or a subset if n is a list |
min() | Compute min of group values |
max() | Compute max of group values |
The aggregating functions above will exclude NA values. Any function that reduces a Series to a scalar value is an aggregation function and will work; a trivial example is df.groupby('A').agg(lambda ser: 1). Note that nth() can act as a reducer or a filter; see the pandas documentation on nth() for details.
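The following sketch illustrates both points on a hypothetical frame (toy and its values are invented for this example): the built-in reductions skip NA values, and any function that maps a Series to a scalar can be passed to agg(). One detail worth noting is that size() counts rows, so it includes rows whose value is NA, whereas count() does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group "foo".
toy = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "C": [1.0, np.nan, 3.0, 5.0],
})
g = toy.groupby("A")["C"]

total = g.sum()      # NaN is skipped: foo -> 1.0, bar -> 8.0
avg = g.mean()       # foo -> 1.0 (only one non-NA value), bar -> 4.0
n_valid = g.count()  # non-NA values only: foo -> 1, bar -> 2
n_rows = g.size()    # all rows, NA included: foo -> 2, bar -> 2

# A custom reducer: the range (max minus min) of each group.
spread = g.agg(lambda ser: ser.max() - ser.min())

print(pd.DataFrame({"sum": total, "mean": avg, "count": n_valid,
                    "size": n_rows, "spread": spread}))
```

The lambda is called once per group with that group's values as a Series, so it can use any Series method that returns a single value.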