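My own understanding is that the values in this column differ too much from one another and are polarized, so he wanted to trim down the taller bars in the histogram.
Here is part of the data before processing: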
> head(t$capital_gain,500)
[1] 0 0 0 7688 0 0 0 3103 0 0
[11] 6418 0 0 0 3103 0 0 0 0 0
[21] 0 0 0 0 0 7298 0 0 0 0
[31] 7688 0 0 0 0 0 0 0 0 0
[41] 0 0 0 0 0 0 0 3908 0 0
[51] 0 0 0 14084 0 0 0 3103 5178 0
[61] 0 0 0 0 0 0 15024 0 0 0
[71] 15024 0 0 0 0 0 0 0 0 0
[81] 0 0 0 99999 0 0 0 0 0 7688
[91] 0 0 5178 0 0 0 0 0 0 0
[101] 0 0 0 0 0 0 0 2597 0 0
[111] 0 0 0 0 0 0 0 0 0 0
[121] 0 0 0 0 0 0 0 0 0 0
[131] 0 0 0 0 0 0 0 0 0 0
[141] 0 0 7688 0 0 0 0 0 15024 2907
[151] 0 0 0 0 0 0 0 0 0 0
[161] 0 0 0 0 0 0 0 0 0 0
[171] 0 0 0 0 0 0 0 0 0 0
[181] 4650 15024 0 0 0 0 0 0 0 0
[191] 0 0 0 0 0 0 0 0 0 0
[201] 0 0 0 0 0 0 0 0 0 0
[211] 0 0 6497 0 0 0 0 0 0 0
[221] 0 0 0 0 7688 0 15024 0 1055 0
[231] 0 0 0 0 0 0 5013 0 0 0
[241] 0 0 0 0 0 0 0 0 0 0
[251] 0 0 0 0 0 0 4650 0 0 0
[261] 0 0 0 0 0 3103 0 0 0 0
[271] 0 0 0 0 0 0 0 0 15024 0
[281] 0 0 0 0 0 0 0 0 0 0
[291] 0 0 0 0 0 0 0 0 27828 0
[301] 0 0 0 0 0 0 0 0 0 3103
[311] 0 0 0 4934 4064 0 0 0 0 0
[321] 0 0 0 0 0 0 0 0 0 15024
[331] 0 0 0 3674 0 0 0 0 0 0
[341] 0 2174 0 0 10605 0 99999 5178 0 0
[351] 0 0 0 0 0 0 0 99999 0 0
[361] 0 0 0 0 0 0 0 0 0 0
[371] 0 0 0 0 0 0 0 0 0 0
[381] 0 0 0 0 0 0 0 0 0 0
[391] 0 0 0 0 3418 0 0 0 0 0
[401] 0 0 0 0 0 0 0 0 0 0
[411] 0 0 1055 0 0 0 0 0 99999 0
[421] 0 0 0 0 114 0 0 0 0 0
[431] 0 0 0 0 0 0 0 0 0 0
[441] 2580 0 0 0 0 0 3411 0 0 2174
[451] 0 0 0 0 0 0 0 0 0 0
[461] 0 0 0 0 0 0 0 0 0 0
[471] 0 0 0 0 2907 0 4508 0 0 0
[481] 0 0 0 27828 0 0 0 0 0 0
[491] 0 0 0 0 0 0 0 0 0 0
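One way to put a number on the "polarized" impression is to look at the share of exact zeros and at the deciles. The following is only a small sketch in base R; it uses the first 30 values printed above, and the variable names `x`, `zero_share` and `deciles` are made up for illustration.

```r
# First 30 values of the capital_gain sample printed above
x <- c(0, 0, 0, 7688, 0, 0, 0, 3103, 0, 0,
       6418, 0, 0, 0, 3103, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 7298, 0, 0, 0, 0)

# Fraction of observations that are exactly zero
zero_share <- mean(x == 0)

# Deciles of the sample; many of them coincide when the data are mostly zeros
deciles <- quantile(x, probs = seq(0, 1, by = 0.1))

print(zero_share)
print(deciles)
```

If most deciles are equal to 0, any quantile-based grouping has very few distinct cut points to work with, which may be why so few groups appear after binning.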
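Below is the histogram of this column:
Following his approach, I applied cut2():
> library(Hmisc)  # cut2() comes from the Hmisc package
> s <- as.numeric(cut2(t$capital_gain, g = 10))
> head(s, 500)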
[1] 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
[31] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 2 2 1
[61] 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2
[91] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
[121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2
[151] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[181] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[211] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1
[241] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1
[271] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
[301] 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[331] 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1
[361] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[391] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1
[421] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2
[451] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
[481] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
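For reference, cut2() returns a factor, and as.numeric() on a factor gives the integer codes of its levels, not a summary of the values in each group. The sketch below illustrates this with base R's cut() as a stand-in; the break points are picked by hand for the example and are not what cut2(g = 10) would compute, and the labels "zero" and "positive" are made up.

```r
# A few values in the style of the capital_gain column
x <- c(0, 0, 0, 7688, 0, 3103, 0, 0, 99999, 0)

# Hand-picked breaks (an assumption for illustration): zero vs. positive
f <- cut(x, breaks = c(-Inf, 0, Inf), labels = c("zero", "positive"))

# Integer codes of the factor levels: 1 for "zero", 2 for "positive"
codes <- as.numeric(f)

# The within-group means are a different thing and must be computed separately
group_means <- tapply(x, f, mean)

print(codes)
print(group_means)
```

The codes only record which bin an observation fell into, so the distance between 1 and 2 says nothing about how far apart the original values were.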
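Below is the histogram after processing:
I would like to ask:
1. What is the purpose of this processing, and is it necessary? In what situations should it be done? If I skip it, will it affect the clustering results?
2. From what I found, cut2() does data binning (equal-width binning, as I understood it). After as.numeric() the values show up as 1 and 2. How are these values computed? Are they the within-group means?
3. In terms of kurtosis and skewness, how do the two histograms differ? Which function shows the kurtosis and skewness of a column?
4. In this example g=10. If I bin the data, how many bins would be appropriate? Is there any standard for this?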
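Regarding question 3, skewness and kurtosis can be computed directly from the central moments. The sketch below uses only base R; the helper names `moment_skewness` and `moment_kurtosis` are made up here. Packages such as e1071 and moments also provide skewness() and kurtosis() functions, but conventions differ between implementations (some report excess kurtosis, i.e. subtract 3), so results should be compared with care.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment over variance^2
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Raw values in the style of capital_gain, and hand-made bin codes for them
raw   <- c(0, 0, 0, 7688, 0, 3103, 0, 0, 99999, 0)
codes <- ifelse(raw > 0, 2, 1)

print(c(skew = moment_skewness(raw),   kurt = moment_kurtosis(raw)))
print(c(skew = moment_skewness(codes), kurt = moment_kurtosis(codes)))
```

Applying the same two functions to the column before and after binning gives a numeric comparison of the two histograms.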

