Original poster: CDA网校


[A Data Analyst a Day] Determining the number of clusters for K-means cluster analysis in R

Posted on 2022-5-27 09:20:44


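Method 1:

K-means algorithm (K-means cluster analysis)
In the sum-of-squared-errors plot below, the x-axis position of the bend (the "elbow") is the number of clusters that suits k-means: beyond that point, adding more clusters barely reduces the within-cluster sum of squares.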

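> n = 100
> g = 6
> set.seed(g)
> # Simulate g groups of 2-D points; on each axis, group i is centered at an independent runif(1)*i^2
> d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
> plot(d)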

> mydata <- d
> # For k = 1 the within-cluster sum of squares equals the total sum of squares
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:15)
+ wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
> ### wss is the within-cluster sum of squares
> plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

The plot above shows that this method gives 4 as a reasonable number of clusters.
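Because kmeans starts from random centers, the elbow curve can shift slightly from run to run. The following is a minimal sketch of a more stable variant, not the original poster's code: the data set `toy` and the helper `wss_curve` are assumptions made up for illustration, while `nstart` and `tot.withinss` are standard parts of `kmeans` in base R.

```r
# Synthetic data (an assumption for illustration): three well-separated 2-D groups.
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

# Hypothetical helper: total within-cluster sum of squares for k = 1..max_k.
# nstart = 25 reruns kmeans from 25 random starts and keeps the best fit,
# so one unlucky initialization does not distort the curve.
wss_curve <- function(x, max_k = 10, nstart = 25) {
  # k = 1: the within-cluster sum of squares is the total sum of squares.
  wss <- (nrow(x) - 1) * sum(apply(x, 2, var))
  for (k in 2:max_k) {
    wss[k] <- kmeans(x, centers = k, nstart = nstart)$tot.withinss
  }
  wss
}

wss_toy <- wss_curve(toy, max_k = 8)
plot(1:8, wss_toy, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

On this toy data the curve drops sharply up to k = 3 and flattens afterwards, which is the elbow pattern described above.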
Method 2:

K-medoids clustering algorithm (K-medoids)
Use the pamk function from the fpc package to estimate the number of clusters:

> library(cluster)
Warning message:
package ‘cluster’ was built under R version 3.2.3
> library(fpc)
> pamk.best <- pamk(d)
> cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
number of clusters estimated by optimum average silhouette width: 4
> plot(pam(d, pamk.best$nc))

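The silhouette value measures how strongly an object coheres with its own cluster and how well it is separated from the other clusters. It ranges from -1 to 1; the larger the value, the better the object fits its own cluster and the worse it fits the neighboring clusters.
So the silhouette plot above shows that this method gives 4 as a reasonable number of clusters.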
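To make the silhouette definition concrete, here is a minimal base-R sketch that computes the widths directly from pairwise distances and picks the k with the largest average. It is an illustration only: the data set `toy` and the helper `silhouette_widths` are hypothetical, and it uses kmeans rather than the pam/pamk calls from the post. In practice the `silhouette` function from the cluster package does the same job.

```r
# Synthetic data (an assumption for illustration): three separated 2-D groups.
set.seed(2)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.4), ncol = 2)
)

# Hypothetical helper: silhouette width of every point, from the definition
#   a(i) = mean distance from i to the other members of its own cluster
#   b(i) = smallest mean distance from i to the members of any other cluster
#   s(i) = (b(i) - a(i)) / max(a(i), b(i))
silhouette_widths <- function(x, labels) {
  dm  <- as.matrix(dist(x))
  ids <- sort(unique(labels))
  sapply(seq_len(nrow(dm)), function(i) {
    own <- labels == labels[i]
    if (sum(own) == 1) return(0)            # convention for singleton clusters
    a <- sum(dm[i, own]) / (sum(own) - 1)   # dm[i, i] is 0, so it adds nothing
    b <- min(sapply(ids[ids != labels[i]],
                    function(cl) mean(dm[i, labels == cl])))
    (b - a) / max(a, b)
  })
}

# Average silhouette width for k = 2..6; the largest value marks the best k.
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette_widths(toy, fit$cluster))
})
best_k <- ks[which.max(avg_sil)]
cat("best k by average silhouette width:", best_k, "\n")
```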
Method 3:
Based on the Calinski criterion

> require(vegan)
Loading required package: vegan
Loading required package: permute
Loading required package: lattice
This is vegan 2.4-0
Warning messages:
1: package ‘vegan’ was built under R version 3.2.5
2: package ‘permute’ was built under R version 3.2.5
3: package ‘lattice’ was built under R version 3.2.3
> fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
> plot(fit, sortg = TRUE, grpmts.plot = TRUE)
> calinski.best <- as.numeric(which.max(fit$results[2,]))
> cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
Calinski criterion optimal number of clusters: 5

As the plot above shows, the Calinski criterion gives 5 clusters.
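For readers who want to see what the criterion measures, the Calinski-Harabasz index for k clusters on n points is (BSS / (k - 1)) / (WSS / (n - k)), where BSS and WSS are the between-cluster and within-cluster sums of squares. The sketch below computes it with base R's kmeans instead of vegan's cascadeKM; the data set `toy` and the helper `ch_index` are assumptions for illustration.

```r
# Synthetic data (an assumption for illustration): three separated 2-D groups.
set.seed(3)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.4), ncol = 2)
)

# Hypothetical helper: Calinski-Harabasz index of a k-means fit.
# Larger values mean tighter, better separated clusters.
# The index is undefined for k = 1 because of the (k - 1) divisor.
ch_index <- function(x, k, nstart = 25) {
  fit <- kmeans(x, centers = k, nstart = nstart)
  n <- nrow(x)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ks <- 2:6
ch <- sapply(ks, function(k) ch_index(toy, k))
best_k <- ks[which.max(ch)]
cat("Calinski-Harabasz optimal number of clusters:", best_k, "\n")
```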
Method 4:

Model-based clustering, using the mclust package:

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.1
Type 'citation("mclust")' for citing this R package in publications.
Warning message:
package ‘mclust’ was built under R version 3.2.4
> d_clust <- Mclust(as.matrix(d), G=1:20)
> m.best <- dim(d_clust$z)[2]
> cat("model-based optimal number of clusters:", m.best, "\n")
model-based optimal number of clusters: 4
> plot(d_clust)
Model-based clustering plots:

1: BIC
2: classification
3: uncertainty
4: density

Method 5:

Clustering based on the AP (affinity propagation) algorithm

> library(apcluster)

Attaching package: ‘apcluster’

The following object is masked from ‘package:stats’:

heatmap

Warning message:
package ‘apcluster’ was built under R version 3.2.5

> d.apclus <- apcluster(negDistMat(r=2), d)
> cat("affinity propagation optimal number of clusters:", length(d.apclus@clusters), "\n")
affinity propagation optimal number of clusters: 4
> # 4 is the resulting number of clusters
> heatmap(d.apclus)
> plot(d.apclus, d)
