Original poster: SASCHEN

[Discussion] Cluster methodology

Hi all,

I am trying to run a cluster analysis on US population data. My main interest is to group states according to variables such as total population, growth rates of income and population, shares of different minority groups, and so on. I would like to check with you whether I am on the right track or need to correct my analysis:

  1. First, I standardized my dataset, because the variables are on very different scales.
  2. I made some histograms to look at the shape of the data. The data do not follow a normal distribution.
  3. I used PROC CLUSTER with the CCC criterion to choose the number of clusters.
  4. Finally, I ran PROC FASTCLUS with the number of clusters determined in step 3.
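
For readers following along, the steps listed above could look roughly like this in SAS. This is only a sketch: the data set name CLUSTER and the variable names are borrowed from the PROC UNIVARIATE call later in this thread, while the identifier variable STATE, the choice of METHOD=WARD, and the value MAXCLUSTERS=4 are assumptions made purely for illustration.

```sas
/* Variable list borrowed from later in the thread; STATE is an assumed ID variable. */
%let vars = popc00 pespopc00_hi gropop0009 groinc0009_per hispanic_mex_2;

/* Step 1: standardize each variable to mean 0 and standard deviation 1. */
proc stdize data=cluster method=std out=cluster_std;
   var &vars;
run;

/* Step 3: hierarchical clustering. CCC and PSEUDO add the cubic clustering
   criterion and the pseudo F / pseudo t-squared statistics to the history table.
   METHOD=WARD is an assumption; the post does not say which method was used. */
proc cluster data=cluster_std method=ward ccc pseudo print=15 outtree=tree;
   var &vars;
   copy state;
run;

/* Step 4: k-means with the number of clusters read off the step 3 output.
   The value 4 is a placeholder, not a recommendation. */
proc fastclus data=cluster_std maxclusters=4 maxiter=50 out=fc_out;
   var &vars;
run;
```

The OUTTREE= and OUT= data sets are kept so that the hierarchical and k-means assignments can be compared afterwards.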

I really appreciate your support. Many thanks.

Miguel.

2
SASCHEN posted on 2005-9-25 11:43:00

Hi.

Your cluster analysis methodology is a good beginning, and if you like the clusters it gives you, then you're set. The beauty of having a small data set to work with (it sounds like you've got each of the states in the USA as a single point, so you only have around 50 data points, a few more if you include D.C., Puerto Rico, etc.) is that you can try a bunch of different things, and they all work reasonably quickly.

If I understand correctly, you used PROC FASTCLUS after PROC CLUSTER in order to generate centroids? Again, this is the luxury afforded by a small data set. Did you check afterwards to see how closely the FASTCLUS clusters agreed with the clusters you chose from CLUSTER? If the agreement is not so good, then the FASTCLUS centroids may not be very useful.
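
One way to check that agreement is to cut the CLUSTER tree at the chosen number of clusters and cross-tabulate the two assignments. The sketch below rests on several assumptions: TREE is an OUTTREE= data set from PROC CLUSTER that was run with a COPY STATE statement, FC_OUT is the OUT= data set from PROC FASTCLUS, STATE is a hypothetical identifier variable, and 4 is a placeholder cluster count.

```sas
/* Cut the hierarchical tree into 4 clusters; COPY carries STATE into the output. */
proc tree data=tree nclusters=4 out=hc_out noprint;
   copy state;
run;

proc sort data=hc_out; by state; run;
proc sort data=fc_out; by state; run;

/* Put both cluster assignments side by side, one row per state. */
data compare;
   merge hc_out(keep=state cluster rename=(cluster=hc_cluster))
         fc_out(keep=state cluster rename=(cluster=fc_cluster));
   by state;
run;

/* Cluster numbers are arbitrary labels, so look for one dominant cell
   per row and column rather than for a filled diagonal. */
proc freq data=compare;
   tables hc_cluster*fc_cluster / norow nocol nopercent;
run;
```

If many states are scattered across cells instead of concentrating in one cell per row, the two methods disagree.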

When you have medium to large data sets (more than 10,000 records), you can't really run PROC CLUSTER first. What some people do is to run PROC FASTCLUS to generate a lot of "small" clusters (say, 500 to 1,000) and then use the centroids as new data points, and the number of points in each cluster as frequencies for the centroids. Then run PROC CLUSTER on that data set.
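
A rough sketch of that two-stage approach follows. The data set BIG_STD (assumed to be already standardized) and the variables X1-X6 are hypothetical names, and 500 is just one value from the range mentioned above.

```sas
/* Stage 1: many small preliminary clusters. MEAN= saves one row per cluster
   holding the centroid, the cluster size (_FREQ_) and the within-cluster
   root-mean-square standard deviation (_RMSSTD_). */
proc fastclus data=big_std maxclusters=500 maxiter=20 mean=centroids noprint;
   var x1-x6;
run;

/* Stage 2: hierarchical clustering of the centroids. FREQ weights each centroid
   by the number of records it represents; RMSSTD passes along the spread
   inside each preliminary cluster. */
proc cluster data=centroids method=ward ccc pseudo outtree=tree2;
   var x1-x6;
   freq _freq_;
   rmsstd _rmsstd_;
run;
```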

And, of course, there is the "forgotten" clustering procedure, PROC MODECLUS, which is faster than CLUSTER and uses nonparametric tests to try to determine the best number of clusters.

But I'm getting ahead of myself, because data preparation is so critical to achieving your results. You understand the reasons for standardization, but as you've noted in your related posts, standardizing can swamp out non-spherical shapes (such as elongated ellipses). If you play around with shapes on a piece of paper, you can see for yourself the kinds of delicate situations that can arise. Draw two elongated, needle-like parallel clusters. As they get closer together and more elongated, the typical "vanilla" options like k-means will drastically rearrange what the clusters are. (It's tough to guard against this in higher dimensions, because it's hard to visualize the data.) ACECLUS can help you maintain ellipsoidal clusters; read the documentation and experiment.
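
As a starting point for that experimenting, a minimal ACECLUS sketch might look like the following. The data set CLUSTER and the variable names come from later in this thread; PROPORTION=0.03 and METHOD=WARD are arbitrary illustrative choices, not tuned recommendations.

```sas
%let vars = popc00 pespopc00_hi gropop0009 groinc0009_per hispanic_mex_2;

/* ACECLUS estimates a pooled within-cluster covariance matrix without knowing
   the clusters, and writes canonical variables CAN1, CAN2, ... in which
   elliptical clusters of similar shape become roughly spherical. */
proc aceclus data=cluster proportion=0.03 out=ace;
   var &vars;
run;

/* Cluster on the canonical variables instead of the original ones. */
proc cluster data=ace method=ward ccc pseudo outtree=tree_ace;
   var can1-can5;
run;
```

Rerunning with a few different PROPORTION= values and checking whether the resulting clusters are stable is a reasonable sanity check.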

I'm sorta rambling here, but there are two more related things I want to point out. Outliers. Some clustering methods claim to be immune to outliers, but many of the SAS methods use squared Euclidean distance as their default, and that will be sensitive to outliers. If you toss out outliers and get very different clusters, then the outliers probably were messing you up. You mentioned that your data is not normally distributed; if the tails are fat, then again you need to be careful about the effect on distances. You may want to ditch the squared Euclidean distances. CLUSTER and MODECLUS are quite flexible in the kinds of distances they allow (you can generate all kinds of distance metrics with the %DISTANCE macro, or PROC DISTANCE in SAS 9). FASTCLUS is less flexible about distance, but you can play with the "homotopy parameter."
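
To make the distance point concrete, here is a sketch of two ways to move away from squared Euclidean distance. It assumes a standardized data set CLUSTER_STD with a hypothetical identifier variable STATE; the variable names come from later in this thread, and METHOD=AVERAGE and MAXCLUSTERS=4 are illustrative choices only.

```sas
%let vars = popc00 pespopc00_hi gropop0009 groinc0009_per hispanic_mex_2;

/* City-block (L1) distances are less dominated by a single extreme value
   than squared Euclidean distances. */
proc distance data=cluster_std method=cityblock out=dist;
   var interval(&vars);
   id state;
run;

/* PROC CLUSTER accepts the distance matrix directly. Average linkage is used
   here because Ward's method assumes Euclidean distances. */
proc cluster data=dist method=average outtree=tree_l1;
   id state;
run;

/* FASTCLUS has no general distance option, but LEAST=1 makes it minimize
   absolute rather than squared deviations. */
proc fastclus data=cluster_std maxclusters=4 least=1 maxiter=50 out=fc_l1;
   var &vars;
run;
```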

Good luck, I hope I haven't caused more confusion...

3
SASCHEN posted on 2005-9-25 11:44:00

As they used to shout on Family Feud, "Good answer! Good answer!" Now I can go to WUSS in San Jose, knowing that you will cover statistical issues for me while I'm gone. :-)

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

4
SASCHEN posted on 2005-9-25 11:45:00

Hi,

I would like to get information about elliptical clusters. I read in the SAS help that the ACECLUS procedure solves this problem, but I need more information about elliptical clusters...

Many thanks

Miguel.

5
SASCHEN posted on 2005-9-25 11:47:00

Elliptical clusters are just ellipsoids. They're shaped like a football in American football, instead of like a sphere. In other words, they're the way we think of the 'shape' of multivariate normal distributions.

I don't think that PROC ACECLUS is really the solution to your problem. I assume, based on your other email, that you only have 6 variables and 58 records. I would suggest that you stick with PROC CLUSTER, which does a lot more for you than PROC ACECLUS. Do you really think that you have extremely elliptical but multivariate normal distributions for your data? What do your plots of the data look like?

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

6
SASCHEN posted on 2005-9-25 11:49:00

Hi, David,

Do you think that the best way to check the distribution is

proc univariate data=cluster noprint;
   var popc00 pespopc00_hi gropop0009 groinc0009_per hispanic_mex_2;
   histogram;
run;

Or is there a better option?

7
SASCHEN posted on 2005-9-25 11:50:00

I would recommend trying something like SAS/INSIGHT if you can, so that you can look at a scatterplot matrix of the data. You need to consider more than just univariate slices of your data, or you'll never be able to see what's going on. As an example of this, take Fisher's iris data - the classic in cluster analysis - which has three distinct clusters that pretty much any clustering method can catch. Now look at the four variables using PROC UNIVARIATE. Blech. Even scatterplot matrices don't show everything.
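
If SAS/INSIGHT is not available, a scatterplot matrix can also be drawn with PROC SGSCATTER. Note that this is a swapped-in alternative that assumes a newer release (SAS 9.2 or later) with ODS Graphics; it is not the interactive tool recommended above. The sketch uses the SASHELP.IRIS sample data for the iris example, and the data set CLUSTER with the variable names from earlier in the thread for the states data.

```sas
/* Fisher's iris data: the species separate visibly in the pairwise panels,
   even though the diagonal (univariate) histograms look uninformative. */
proc sgscatter data=sashelp.iris;
   matrix SepalLength SepalWidth PetalLength PetalWidth
          / group=Species diagonal=(histogram);
run;

/* The same kind of plot for the states data. */
proc sgscatter data=cluster;
   matrix popc00 pespopc00_hi gropop0009 groinc0009_per hispanic_mex_2
          / diagonal=(histogram);
run;
```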

David L Cassell
