OP: ReneeBK

[Q&A] Weighting variables in two-step cluster analysis


I'm using SPSS to perform two-step cluster analyses. SPSS shows the predictor importance of each variable used in an analysis. Oftentimes, a binary variable like gender (sorry, I'm just keeping it simple!) will be the most important variable in the formation of the clusters, even if you don't want it to be.

Is there a way to weight variables, so that maybe I can downplay, but not eliminate, gender's role in the analysis?

Thank you for the help!


Reply #1, ReneeBK, posted 2014-04-23 02:21:44
One thing to keep in mind before turning to weights is that gender can be considered a "swamping" variable in a two-step cluster analysis. Differences between genders are often large, and thus overpower weaker but still substantively interesting heterogeneity in your data.

Instead of down-weighting gender, you could consider a finite mixture regression model. Finite mixture models perform model-based cluster analysis (clusters are usually assumed to be multivariate Gaussian), and a finite mixture regression model essentially combines a cluster analysis with a regression. In your case, you could use gender as a predictor, perform this analysis, and detect clusters while taking into account the predictive power of gender (as well as other variables of interest). More information can be found in the flexmix R package documentation.
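As a rough Python analogue of this model-based idea (flexmix itself is an R package, and a full mixture-of-regressions is beyond this sketch), a plain Gaussian mixture can be fit with scikit-learn. All data and names below are illustrative, not from the original post:

```python
# Hedged sketch: model-based clustering with a Gaussian mixture, a Python
# analogue of the flexmix idea (flexmix additionally supports mixtures of
# regressions in R). Synthetic, illustrative data only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic groups: a binary "gender"-like column plus a continuous one.
X = np.vstack([
    np.column_stack([rng.integers(0, 2, 100), rng.normal(0, 1, 100)]),
    np.column_stack([rng.integers(0, 2, 100), rng.normal(4, 1, 100)]),
])

# Fit a two-component Gaussian mixture and read off soft cluster assignments.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
print(np.bincount(labels))  # cluster sizes
```

Because the mixture models both variables jointly, a well-separated continuous variable can still define the clusters even when the binary variable carries little information.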


Reply #2, ReneeBK, posted 2014-04-23 02:27:03
I want to assign different weights to the variables in my cluster analysis, but my program (Stata) doesn't seem to have an option for this, so I need to do it manually.

Imagine 4 variables A, B, C, D. The weights for those variables should be

w(A) = 50%
w(B) = 25%
w(C) = 10%
w(D) = 15%

I am wondering whether one of the following two approaches would actually do the trick:

1. First standardize all variables (e.g. by their range), then multiply each standardized variable by its weight, then run the cluster analysis.
2. First multiply all variables by their weights, then standardize them, then run the cluster analysis.

Or are both ideas complete nonsense?

The clustering algorithms I intend to try (three in total) are k-means, weighted-average linkage, and average linkage. I plan to use weighted-average linkage to determine a good number of clusters, which I then plug into k-means.
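A minimal sketch of approach 1 (since Stata lacks a built-in weighting option), using scikit-learn's KMeans in Python; the data and cluster count are illustrative, only the weights come from the question:

```python
# Hedged sketch of approach 1: range-standardize each variable, multiply by
# its weight, then cluster. Data are synthetic; weights are from the question.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Four variables A, B, C, D on deliberately different scales.
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 5.0])
weights = np.array([0.50, 0.25, 0.10, 0.15])

# 1. Standardize by range so every column spans [0, 1].
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# 2. Apply the weights: column j now contributes proportionally to w_j.
X_w = X_std * weights
# 3. Cluster the weighted data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_w)
```

The same weighted matrix `X_w` could be exported and fed to any distance-based linkage method, so the trick is not specific to k-means.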


Reply #3, ReneeBK, posted 2014-04-23 02:28:11
One way to assign a weight to a variable is by changing its scale. The trick works for the clustering algorithms you mention, viz. k-means, weighted-average linkage and average-linkage.

Kaufman, Leonard, and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (2005), page 11:

The choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985).

On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.

Abrahamowicz, M. (1985), The use of non-numerical a priori information for measuring dissimilarities, paper presented at the Fourth European Meeting of the Psychometric Society and the Classification Societies, 2-5 July, Cambridge (UK).

Friedman, H. P., and Rubin, J. (1967), On some invariant criteria for grouping data, J. Amer. Statist. Assoc., 62, 1159-1178.

Hardy, A., and Rasson, J. P. (1982), Une nouvelle approche des problèmes de classification automatique, Statist. Anal. Données, 7, 41-56.
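The book's point that measurement units act as implicit weights can be checked directly: multiplying a variable by w multiplies its contribution to the squared Euclidean distance by w², which is exactly what approach 1 exploits. A minimal numeric check (illustrative values only):

```python
# Numerical check of the quoted point: rescaling a variable changes its
# weight in Euclidean distances. Multiplying a coordinate by w multiplies
# its contribution to squared distance by w**2.
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])
w = 0.5  # down-weight the second variable

d2_plain = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2           # 4 + 9
d2_weighted = (x[0] - y[0]) ** 2 + (w * (x[1] - y[1])) ** 2  # 4 + 0.25 * 9

print(d2_plain, d2_weighted)
```

So choosing units (or, equivalently, multiplying standardized columns by weights) is a legitimate way to encode subject-matter importance into any distance-based clustering.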


Reply #4, ReneeBK, posted 2014-04-23 02:29:06
Yes, approach 1 is the right one; it corresponds to what Kaufman and Rousseeuw say in the passage quoted above. Approach 2 would be useless, as the standardization removes the weights :)

Franck Dernoncourt
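Why approach 2 is a no-op can be verified numerically: z-standardizing after multiplying by a positive weight w returns the same scores, because both the deviations from the mean and the standard deviation scale by w. A small sketch with illustrative data (the same cancellation holds for range standardization):

```python
# Numerical check that approach 2 is a no-op: standardizing AFTER weighting
# gives back the same standardized scores, since mean deviations and the
# standard deviation both scale by the (positive) weight w.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(5.0, 2.0, size=1000)  # illustrative variable
w = 0.25                             # illustrative weight

z = (a - a.mean()) / a.std()
z_after_weighting = (w * a - (w * a).mean()) / (w * a).std()

print(np.allclose(z, z_after_weighting))
```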

