Original poster: oliyiyi

A comparison between PCA and hierarchical clustering

Graphical representations of high-dimensional data sets are the backbone of exploratory data analysis. We examine two of the most commonly used methods: heatmaps combined with hierarchical clustering, and principal component analysis (PCA).

By Charlotte Soneson, Qlucore.

Introduction

Graphical representations of high-dimensional data sets are the backbone of straightforward exploratory analysis and hypothesis generation. Within the life sciences, two of the most commonly used methods for this purpose are heatmaps combined with hierarchical clustering, and principal component analysis (PCA).

We will use the terminology ‘data set’ to describe the measured data. The data set consists of a number of samples for which a set of variables has been measured. All variables are measured for all samples.

Method

PCA creates a low-dimensional representation of the samples in a data set that is optimal in the sense that it retains as much of the variance in the original data set as possible. PCA also provides a variable representation that is directly connected to the sample representation, which allows the user to visually identify variables that are characteristic of specific sample groups. (Agglomerative) hierarchical clustering builds a tree-like structure (a dendrogram) in which the leaves are the individual objects (samples or variables). The algorithm successively pairs the two objects showing the highest degree of similarity; these are then collapsed into a pseudo-object (a cluster) and treated as a single object in all subsequent steps.
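As a rough illustration of the two methods described above, the following sketch runs both on a small synthetic data set. It is not Qlucore's implementation: the data, the choice of NumPy's SVD for PCA, and SciPy's average-linkage clustering with Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Hypothetical data set: 6 samples (rows) x 50 variables (columns).
# Samples 3-5 are shifted in the first 10 variables, forming a second group.
X = rng.normal(size=(6, 50))
X[3:, :10] += 5.0

# PCA via the singular value decomposition of the column-centred data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                      # sample representation, one row per sample
loadings = Vt.T                     # variable representation, one row per variable
explained = s**2 / np.sum(s**2)     # share of total variance per component

# Agglomerative clustering: each row of Z records one merge
# (cluster i, cluster j, merge height, size of the new cluster).
Z = linkage(X, method="average", metric="euclidean")

print("variance captured by PC1:", round(float(explained[0]), 2))
print("PC1 scores:", np.round(scores[:, 0], 1))
print("merge table:\n", np.round(Z, 2))
```

The rows of `scores` and `loadings` share the same component axes, which is what links the sample and variable representations. The last row of `Z` is the final merge that joins all samples into one cluster.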

Unsupervised

Both PCA and hierarchical clustering are unsupervised methods, meaning that no information about class membership or other response variables is used to obtain the graphical representation. This makes the methods suitable for exploratory data analysis, where the aim is hypothesis generation rather than hypothesis verification.

Comparison

The input to a hierarchical clustering algorithm consists of a measure of the similarity (or dissimilarity) between each pair of objects, and the choice of similarity measure can have a large effect on the result. The goal of the clustering algorithm is then to partition the objects into homogeneous groups, such that the within-group similarities are large compared to the between-group similarities. The principal components, on the other hand, are extracted to represent the patterns encoding the highest variance in the data set, not to maximize the separation between groups of samples directly. However, in many high-dimensional real-world data sets, the most dominant patterns, i.e. those captured by the first principal components, are those separating different subgroups of the samples from each other. In this case, the results from PCA and hierarchical clustering support similar interpretations.
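To make the effect of the similarity measure concrete, here is a small sketch with three made-up samples. The values and the use of SciPy's `linkage` with the `euclidean` and `correlation` metrics are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three hypothetical samples measured on five variables.
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],       # sample 0: rising profile, low level
    [10.0, 21.0, 29.0, 41.0, 50.0],  # sample 1: rising profile, high level
    [5.0, 4.0, 3.0, 2.0, 1.0],       # sample 2: falling profile, low level
])

def first_merge(metric):
    """Return the pair of samples joined in the first agglomeration step."""
    Z = linkage(X, method="average", metric=metric)
    return tuple(sorted(int(i) for i in Z[0, :2]))

print("euclidean:  ", first_merge("euclidean"))    # (0, 2): similar level
print("correlation:", first_merge("correlation"))  # (0, 1): similar profile shape
```

Euclidean distance groups samples 0 and 2 because their values are close in magnitude, whereas correlation distance groups samples 0 and 1 because their profiles rise together. The same data therefore yield different dendrograms depending on the measure.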

The hierarchical clustering dendrogram is often represented together with a heatmap that shows the entire data matrix, with entries color-coded according to their value. The columns of the data matrix are re-ordered according to the hierarchical clustering result, putting similar observation vectors close to each other. Depicting the data matrix in this way can help to find the variables that appear to be characteristic for each sample cluster. This can be compared to PCA, where the synchronized variable representation provides the variables that are most closely linked to any groups emerging in the sample representation.
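The column re-ordering described above can be sketched as follows. The synthetic matrix and the use of SciPy's `leaves_list` to read off the dendrogram's leaf order are assumptions for illustration; the actual drawing of the heatmap is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(2)

# Hypothetical data matrix: 30 variables (rows) x 6 samples (columns).
# Samples alternate between two groups, so similar columns start out apart.
groups = np.array([0, 1, 0, 1, 0, 1])
data = rng.normal(size=(30, 6))
data[:20, groups == 1] += 5.0

# Cluster the samples (columns), so pass the transposed matrix.
Z = linkage(data.T, method="average", metric="euclidean")
order = leaves_list(Z)              # left-to-right leaf order of the dendrogram

heatmap_matrix = data[:, order]     # columns rearranged for display
print("column order:", order)
print("group per column after reordering:", groups[order])
```

After re-ordering, samples from the same group sit next to each other, so the 20 shifted variables show up as a contiguous block when `heatmap_matrix` is color-coded.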

The heatmap depicts the observed data without any pre-processing. In contrast, since PCA represents the data set in only a few dimensions, some of the information in the data is filtered out in the process. The discarded information is associated with the weakest signals and the least correlated variables in the data set, and it can often be safely assumed that much of it corresponds to measurement errors and noise. This makes the patterns revealed using PCA cleaner and easier to interpret than those seen in the heatmap, albeit at the risk of excluding weak but important patterns.
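The filtering effect of PCA can be illustrated by keeping only the first few components and reconstructing the data from them. The sketch below uses made-up data with a known two-pattern signal plus noise; the sizes, the signal strength, and the choice of `k = 2` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_vars, k = 40, 100, 2

# Hypothetical data: a strong signal built from k patterns, plus unit noise.
signal = 3.0 * rng.normal(size=(n_samples, k)) @ rng.normal(size=(k, n_vars))
noisy = signal + rng.normal(size=(n_samples, n_vars))

# PCA on the centred data, then rebuild the matrix from the first k components.
mean = noisy.mean(axis=0)
U, s, Vt = np.linalg.svd(noisy - mean, full_matrices=False)
denoised = mean + (U[:, :k] * s[:k]) @ Vt[:k]

# Compare both versions with the noise-free signal (Frobenius norm).
err_raw = np.linalg.norm(noisy - signal)
err_pca = np.linalg.norm(denoised - signal)
print("error of raw data:          ", round(float(err_raw), 1))
print("error of k-component filter:", round(float(err_pca), 1))
```

In this setting the discarded components carry mostly noise, so the filtered matrix is closer to the underlying signal than the raw data. A weak pattern that fell outside the first `k` components would be discarded in the same way, which is the risk mentioned above.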

Another difference is that hierarchical clustering will always produce clusters, even if there is no strong signal in the data. PCA, in contrast, will in this case show a cloud-like plot with the samples evenly distributed.
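This difference can be checked on data that contain no structure at all. The following sketch uses pure Gaussian noise; the data size, the average-linkage method, and the request for three clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

# Hypothetical structureless data: 60 samples x 20 variables of pure noise.
noise = rng.normal(size=(60, 20))

# Hierarchical clustering still returns the requested number of groups.
Z = linkage(noise, method="average", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")
print("clusters found:", np.unique(labels).size)
print("cluster sizes:", np.bincount(labels)[1:])

# PCA on the same data: no component stands out from the rest.
centred = noise - noise.mean(axis=0)
s = np.linalg.svd(centred, compute_uv=False)
explained = s**2 / np.sum(s**2)
print("variance share of PC1 to PC3:", np.round(explained[:3], 2))
```

The clustering reports three groups although none exist, while the nearly flat variance shares from PCA indicate that no direction dominates, which corresponds to the evenly spread cloud in the sample plot.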

As discussed above, hierarchical clustering serves both as a visualization tool and as a partitioning tool (cutting the dendrogram at a specific height forms distinct sample groups). Qlucore Omics Explorer also provides another clustering algorithm, namely k-means clustering, which directly partitions the samples into a specified number of groups and thus, unlike hierarchical clustering, does not in itself provide a straightforward graphical representation of the results. However, the cluster labels can be used in conjunction with either heatmaps (by reordering the samples according to the label) or PCA (by assigning a color label to each sample, depending on its assigned class). The quality of the clusters can also be investigated using silhouette plots.
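Outside Qlucore Omics Explorer, the same workflow can be approximated with open-source tools. The sketch below cuts a dendrogram into three groups, runs k-means with three groups, and computes silhouette values; it assumes SciPy and scikit-learn are available and uses made-up, well-separated data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_samples, silhouette_score

rng = np.random.default_rng(4)

# Hypothetical data: three well-separated groups of 20 samples, 5 variables each.
centres = np.array([[0.0] * 5, [10.0] + [0.0] * 4, [0.0, 10.0] + [0.0] * 3])
X = np.vstack([c + rng.normal(size=(20, 5)) for c in centres])

# Partition 1: cut the dendrogram so that three groups remain.
Z = linkage(X, method="average", metric="euclidean")
hc_labels = fcluster(Z, t=3, criterion="maxclust")

# Partition 2: k-means with the number of groups specified up front.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agreement between the two partitions (1.0 means identical groupings).
agreement = adjusted_rand_score(hc_labels, km_labels)

# Silhouette values: near 1 for well-placed samples, near 0 or negative otherwise.
mean_sil = silhouette_score(X, km_labels)
per_sample_sil = silhouette_samples(X, km_labels)

print("agreement (adjusted Rand index):", round(float(agreement), 2))
print("mean silhouette:", round(float(mean_sil), 2))
```

The `km_labels` array is what would be used to reorder heatmap columns or to color samples in a PCA plot, and `per_sample_sil` holds the values a silhouette plot would display, sorted within each cluster.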

Fig. 1: Combined hierarchical clustering and heatmap and a 3D-sample representation obtained by PCA

Figure 1 shows a combined hierarchical clustering and heatmap (left) and a three-dimensional sample representation obtained by PCA (top right) for an excerpt from a data set of gene expression measurements from patients with acute lymphoblastic leukemia. Here, the dominating patterns in the data are those that discriminate between patients with different subtypes (represented by different colors). Hence, these groups are clearly visible in the PCA representation. Clusters corresponding to the subtypes also emerge from the hierarchical clustering. In this case, it is clear that the expression vectors (the columns of the heatmap) for samples within the same cluster are much more similar than expression vectors for samples from different clusters. It is also fairly straightforward to determine which variables are characteristic of each cluster. By studying the three-dimensional variable representation from PCA, the variables connected to each of the observed clusters can be inferred. The bottom right figure shows the variable representation, where the variables are colored according to their expression value in the T-ALL subgroup (red samples). The same expression pattern as seen in the heatmap is also visible in this variable plot.



Reply 1
Kamize, posted 2016-8-31 00:09:14
In reply to oliyiyi (posted 2016-8-27 16:53):
Thanks for sharing, this is great.

Reply 2
Kamize, posted 2016-9-1 00:55:30
In reply to oliyiyi (posted 2016-8-27 16:53):
Thanks for sharing.
