OP: Nicolle

Six Clustering Algorithms Every Data Scientist Must Know


Nicolle (student-verified) posted on 2018-2-13 10:23:41:

Hidden content in this post:

Attachment: 数据科学家必须了解的六大聚类算法.pdf (2.18 MB)

In machine learning, unsupervised learning has long been a direction we pursue, and among its techniques, clustering algorithms are an effective means of discovering hidden structure and knowledge in data. Many applications today, such as Google News, use clustering as a core building block, exploiting large amounts of unlabeled data to build powerful topic clusters. This article introduces six mainstream methods, from the most basic K-means clustering to powerful density-based approaches; each has its own strengths and suitable scenarios, and their underlying ideas are not necessarily limited to clustering.

Starting from the simple and efficient K-means, the article then covers mean-shift clustering, density-based clustering, clustering with Gaussian mixtures and expectation-maximization, hierarchical clustering, and graph community detection for structured data. Beyond the basic ideas behind each implementation, it also discusses each algorithm's advantages and disadvantages to clarify its practical application scenarios.

Clustering is a machine learning technique that involves grouping data points. Given a set of data points, a clustering algorithm assigns each point to a specific group. In theory, points in the same group should have similar properties and/or features, while points in different groups should have very dissimilar ones. Clustering is an unsupervised learning method and a statistical data-analysis technique commonly used in many fields.
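As a minimal illustration of this grouping idea, the sketch below runs five of the algorithms discussed in the article on the same synthetic data using scikit-learn. The toy data and every parameter choice here are illustrative assumptions, not values from the attached PDF:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, MeanShift, DBSCAN, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture

    # Toy 2-D data with four loose groups (illustrative only).
    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

    # Every estimator follows the same construct-then-fit_predict pattern.
    for model in (KMeans(n_clusters=4, n_init=10),
                  MeanShift(),
                  DBSCAN(eps=0.5, min_samples=5),
                  GaussianMixture(n_components=4),
                  AgglomerativeClustering(n_clusters=4)):
        labels = model.fit_predict(X)
        n_found = len(np.unique(labels[labels >= 0]))  # drop DBSCAN's noise label (-1)
        print(type(model).__name__, "->", n_found, "clusters")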



Nicolle (student-verified) posted on 2018-2-13 10:24:18:
K-Means Clustering

K-Means is probably the most well-known clustering algorithm. It's taught in many introductory data science and machine learning classes, and it's easy to understand and implement in code.
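To back the "easy to implement" claim, here is a bare-bones NumPy sketch of the standard Lloyd iteration. It is an illustrative toy, not the article's code: initialization is plain random sampling, and empty clusters are not handled.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        # Bare-bones Lloyd's algorithm: alternate assignment and update steps.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(n_iters):
            # Assignment step: each point joins its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its assigned points.
            # (Empty clusters are not handled in this sketch.)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # stop once centers converge
                break
            centers = new_centers
        return labels, centers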

Nicolle (student-verified) posted on 2018-2-13 10:24:53:
Mean-Shift Clustering

Mean-shift clustering is a sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm, meaning the goal is to locate the center point of each group/class; it works by updating candidate center points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups.
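A minimal sketch of the procedure just described, assuming a flat (uniform) kernel and a hypothetical merge threshold of half the bandwidth; a production implementation would use sklearn.cluster.MeanShift instead:

    import numpy as np

    def mean_shift(X, bandwidth, n_iters=50):
        # Every point seeds a sliding window; each window repeatedly moves to
        # the mean of the points it currently covers (flat kernel).
        centers = X.astype(float).copy()
        for _ in range(n_iters):
            for i, c in enumerate(centers):
                in_window = X[np.linalg.norm(X - c, axis=1) < bandwidth]
                if len(in_window) > 0:
                    centers[i] = in_window.mean(axis=0)
        # Post-processing: merge near-duplicate windows into final centers.
        merged = []
        for c in centers:
            if all(np.linalg.norm(c - m) >= bandwidth / 2 for m in merged):
                merged.append(c)
        merged = np.array(merged)
        # Each point belongs to the group of its nearest surviving center.
        labels = np.linalg.norm(X[:, None] - merged[None], axis=2).argmin(axis=1)
        return labels, merged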

Nicolle (student-verified) posted on 2018-2-13 10:25:37:
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based clustering algorithm similar to mean-shift, but with a couple of notable advantages: it does not require the number of clusters to be specified in advance, and it explicitly flags outliers as noise.
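A short usage sketch with scikit-learn's DBSCAN on data that K-Means handles poorly; the eps and min_samples values are illustrative choices, not values from the article:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-moons: a non-convex shape K-Means handles poorly.
    X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    labels = db.labels_  # label -1 marks points classified as noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("clusters:", n_clusters, "| noise points:", int((labels == -1).sum()))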

Nicolle (student-verified) posted on 2018-2-13 10:27:43:
Expectation-Maximization (EM) Clustering Using Gaussian Mixture Models (GMM)

Gaussian Mixture Models (GMMs) give us more flexibility than K-Means. With GMMs we assume that the data points are Gaussian distributed; this is a less restrictive assumption than saying they are circular, which is what K-Means implicitly does by using only the mean. That way, we have two parameters to describe the shape of the clusters: the mean and the standard deviation. Taking an example in two dimensions, this means that the clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y directions). Thus, each Gaussian distribution is assigned to a single cluster.
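A minimal scikit-learn sketch of the idea: fitting a full-covariance GMM to deliberately stretched data recovers elliptical clusters and, unlike K-Means, also yields soft membership probabilities. The data and parameters are illustrative assumptions:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    # Stretch the blobs so the true clusters are elliptical, not circular.
    X, _ = make_blobs(n_samples=400, centers=3, random_state=1)
    X = X @ np.array([[1.6, 0.4], [0.2, 0.8]])  # anisotropic transform

    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1).fit(X)
    hard = gmm.predict(X)        # hard assignment: most likely Gaussian per point
    soft = gmm.predict_proba(X)  # soft assignment: membership probability per cluster
    print(gmm.means_.shape, gmm.covariances_.shape)  # (3, 2) and (3, 2, 2)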

Nicolle (student-verified) posted on 2018-2-13 10:28:19:
Agglomerative Hierarchical Clustering

Hierarchical clustering algorithms actually fall into two categories: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering, or HAC. This hierarchy of clusters is represented as a tree (or dendrogram): the root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample each. A short sketch follows below.
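A bottom-up sketch using SciPy's hierarchy routines; the toy data and the choice of Ward linkage are illustrative assumptions, not prescribed by the article:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=30, centers=3, random_state=2)

    # Build the full merge tree (dendrogram): each point starts as its own
    # cluster, and Ward's rule repeatedly merges the pair of clusters that
    # least increases within-cluster variance, until a single root remains.
    Z = linkage(X, method="ward")

    # Cut the tree to recover a flat clustering with three groups.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)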

hjtoh posted on 2018-2-13 10:31:13 from mobile, replying to Nicolle's post of 2018-2-13 10:23 (quoted content hidden by the author):

Great work, OP!

军旗飞扬 posted on 2018-2-13 10:40:28:

Thanks for sharing.

