Original poster: Lisrelchen

[Apache Spark] GMM

GMM


Spark GMM-master.zip (10.79 KB)


Gaussian Mixture Model Implementation in Pyspark

The GMM algorithm models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector, a covariance matrix, and a mixture weight. The probability that each point belongs to each cluster is computed along with the cluster statistics.
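For reference, the model described above can be written in standard textbook notation. The symbols below ($K$, $w_k$, $\mu_k$, $\Sigma_k$, $\gamma_k$) are not taken from the post and are introduced here only for illustration:

```latex
% Mixture density over K Gaussian components
p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1

% Probability that point x belongs to component k (its "responsibility")
\gamma_k(x) = \frac{w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
```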

This distributed implementation of GMM in PySpark estimates the parameters using the Expectation-Maximization (EM) algorithm and considers only a diagonal covariance matrix for each component.
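To make the EM idea concrete, here is a minimal single-machine sketch of one way to run EM for a diagonal-covariance mixture in plain Python. It is not the code from the attached library: the function names (`log_gauss_diag`, `e_step`, `m_step`), the `min_var` floor, and the toy data are all assumptions made for illustration. In a distributed version, the sums inside `m_step` would presumably be computed per partition and then combined.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance `var`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def e_step(points, weights, means, variances):
    """Return one responsibility row (summing to 1) per point."""
    resp = []
    for x in points:
        logs = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # subtract the max before exp for numerical stability
        unnorm = [math.exp(value - top) for value in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp

def m_step(points, resp, min_var=1e-6):
    """Re-estimate weights, means and per-dimension variances."""
    n, dim, k = len(points), len(points[0]), len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)  # effective number of points in component j
        mean = [sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                for d in range(dim)]
        var = [max(sum(r[j] * (x[d] - mean[d]) ** 2
                       for r, x in zip(resp, points)) / nj, min_var)
               for d in range(dim)]
        weights.append(nj / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances

# Toy data (made up): two well-separated groups of 2-D points.
points = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]]
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
for _ in range(5):
    resp = e_step(points, weights, means, variances)
    weights, means, variances = m_step(points, resp)
labels = [row.index(max(row)) for row in resp]  # hard cluster assignment
```

Keeping only per-dimension variances means each component stores `dim` numbers instead of a full `dim x dim` matrix, which is the trade-off a diagonal-covariance model makes.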

How to Run

There are two ways to run this code.

  • Using the library in your Python program.

You can train the GMM model by invoking the function GMMModel.trainGMM(data, k, n_iter, ct), where:

      data is an RDD (of dense or sparse vectors),
      k is the number of components/clusters,
      n_iter is the number of iterations (default 100),
      ct is the convergence threshold (default 1e-3).

To use this library in your program, download GMMModel.py and GMMClustering.py and add them as Python files along with your own user code:

```
wget https://raw.githubusercontent.com/FlytxtRnD/GMM/master/GMMModel.py
wget https://raw.githubusercontent.com/FlytxtRnD/GMM/master/GMMClustering.py

./bin/spark-submit --master <master> --py-files GMMModel.py,GMMclustering.py \
    <your-program.py> <input_file> <num_of_clusters> \
    [--n_iter <num_of_iterations>] [--ct <convergence_threshold>]
```

File names are case-sensitive on most systems, so the names passed to --py-files must match the downloaded files exactly.

The returned object "model" has the attributes **model.Means**, **model.Covars**, and **model.Weights**. To get the cluster labels and the responsibility matrix (membership values):

    responsibility_matrix, cluster_labels = GMMModel.resultPredict(model, data)
  • Running the example GMM clustering script.

    If you'd like to run our example program directly, also download the PyGMM.py file and invoke it with spark-submit.


    ```
    wget https://raw.githubusercontent.com/FlytxtRnD/GMM/master/PyGMM.py
    ./bin/spark-submit --master <master> --py-files GMMModel.py,GMMclustering.py \
        PyGMM.py <input_file> <num_of_clusters> \
        [--n_iter <num_of_iterations>] [--ct <convergence_threshold>]
    ```

    Here master is your Spark master URL, and the input file should contain comma-separated numeric values. Make sure you enter the full path to the downloaded files.
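As a rough illustration of the first option, a user program passed as `<your-program.py>` might look like the sketch below. Only `GMMModel.trainGMM`, `GMMModel.resultPredict`, the `model.Weights` attribute, and the defaults (100 iterations, threshold 1e-3) come from the post. The helper names `parse_line`, `build_parser`, and `main`, the app name, and the assumption that `GMMModel.py` exposes a name `GMMModel` importable as `from GMMModel import GMMModel` are hypothetical and should be checked against the downloaded files.

```python
import argparse

def parse_line(line):
    """Turn one comma-separated text line into a list of floats."""
    return [float(value) for value in line.strip().split(",")]

def build_parser():
    """Mirror the command line shown above: two positionals, two options."""
    parser = argparse.ArgumentParser(description="Hypothetical GMM driver")
    parser.add_argument("input_file")
    parser.add_argument("num_of_clusters", type=int)
    parser.add_argument("--n_iter", type=int, default=100)   # default from the post
    parser.add_argument("--ct", type=float, default=1e-3)    # default from the post
    return parser

def main(argv):
    args = build_parser().parse_args(argv)
    # Imported here so the helpers above work without Spark installed.
    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from GMMModel import GMMModel  # assumption: GMMModel.py exposes this name

    sc = SparkContext(appName="GMMExample")  # app name is arbitrary
    data = (sc.textFile(args.input_file)
              .map(lambda line: Vectors.dense(parse_line(line)))
              .cache())  # EM makes repeated passes over the data
    model = GMMModel.trainGMM(data, args.num_of_clusters, args.n_iter, args.ct)
    responsibility_matrix, cluster_labels = GMMModel.resultPredict(model, data)
    print(model.Weights)
    sc.stop()
    return model

# To use this with spark-submit, call main(sys.argv[1:]) at the bottom of the script.
```

The Spark imports sit inside `main` so that the parsing helpers can be exercised on their own; that is a convenience of this sketch, not a requirement of the library.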
Reply #1
MouJack007, posted 2017-4-18 07:05:49
Thanks to the original poster for sharing!
