MDL Clustering: Unsupervised Attribute Ranking, Discretization, and Clustering

Posted by: oliyiyi | Category: Postgraduate entrance exams

MDL Clustering is a free software suite for unsupervised attribute ranking, discretization, and clustering based on the Minimum Description Length principle and built on the Weka Data Mining platform.

By Zdravko Markov, Central Connecticut State University.

The suite implements its learning algorithms as Java classes packaged in a JAR file, which can be downloaded or run directly online, provided that the Java runtime environment is installed. Running it starts the Weka command-line interface (Simple CLI), which gives access to the new algorithms as well as to the full functionality of the Weka data mining software. The data should be provided in the Weka ARFF or CSV format.



The basic idea of the algorithms is to use the Minimum Description Length (MDL) principle to compute the encoding length (complexity) of a data split, which then serves as an evaluation measure to rank attributes, discretize numeric attributes, or cluster data. Below is a brief description of the algorithms. More details, evaluations, and comparisons can be found in recent papers and a book chapter [1, 2, 3].

  • Unsupervised attribute ranking (MDLranker.class). For each attribute, the data are split using its values; that is, a clustering is created in which each cluster contains the instances that share one value of that attribute. The MDL of this clustering is used as the attribute's score, on the assumption that the best attribute minimizes it.
  • Unsupervised discretization (MDLdiscretize.class). Numeric attributes are discretized by splitting the range of their values into two intervals, determined by the breakpoint that minimizes the MDL of the resulting data split.
  • Hierarchical clustering (MDLcluster.class). The algorithm starts with the data split produced by the attribute that minimizes MDL and recursively applies the same procedure to the resulting splits, thus generating a hierarchical clustering tree similar to a decision tree. Tree growth is controlled by a parameter that evaluates the information compression at each node, computed as the difference between the code length of the data at the current node and the MDL of the attribute that produces the data split at that node. If the compression falls below a specified cutoff value, the tree stops growing at that point and a leaf node is created.
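The three procedures above can be sketched in a few lines of Python. This is only an illustration and not the suite's Java code: the two-part encoding in `code_length` is a simple one assumed for the example (the measure actually used is defined in [1, 2, 3]), and every function name and the toy data are hypothetical.

```python
from collections import defaultdict
from math import comb, log2

def count_pairs(rows):
    """Number of distinct (attribute index, value) pairs occurring in rows."""
    return len({(j, v) for row in rows for j, v in enumerate(row)})

def code_length(rows, total_pairs):
    """ASSUMED two-part code for one cluster: bits to name which pairs occur
    in it, plus bits to pick each instance's m pairs out of those."""
    m = len(rows[0])
    k_i = count_pairs(rows)
    return log2(comb(total_pairs, k_i)) + len(rows) * log2(comb(k_i, m))

def split(rows, attr):
    """Group rows by their value for one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row)
    return groups

def mdl_of_split(rows, attr, total_pairs):
    """Total code length of the clustering induced by one attribute."""
    return sum(code_length(g, total_pairs) for g in split(rows, attr).values())

def rank_attributes(rows):
    """Attribute indices sorted from lowest (best) to highest MDL score."""
    total = count_pairs(rows)
    return sorted(range(len(rows[0])), key=lambda a: mdl_of_split(rows, a, total))

def best_breakpoint(rows, attr):
    """Binary discretization: the midpoint whose two-interval split has minimal MDL.
    Expects rows as tuples and at least two distinct values for attr."""
    values = sorted({row[attr] for row in rows})
    def score(bp):
        binned = [row[:attr] + (row[attr] > bp,) + row[attr + 1:] for row in rows]
        return mdl_of_split(binned, attr, count_pairs(binned))
    return min(((a + b) / 2 for a, b in zip(values, values[1:])), key=score)

def cluster(rows, total_pairs, cutoff):
    """Recursive splitting; a leaf is a list of rows, an inner node is (attr, children)."""
    usable = [a for a in range(len(rows[0])) if len({r[a] for r in rows}) > 1]
    if not usable:
        return rows
    best = min(usable, key=lambda a: mdl_of_split(rows, a, total_pairs))
    compression = code_length(rows, total_pairs) - mdl_of_split(rows, best, total_pairs)
    if compression < cutoff:
        return rows  # too little compression: stop and create a leaf
    return best, {v: cluster(g, total_pairs, cutoff) for v, g in split(rows, best).items()}

# Toy data: attributes 0 and 1 agree on every row, attribute 2 is unrelated to them.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(rank_attributes(rows))
tree = cluster(rows, count_pairs(rows), cutoff=0.0)
print(tree[0], sorted(tree[1]))
```

On this toy data the unrelated attribute ranks last, and the tree splits once on one of the two correlated attributes before the compression drops below the cutoff.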

The algorithms' computational complexity is linear in the number of data instances and quadratic in the total number of distinct attribute-values in the data. In practice, however, the cost is substantially reduced by an efficient implementation that uses bit-level parallelism. This makes the suite suitable not only for processing a large number of instances, but also for a large number of attributes, which is typical in text and web mining.
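The article does not say how the suite applies bit-level parallelism. One common approach, assumed here purely for illustration, is to store for each attribute-value a bitmask over the instances, so that testing whether a value occurs in a cluster becomes a single bitwise AND. The Python sketch below uses arbitrary-precision integers as bitsets; all names are hypothetical.

```python
def build_value_masks(rows):
    """Map each (attribute index, value) pair to a bitmask of the instances that have it."""
    masks = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            masks[(j, value)] = masks.get((j, value), 0) | (1 << i)
    return masks

def pairs_in_cluster(masks, cluster_mask):
    """Count attribute-value pairs that occur in at least one instance of the cluster."""
    return sum(1 for mask in masks.values() if mask & cluster_mask)

def cluster_size(cluster_mask):
    """Number of instances in the cluster (number of set bits)."""
    return bin(cluster_mask).count("1")

rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
masks = build_value_masks(rows)
cluster_mask = masks[(0, 0)]  # the cluster of instances with attribute 0 equal to 0
print(pairs_in_cluster(masks, cluster_mask), cluster_size(cluster_mask))
```

With this layout, each membership test handles a whole machine word of instances at once, which is the kind of saving bit-level parallelism provides.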

In addition to the above algorithms the suite also includes a utility for creating data files from text or web documents and a lab project for web document clustering illustrating the basic steps of document collection, creating the vector space model, data preprocessing, clustering and attribute selection.
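The vector space step can be illustrated with a short Python sketch that turns raw text into a binary term-document table written in ARFF syntax. It is not the suite's own utility; the function name `to_binary_arff`, the simple tokenizer, and the sample documents are assumptions made for the example.

```python
import re

def to_binary_arff(docs, relation="docs"):
    """Build a binary vector space model (term present = 1) and render it as ARFF text."""
    # Tokenize: lowercase alphabetic words only (an assumed, deliberately simple rule).
    tokenized = [set(re.findall(r"[a-z]+", text.lower())) for text in docs]
    vocab = sorted(set().union(*tokenized))
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {term} {{0,1}}" for term in vocab]
    lines += ["", "@data"]
    for terms in tokenized:
        lines.append(",".join("1" if term in terms else "0" for term in vocab))
    return "\n".join(lines)

docs = ["Trade deficit widens", "Interest rate rises", "trade and rate"]
print(to_binary_arff(docs))
```

Each document becomes one data row with one nominal {0,1} attribute per term, which is the kind of binary, sparse data the article says the algorithms handle well.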

A set of popular benchmark data sets is provided for experiments and comparisons. Below are some examples of running the MDL clustering algorithm on a 3-class version of the Reuters data in the Weka command-line interface.

> java MDLcluster data/reuters-3class.arff 280000
trade=0 (584469.05)
  rate=0 (277543.73) [339,18,30] money
  rate=1 (259602.51) [168,0,177] interest
trade=1 (206999.39) [101,301,12] trade

The above clusters are explicitly described by attribute values, which makes the clustering models easier to understand and use. The tree has three clusters (exactly the number of classes) and identifies the two most important attributes in the data. If we are interested in the structure of these clusters, we may lower the compression threshold and obtain the following tree, which shows another important attribute (market) that splits the “interest” cluster (rate=1) in two.

> java MDLcluster data/reuters-3class.arff 250000
trade=0 (584469.05)
  rate=0 (277543.73)
    mln=0 (160192.77) [178,10,28] money
    mln=1 (129134.46) [161,8,2] money
  rate=1 (259602.51)
    market=0 (117196.81) [66,0,117] interest
    market=1 (107679.77) [102,0,60] money
trade=1 (206999.39) [101,301,12] trade

The clustering trees produced by MDLcluster in an unsupervised manner are very similar to the decision trees created by supervised learning algorithms. For example, the following tree is produced by Weka's J48 algorithm.

> java weka.classifiers.trees.J48 -t data/reuters-3class.arff -M 150
trade = 0
|   rate = 0: money (387.0/48.0)
|   rate = 1
|   |   market = 0: interest (183.0/66.0)
|   |   market = 1: money (162.0/60.0)
trade = 1: trade (414.0/113.0)

This similarity indicates that the MDL evaluation measure can identify the important attributes in the data without using class information. A more detailed study [1] shows that MDL unsupervised attribute ranking performs comparably with supervised ranking based on information gain (used by the decision tree learning algorithm). Another empirical study [2] shows that the MDL clustering algorithm compares favorably with k-means and EM on popular benchmark data and performs particularly well on binary and sparse data (e.g. text and web documents).
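For comparison, information gain, the supervised measure mentioned above, is the reduction in class entropy obtained by splitting the data on an attribute. The Python sketch below follows that standard definition; the function names and the toy data are illustrative and are not taken from Weka or from the suite.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Class entropy before the split minus the weighted entropy after it."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: attribute 0 determines the class, attribute 2 carries no class information.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = ["a", "a", "b", "b"]
for attr in range(3):
    print(attr, information_gain(rows, labels, attr))
```

Unlike the MDL score, this measure needs the class labels, which is the distinction the comparison in [1] rests on.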

References

  [1] Zdravko Markov. MDL-based Unsupervised Attribute Ranking. Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference (FLAIRS-26), St. Pete Beach, Florida, USA, May 22-24, 2013, AAAI Press, pp. 444-449. http://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS13/paper/view/5845/6115
  [2] Zdravko Markov. MDL-Based Hierarchical Clustering. Proceedings of the IEEE 14th International Conference on Machine Learning and Applications (ICMLA 2015), Miami, Florida, USA, December 9-11, 2015, pp. 464-467.
  [3] Zdravko Markov and Daniel T. Larose. MDL-Based Model and Feature Evaluation. Chapter 4 of Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, April 2007, ISBN: 978-0-471-66655-4.

Bio: Zdravko Markov is a professor of Computer Science at Central Connecticut State University, where he teaches courses in Programming, Computer Architecture, Artificial Intelligence, Machine Learning, and Data and Web Mining. His research area is Artificial Intelligence with a focus on Machine Learning and Data and Web Mining, where he is developing software and project-based frameworks for teaching core AI topics. Dr. Markov has published 4 books and more than 60 research papers in conference proceedings and journals. His most recent book (co-authored with Daniel Larose) is "Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage", published by Wiley in 2007.


Forum thread for this article: https://bbs.pinggu.org/thread-4790144-1-1.html
