OP: oliyiyi

Topological Analysis and Machine Learning: Friends or Enemies?


What is the interaction between Topological Data Analysis and Machine Learning? A case study shows how a TDA decomposition of the data space provides useful features for improving Machine Learning results.

By Harlan Sexton (Ayasdi).

People new to topological data analysis (TDA) often ask me some form of the question, "What’s the difference between Machine Learning and TDA?" It's a hard question to answer, in part because it depends on what you mean by Machine Learning (ML).

Wikipedia has this to say about ML:

"Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions."

At this level of generality one could argue that TDA is a form of ML, but I think most people who work in both of these areas would disagree.

Concrete examples of ML are rather more similar to one another than they are to any example of TDA, and the converse holds as well: examples of TDA look more like one another than like any example of ML.

In order to explain the differences between TDA and ML, and perhaps more importantly to demonstrate how and why TDA and ML play well together, I’ll make two deliberately oversimplified definitions and then illustrate the point with a real example.

One simple way to define ML would be any method that posits a parameterized model of data and learns the parameters from the data.

In the same vein we can define TDA as any method that uses as its model of data only a notion of 'similarity' between data points.

In this view, ML models are much more specific and detailed, and the success of the model depends on how well it fits the underlying data. The advantage is that when the data fits the model the results can be quite remarkable - apparent chaos is replaced by near perfect understanding.

The advantage for TDA is generality.

With TDA, any notion of similarity is enough to use the methodology. Conversely, with ML you need a strong notion of similarity (and more) to make any progress with any method at all.

For example, from a long list of first names it would be impossible to reliably predict height and weight. You need more.

A key element here is that algorithms derived from topology tend to be very, very forgiving of small errors - even if your notion of similarity is somehow flawed, as long as it is 'sort of right' TDA methods will usually yield something useful.

The generality of TDA methods has another advantage, one that remains in effect even when an ML method is a good fit: ML methods often create detailed internal state from which notions of similarity can be derived, making it possible to couple TDA with ML to get even more insight into a data set.

This all sounds great, but it's general enough (or, if you’re feeling less charitable, vague enough) that it might mean almost anything.

So let’s be specific.

A Random Forest™ classifier is an ensemble learning method that operates by constructing a multitude of decision trees during training and using 'majority rule' on that 'forest' (collection of decision trees) to classify non-training data.

While the details of the construction are quite interesting and clever, they aren't germane at the moment. All you need to recall about a Random Forest is that it operates on each data point by applying a collection of decision trees to the datum and returning a sequence of 'leaf nodes' (the leaf in the decision tree to which the input 'fell').

In normal operation each leaf in each tree has a class C associated with it, which is interpreted as "by the time a datum gets to this node in the tree I know it is overwhelmingly likely to be of class C." The Random Forest classifies by counting the 'leaf node class votes' from each tree and picking a winner. While highly effective on a broad range of data types, this approach drops a great deal of information on the floor.
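To make the 'leaf node' bookkeeping concrete, here is a minimal sketch using scikit-learn (not the implementation behind the original analysis; the toy data and all parameter choices are purely illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: 48 continuous features, echoing the drive data set below.
X, y = make_classification(n_samples=500, n_features=48, n_informative=10,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# apply() returns an (n_samples, n_trees) matrix; entry [i, t] is the index
# of the leaf in tree t into which sample i 'fell'.
leaves = forest.apply(X)
print(leaves.shape)  # (500, 100)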

If all you care about is the best guess for the class of a datum, then you don’t want to see that extra information, but sometimes you want more. This "extraneous" information can be made into a distance function by defining the distance between two data points as the number of trees in which their respective 'leaf nodes' differ.

This distance function is a perfectly good metric (it is, in fact, the Hamming metric on a transformation of the data set), and as such we can apply TDA to it.
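Continuing the sketch above, this leaf-disagreement metric can be computed directly from the leaf matrix. Note that scipy's 'hamming' metric returns the fraction of disagreeing coordinates, so we scale by the number of trees to get the count described above:

from scipy.spatial.distance import pdist, squareform

# Pairwise Hamming distances between rows of the leaf matrix: the fraction
# of trees in which two points land in different leaves, scaled to a count.
n_trees = leaves.shape[1]
rf_dist = squareform(pdist(leaves, metric="hamming")) * n_trees

print(rf_dist[0, 1])  # number of trees on which points 0 and 1 disagree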

As an example, let us look at a random 5,000 point sample drawn from:

https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

The data set is moderately complex, having 48 continuous features which appear to be unexplained measurements of signals in the electric current driving the motors. The data also includes a classification column with 11 possible values that describe different conditions (failure modes, perhaps?) for the drive components. One obvious thing to try is a Euclidean metric on the feature columns, and then to color the graph by classes. Since we don't know anything about the feature columns, the first thing to try is the neighborhood lenses. The result is a featureless blob:



[Figure: Topological network built with the Euclidean metric and neighborhood lenses, colored by class - a featureless blob.]

That’s disappointing.

Using some in-house debugging capabilities, I looked at the scatterplot of the neighborhood lenses and saw why the graph is so bad - it looks like a Christmas tree:



[Figure: Scatterplot of the Euclidean-metric neighborhood lenses, colored by class - the 'Christmas tree', with no class localization.]

There is apparently no localization of the classes in the Euclidean metric.
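The neighborhood lenses are part of Ayasdi's platform, so here is only a rough, hedged stand-in: a 2-D t-SNE embedding of the standardized features (an analogous neighbor-based projection, not the same construction), colored by class. The file name is an assumption about a local download of the UCI data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load the UCI file (49 whitespace-separated columns: 48 features + class)
# and draw a random 5,000-point sample.
data = np.loadtxt("Sensorless_drive_diagnosis.txt")
idx = np.random.default_rng(0).choice(len(data), size=5000, replace=False)
X, y = data[idx, :48], data[idx, 48]

# t-SNE on the standardized features, standing in for the Euclidean-metric
# neighborhood-lens scatterplot.
emb = TSNE(n_components=2, random_state=0).fit_transform(
    StandardScaler().fit_transform(X))
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab20")
plt.title("Euclidean-metric projection, colored by class")
plt.show()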

However, if you build a Random Forest on the data set, the classifier has a very small out-of-bag error, which is a strong indication that the classifier works very reliably.
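A minimal sketch of that check, again with scikit-learn and the X, y sample loaded above (oob_score=True makes the forest score itself on the training points each tree never saw):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)
print("out-of-bag error:", 1.0 - forest.oob_score_)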

So I tried making a graph using the Random Forest Hamming metric and the neighborhood lenses for that metric:



[Figure: Topological network built with the Random Forest Hamming metric and its neighborhood lenses, colored by class - the classes separate into distinct groups.]

That looks very good. Just to confirm, we also looked at the scatterplot of the neighborhood lenses, and the results are what the figure above suggests:



[Figure: Scatterplot of the Random Forest Hamming-metric neighborhood lenses, colored by class - distinct, well-localized groups.]

What is obvious from both the graph and the scatterplot is that the Random Forest 'sees' strong structure below the level of the classification, and that structure is revealed by TDA. The reason is that the Random Forest doesn't make efficient use of the "extraneous" data - but TDA does, and it benefits substantially from that information.
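For completeness, a sketch of the analogous stand-in picture for the Random Forest Hamming metric, reusing X, y, and forest from the sketches above and feeding the precomputed distances to the same 2-D embedding:

import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Leaf matrix and pairwise leaf-disagreement fractions on the 5,000 samples.
leaves = forest.apply(X)
rf_dist = squareform(pdist(leaves, metric="hamming"))

# Embed the precomputed Random Forest distances (init must be 'random'
# when metric='precomputed').
emb = TSNE(n_components=2, metric="precomputed", init="random",
           random_state=0).fit_transform(rf_dist)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab20")
plt.title("Random Forest Hamming-metric projection, colored by class")
plt.show()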

However, one might ask if that structure is illusory - perhaps an artifact of the algorithms we use somewhere in the system? In the case of this data set we can't really tell, since we don't know anything more about the groups.

Still, we have used the Random Forest metric on customer data to analyze possible failure modes for thousands of complex devices based on data collected during device burn-in. The classes were based on post-mortems done on devices returned to the manufacturer for various reasons (not all of which included failures).

What we found in this case was that the Random Forest metric did a good job of classifying the failures, and the character of the pictures we got was similar to those above. More importantly, we discovered that the distinct groups within a given failure mode sometimes had distinct causes.

The takeaway in both of these cases is that these causes would have been much more difficult to find without the further decomposition of the space that we got using TDA with the Random Forest.

The example we have just seen shows how TDA can work with a machine learning method to get better results than are possible by either technique individually.

This is what we mean when we say ML & TDA: Better Together.


Reply #1
masterxyq, posted 2017-1-29 16:53:23
Very good, thank you!