OP: oliyiyi

Topological Analysis and Machine Learning: Friends or Enemies?


What is the interaction between Topological Data Analysis and Machine Learning? A case study shows how a TDA decomposition of the data space provides useful features for improving Machine Learning results.

By Harlan Sexton (Ayasdi).

People new to topological data analysis (TDA) often ask me some form of the question, "What’s the difference between Machine Learning and TDA?" It's a hard question to answer, in part because it depends on what you mean by Machine Learning (ML).

Wikipedia has this to say about ML:

"Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions."

At this level of generality one could argue that TDA is a form of ML, but I think most people who work in both of these areas would disagree.

Concrete examples of ML are rather more similar to one another than they are to any example of TDA, and the converse holds as well: examples of TDA look more like one another than like any example of ML.

In order to explain the differences between TDA and ML, and perhaps more importantly to demonstrate how and why TDA and ML play well together, I’ll make two deliberately oversimplified definitions and then illustrate the point with a real example.

One simple way to define ML would be any method that posits a parameterized model of data and learns the parameters from the data.

In the same vein we can define TDA as any method that uses as its model of data only a notion of 'similarity' between data points.

In this view, ML models are much more specific and detailed, and the success of the model depends on how well it fits the underlying data. The advantage is that when the data fits the model the results can be quite remarkable - apparent chaos is replaced by near perfect understanding.

The advantage for TDA is generality.

With TDA, any notion of similarity is enough to use the methodology. Conversely, with ML you need a strong notion of similarity (and more) to make any progress with any method at all.

For example, from a long list of first names it would be impossible to reliably predict height and weight. You need more.

A key element here is that algorithms derived from topology tend to be very, very forgiving of small errors - even if your notion of similarity is somehow flawed, as long as it is 'sort of right' TDA methods will usually yield something useful.

The generality of TDA methods has another advantage, one that remains in effect even when an ML method is a good fit: ML methods often create detailed internal state from which notions of similarity can be derived, making it possible to couple TDA with ML to get even more insight into a data set.

This all sounds great, but it's general enough (or, if you’re feeling less charitable, vague enough) that it might mean almost anything.

So let’s be specific.

A Random Forest™ classifier is an ensemble learning method that operates by constructing a multitude of decision trees during training and using 'majority rule' on that 'forest' (collection of decision trees) to classify non-training data.

While the details of the construction are quite interesting and clever, they aren't germane at the moment. All you need to recall about a Random Forest is that it operates on each data point by applying a collection of decision trees to the datum and returning a sequence of 'leaf nodes' (the leaf in the decision tree to which the input 'fell').

In normal operation each leaf in each tree has a class C associated with it, which is interpreted as "by the time a datum gets to this node in the tree I know it is overwhelmingly likely to be of class C." The Random Forest classifies by counting the 'leaf node class votes' from each tree and picking a winner. While highly effective on a broad range of data types, this approach drops a great deal of information on the floor.
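To make the 'leaf node' bookkeeping concrete, here is a minimal sketch using scikit-learn (not the implementation behind the original analysis; the toy data and all parameter choices are purely illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: 48 continuous features, echoing the drive data set below.
X, y = make_classification(n_samples=500, n_features=48, n_informative=10,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# apply() returns an (n_samples, n_trees) matrix; entry [i, t] is the index
# of the leaf in tree t into which sample i 'fell'.
leaves = forest.apply(X)
print(leaves.shape)  # (500, 100)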

If all you care about is the best guess for the class of a datum, then you don’t want to see that extra information, but sometimes you want more. This "extraneous" information can be made into a distance function by defining the distance between two data points as the number of trees in which their respective 'leaf nodes' differ.

This distance function is a perfectly good metric (it is, in fact, the Hamming metric on a transformation of the data set), and as such we can apply TDA to it.
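Continuing the sketch above, this leaf-disagreement metric can be computed directly from the leaf matrix. Note that scipy's 'hamming' metric returns the fraction of disagreeing coordinates, so we scale by the number of trees to get the count described above:

from scipy.spatial.distance import pdist, squareform

# Pairwise Hamming distances between rows of the leaf matrix: the fraction
# of trees in which two points land in different leaves, scaled to a count.
n_trees = leaves.shape[1]
rf_dist = squareform(pdist(leaves, metric="hamming")) * n_trees

print(rf_dist[0, 1])  # number of trees on which points 0 and 1 disagree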

As an example, let us look at a random 5,000 point sample drawn from:

https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

The data set is moderately complex, having 48 continuous features which appear to be unexplained measurements of signals in the electric current driving the motors. The data also includes a classification column with 11 possible values that describe different conditions (failure modes, perhaps?) for the drive components. One obvious thing to try is a Euclidean metric on the feature columns, and then to color the graph by classes. Since we don't know anything about the feature columns, the first thing to try is the neighborhood lenses. The result is a featureless blob:



[Figure: Topological network built with the Euclidean metric and neighborhood lenses, colored by class - a featureless blob.]

That’s disappointing.

Using some in-house debugging capabilities, I looked at the scatterplot of the neighborhood lenses and saw why the graph is so bad - it looks like a Christmas tree:



[Figure: Scatterplot of the Euclidean-metric neighborhood lenses, colored by class - the 'Christmas tree', with no class localization.]

There is apparently no localization of the classes in the Euclidean metric.
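The neighborhood lenses are part of Ayasdi's platform, so here is only a rough, hedged stand-in: a 2-D t-SNE embedding of the standardized features (an analogous neighbor-based projection, not the same construction), colored by class. The file name is an assumption about a local download of the UCI data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load the UCI file (49 whitespace-separated columns: 48 features + class)
# and draw a random 5,000-point sample.
data = np.loadtxt("Sensorless_drive_diagnosis.txt")
idx = np.random.default_rng(0).choice(len(data), size=5000, replace=False)
X, y = data[idx, :48], data[idx, 48]

# t-SNE on the standardized features, standing in for the Euclidean-metric
# neighborhood-lens scatterplot.
emb = TSNE(n_components=2, random_state=0).fit_transform(
    StandardScaler().fit_transform(X))
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab20")
plt.title("Euclidean-metric projection, colored by class")
plt.show()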

However, if you build a Random Forest on the data set, the classifier has a very small out-of-bag error, which is a strong indication that the classifier works very reliably.
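A minimal sketch of that check, again with scikit-learn and the X, y sample loaded above (oob_score=True makes the forest score itself on the training points each tree never saw):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)
print("out-of-bag error:", 1.0 - forest.oob_score_)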

So I tried making a graph using the Random Forest Hamming metric and the neighborhood lenses for that metric:



[Figure: Topological network built with the Random Forest Hamming metric and its neighborhood lenses, colored by class - the classes separate into distinct groups.]

That looks very good. Just to confirm, we also looked at the scatterplot of the neighborhood lenses, and the results are what the figure above suggests:



[Figure: Scatterplot of the Random Forest Hamming-metric neighborhood lenses, colored by class - distinct, well-localized groups.]

What is obvious from both the graph and the scatterplot is that the Random Forest 'sees' strong structure below the level of the classification, and that structure is revealed by TDA. The reason is that the Random Forest doesn't make efficient use of the "extraneous" data - but TDA does, and it benefits substantially from that information.
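For completeness, a sketch of the analogous stand-in picture for the Random Forest Hamming metric, reusing X, y, and forest from the sketches above and feeding the precomputed distances to the same 2-D embedding:

import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Leaf matrix and pairwise leaf-disagreement fractions on the 5,000 samples.
leaves = forest.apply(X)
rf_dist = squareform(pdist(leaves, metric="hamming"))

# Embed the precomputed Random Forest distances (init must be 'random'
# when metric='precomputed').
emb = TSNE(n_components=2, metric="precomputed", init="random",
           random_state=0).fit_transform(rf_dist)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab20")
plt.title("Random Forest Hamming-metric projection, colored by class")
plt.show()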

However, one might ask if that structure is illusory - perhaps an artifact of the algorithms we use somewhere in the system? In the case of this data set we can't really tell, since we don't know anything more about the groups.

Still, we have used the Random Forest metric on customer data to analyze possible failure modes for thousands of complex devices based on data collected during device burn-in. The classes were based on post-mortems done on devices returned to the manufacturer for various reasons (not all of which included failures).

What we found in this case was that the Random Forest metric did a good job of classifying the failures, and the character of the pictures we got was similar to those above. More importantly, we discovered that the distinct groups within a given failure mode sometimes had distinct causes.

The takeaway in both of these cases is that these causes would have been much more difficult to find without the further decomposition of the space that we got using TDA with the Random Forest.

The example we have just seen shows how TDA can work with a machine learning method to get better results than are possible by either technique individually.

This is what we mean when we say ML & TDA: Better Together.


Reply #1
masterxyq, posted 2017-1-29 16:53:23
Very good, thank you!