OP: oliyiyi

State-of-the-Art Machine Learning Automation with HDT

版主

泰斗

0%

还不是VIP/贵宾

-

TA的文库  其他...

计量文库

威望
7
论坛币
271951 个
通用积分
31269.3519
学术水平
1435 点
热心指数
1554 点
信用等级
1345 点
经验
383775 点
帖子
9598
精华
66
在线时间
5468 小时
注册时间
2007-5-21
最后登录
2024-4-18

初级学术勋章 初级热心勋章 初级信用勋章 中级信用勋章 中级学术勋章 中级热心勋章 高级热心勋章 高级学术勋章 高级信用勋章 特级热心勋章 特级学术勋章 特级信用勋章

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

本帖隐藏的内容

In this article, we discuss a general machine learning technique for making predictions or scoring transactional data, applicable to very big, streaming data. This hybrid technique combines different algorithms to boost accuracy, outperforming each algorithm taken separately, yet it is simple enough to be reliably automated. It is illustrated in the context of predicting the performance of articles published in media outlets or blogs, and has been used by the author to build an AI (artificial intelligence) system to detect articles worth curating, as well as to automatically schedule tweets and other postings on social networks for maximum impact, with the goal of eventually fully automating digital publishing. This application is broad enough that the methodology can be applied to most NLP (natural language processing) contexts with large amounts of unstructured data. The results obtained in our particular case study are also very interesting.




Figure 1: HDT 1.0. Here we describe HDT 2.0.


The algorithmic framework described here applies to any data set, text or not, with quantitative variables, non-quantitative variables (gender, race), or a mix of both. It consists of several components; we discuss in detail those that are new and original, while the other, non-original components are briefly mentioned, with references provided for further reading. No deep technical expertise and no mathematical knowledge are required to understand the concepts and methodology described here. The methodology, though state-of-the-art, is simple enough that it can even be implemented in Excel for small data sets (up to one million observations).

1. The Problem

Rather than first presenting a general, abstract framework and then showing how it applies to a specific problem (case study), we decided to proceed the other way around, as we believe this will help the reader better understand our methodology. We then generalize to any kind of data set.

In its simplest form, our particular problem consists of analyzing historical data about articles and blog posts, to identify features (also called metrics or variables) that, when combined, are good predictors of article popularity, and to build a system that can predict the popularity of an article before it gets published. The goal is to select the right mix of relevant articles to publish, to increase web traffic, and thus advertising dollars, for a niche digital publisher.

As in any similar problem, the historical data is called the training set, and it is split into test data and control data for cross-validation purposes, to avoid over-fitting. The features are selected to maximize some measure of predictive power, as described here. All of this is (so far) standard practice; the reader not familiar with these concepts can Google the keywords introduced in this paragraph. In our particular case, we use our domain expertise to identify great features. These features are pretty generic and apply to numerous NLP contexts, so you can re-use them for your own data sets.

Feature Selection and Best Practices

One caveat is that some metrics are very sensitive to distortion. In our case, the response (that is, what we are trying to predict, also called the dependent variable by statisticians) is the traffic volume. It can be measured in page views, unique page views, or the number of users who read the article. Page views can easily be manipulated, and the number is inflated by web robots, especially for articles that have little traffic. So instead, we chose "unique page views", a more robust metric available through Google Analytics. Also, older articles have accumulated more page views over time, so we need to correct for this effect. Correcting for time is explained in this article. Here we used a much simpler approach: focusing on articles from the most recent, big channel (a time window of about two years), and taking the logarithm of unique page views (denoted as pv in the source code in the last section).

Taking the logarithm not only smooths out the effect of time and web robots, but it also makes perfect sense because the page view distribution is highly skewed -- well modeled by a Zipf distribution -- with a few incredibly popular (viral) articles and a large number of articles with average traffic: it is a bit like the income distribution.
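As a minimal sketch of this transformation (assuming unique page view counts pulled from Google Analytics; the numbers are purely illustrative):

import math

def response(unique_page_views):
    # pv: log of unique page views; compresses the heavy right tail (Zipf-like)
    # and dampens the effect of web-robot inflation and article age
    return math.log(unique_page_views)

print(round(response(25000), 2))   # viral article: pv ~ 10.13
print(round(response(900), 2))     # average article: pv ~ 6.8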

As for selecting the features, we have two kinds of metrics that we can choose as predictors:

1. Metrics based on the article title, easy to compute:

  • Keywords found in the title
  • Article category (blog, event, forum question)
  • Channel
  • Creation date
  • Title contains numbers?
  • Title is a question?
  • Title contains special characters?
  • Length of title

2. Metrics based on the article body, more difficult to compute:

  • Size of article
  • Does it contain pictures?
  • Keywords found in body
  • Author (and author popularity)
  • First few words

Despite focusing only on a subset of features associated with the article title, we were able to get very interesting, actionable insights; we only used title keywords, and whether or not the posting is a blog. Keep in mind that the methodology used here takes into account all potential key-value combinations, where a key is a subset of features and a value is the corresponding set of feature values: for instance key = (keyword_1, keyword_2, article category) and value = ("Python", "tutorial", "Blog"). So it is important to appropriately bin the variables when turning them into features, to prevent the number of key-value pairs from exploding. Another mechanism, described further down in this article, is also used to keep the key-value database, stored as a hash table or associative array, manageable. Finally, it can easily be implemented in a distributed environment (e.g. Hadoop).

Due to the analogy with decision trees, a key-value is also called a node, and plays the same role as a node in a decision tree.
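To make this concrete, here is a minimal Python sketch of one entry of such a key-value (node) table, using the (R, Python, Blog) node discussed in the example of Section 2; the field names are illustrative:

node_table = {
    # key = (keyword_1, keyword_2, article category); value = node summary statistics
    ("r", "python", "Blog"): {"count": 6, "avg_pv": 8.52, "min_pv": 7.41, "max_pv": 10.45},
}

The sketch in the next section shows how such a table can be populated from the training data.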

2. Methodology and Solution

As we have seen in the previous section, the problem consists of predicting pv, the logarithm of unique page views for an article (over some time period), as a function of keywords found in the title, and whether the article in question is a blog or not.

In order to do so, we created lists of all one-token and two-token keywords found in the titles, as well as the blog status, after cleaning the titles and eliminating stop words such as "that", "and" or "the", which have no impact on the predictions. We were also careful not to eliminate all keywords made up of one or two letters: the one-letter keyword "R", corresponding to the programming language R, has high predictive power.

For each element in our lists, we recorded the frequency and traffic popularity. More precisely, for each key-value pair, we recorded the number of articles (titles, actually) associated with it, as well as the average, minimum and maximum pv across these articles.
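A minimal sketch of this bookkeeping in Python, assuming each observation is a (title, is_blog, pv) triple with the title already cleaned; names such as tokens() and update() are hypothetical:

from collections import defaultdict

# node table: key = (keyword(s), article category), value = running statistics
stats = defaultdict(lambda: {"count": 0, "sum_pv": 0.0,
                             "min_pv": float("inf"), "max_pv": float("-inf")})

def tokens(title):
    # all one-token and two-token (adjacent) keywords from a cleaned title
    words = title.lower().split()
    return [(w,) for w in words] + [(words[i], words[i + 1]) for i in range(len(words) - 1)]

def update(title, is_blog, pv):
    category = "Blog" if is_blog else "Other"
    for kw in tokens(title):
        key = kw + (category,)            # a key-value pair, i.e. a node
        s = stats[key]
        s["count"] += 1
        s["sum_pv"] += pv                 # average pv = sum_pv / count
        s["min_pv"] = min(s["min_pv"], pv)
        s["max_pv"] = max(s["max_pv"], pv)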

Example

For instance, the element or key-value (keyword_1 = "R", keyword_2 = "Python", article = "Blog") is associated with 6 articles, and has the following statistics: avg pv = 8.52, min pv = 7.41, and max pv = 10.45.

Since the average pv across all articles is equal to 6.83, this specific key-value pair (also called a node) generates exp(8.52 - 6.83) = 5.42 times more traffic than an average article. It is thus a great node. Even the worst article among the 6 articles belonging to this node, with a pv of 7.41, outperforms the average article across all nodes. So not only is this a great node, it is also a stable one. Some nodes have far larger volatility, for instance when one of the keywords has different meanings, such as the word "training" in "training deep learning" (training set) versus "deep learning training" (courses).
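In formula form (with pv on the log scale), the traffic lift of a node relative to an average article is

\mathrm{lift}(\mathrm{node}) = \exp\left(\overline{pv}_{\mathrm{node}} - \overline{pv}_{\mathrm{all}}\right) = \exp(8.52 - 6.83) \approx 5.42.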

Hidden decision trees (HDT) revisited

Note that here, the nodes are overlapping, allowing considerable flexibility. In particular, nodes with two keywords are sub-nodes of nodes with one keyword. A previous version of this technique, described here, did not consider overlapping nodes. Also, with highly granular features, the number of nodes explodes exponentially. A solution to this problem (a code sketch follows the list below) consists of

  • shuffling the observations
  • working with nodes built on no more than 4 or 5 features
  • proper binning
  • visiting the observations sequentially (after the shuffle) and, every one million observations, deleting nodes that contain only one observation
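A minimal Python sketch of this mechanism, reusing stats and update() from the earlier sketch; the binning and the cap on the number of features per key are assumed to happen when the keys are built:

import random

def build_node_table(observations, prune_every=1_000_000):
    # observations: list of (title, is_blog, pv) triples
    random.shuffle(observations)                       # shuffle first
    for i, (title, is_blog, pv) in enumerate(observations, start=1):
        update(title, is_blog, pv)
        if i % prune_every == 0:                       # periodic pruning pass
            for key in [k for k, s in stats.items() if s["count"] == 1]:
                del stats[key]                         # drop single-observation nodes
    return stats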

The general idea behind this technique is to group articles into buckets that are large enough to provide sound predictions, without explicitly building decision trees. Not only are the nodes simple and easy to interpret, but unstable nodes are also easy to detect and discard. There is no splitting/pruning involved as with classical decision trees, making this methodology simple and robust, and thus fit for artificial intelligence (automated processing).





General framework

Whether you are dealing with predicting the popularity of an article, or the risk for a client to default on a loan, the basic methodology is identical. It involves training sets, cross-validation, feature selection, binning, and populating hash tables of key-value pairs (referred to here as the nodes).

When you process a new observation, you check which node(s) it belongs to. If the best node it belongs to is stable and not too small, you use it to predict the future performance or value of your observation, or to score the transaction if you are dealing with transactional data such as credit card transactions. In our example, if the performance metric (the average pv in the node in question) is significantly above the global average, and other constraints are met (the node is not too small, and the minimum pv in the node is not too low, to guarantee stability), then we classify the observation as good, just like the node it belongs to. In our case, the observation is a potential article.
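Here is a minimal Python sketch of that decision, reusing tokens() and stats from the earlier sketches; the size and stability thresholds below are illustrative, not the calibrated values of Section 3:

def score(title, is_blog, global_avg_pv, min_count=5, min_pv_floor=6.9, min_lift=0.5):
    # returns "good", "bad", or "unclassified" based on the best usable node
    category = "Blog" if is_blog else "Other"
    best_avg, best_node = None, None
    for kw in tokens(title):
        node = stats.get(kw + (category,))
        if node is None or node["count"] < min_count:   # missing or too-small node
            continue
        avg_pv = node["sum_pv"] / node["count"]
        if best_avg is None or avg_pv > best_avg:
            best_avg, best_node = avg_pv, node
    if best_node is None:
        return "unclassified"                           # to be handled by algorithm B
    stable = best_node["min_pv"] > min_pv_floor         # worst article in node still decent
    good = stable and (best_avg - global_avg_pv) > min_lift
    return "good" if good else "bad"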

Also, you need to update your training set and the node table (including automatically discovered new nodes) every six months or so.

Parameters must be calibrated to guarantee that

  • the error rate (classifying a good observation as bad, or the other way around) is small enough; it is measured using a confusion matrix
  • the system is robust: we have a reasonable number of stable nodes that are big enough; it is great if less than 3,000 stable, not too small nodes cover 80% of the observations (by stable, we mean nodes with low variance) with an average of at least 10 observations per node
  • the binning and feature selection mechanism offer real predictive power: the average response (our pv) measured in a node classified as good, is much above the general average, and the other way around for nodes classified as bad; in addition, the response shows little volatility within each node (in our case, pv is relatively stable across all observations within the same usable node)
  • we have enough usable nodes (that is, after excluding the small ones) to cover at least 50% of all observations, and if possible up to 95% of all observations (100% would be ideal but never exists in practice)

We discuss the parameters of our technique, and how to fine-tune them, in the next section. Fine-tuning can be automated or made more robust by testing (say) 2,000 sets of parameters and identifying regions of stability in the parameter space that meet our criteria (in terms of error rate and so on).
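A minimal sketch of that brute-force calibration in Python; evaluate() is a placeholder for rebuilding the node table and scoring the control data with one parameter set, and the grid and acceptance thresholds are illustrative:

import itertools

def evaluate(params):
    # placeholder: in practice, rebuild/score the control data with these thresholds
    # and measure the real error rate and coverage
    return 0.12, 0.7

# hypothetical grid over (minimum node size, minimum pv, minimum average pv)
grid = itertools.product([3, 5, 8, 16], [6.5, 6.7, 6.9, 7.1], [7.2, 7.4, 7.6, 7.8])

acceptable = []
for params in grid:
    error_rate, coverage = evaluate(params)
    if error_rate < 0.15 and coverage > 0.5:    # criteria from the list above
        acceptable.append(params)
# clusters of neighboring acceptable parameter sets indicate a stable region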

A big question is what to do with observations that do not belong to any usable node: they cannot be classified. In our example it does not matter if 30% of the observations cannot be classified, but in many applications, it does. One way to address this issue is to use super-nodes: in our case, one node for all posts that are blogs, and another one for all posts that are not blogs (these two nodes cover 100% of observations, both past and future). The problem is that, usually, these super-nodes don't have much predictive power. A better solution consists of using two algorithms: the one described here based on usable nodes (let's call it algorithm A) and another one, called algorithm B, that classifies all observations. Observations that can't be classified or scored with algorithm A are classified/scored with algorithm B. You can read the details about how to blend the results of two algorithms in one of my patents. In practice, we have used Jackknife regression for algorithm B, a technique that is easy to implement, easy to understand, leads to simple interpretations, and is very robust. These features are important for systems designed to run automatically.
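A minimal sketch of the hybrid logic, reusing score() from the sketch above; algorithm_b stands in for the regression-based fallback (the article uses Jackknife regression, which is not shown here):

def hybrid_score(title, is_blog, global_avg_pv, algorithm_b):
    # algorithm A: node table; algorithm B: regression fallback for unclassified observations
    label = score(title, is_blog, global_avg_pv)
    if label != "unclassified":
        return label, "A"
    predicted_pv = algorithm_b(title, is_blog)     # e.g. a regression on the same features
    return ("good" if predicted_pv > global_avg_pv else "bad"), "B"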

The resulting hybrid algorithm is called Hidden Decision Trees - hidden because you don't even realize that you have created a bunch of mini decision trees: it was all implicit. The version described here is version 2, with new features to prevent the node table from exploding, and allowing nodes to overlap, making it more suitable for data sets with a larger number of variables.

3. Case Study: Results

Our application about predicting page views for an article has been explained in detail in the previous sections. So here we focus on the results obtained.

Output from the algorithm

If you run the script listed in the next section, besides producing the table of key-value pairs (the nodes) as a text file for further automated processing, it displays summary statistics that look like the following:

Average pv: 6.81
Number of articles marked as good: 865 (real number is 1079)
Number of articles marked as bad: 1752 (real number is 1538)
Avg pv: articles marked as good: 8.23
Avg pv: articles marked as bad: 6.13
Number of false positive: 50 (bad marked as good)
Number of false negative: 264 (good marked as bad)
Number of articles: 2617
Error Rate: 0.12
Number of feature values: 16712 (marked as good: 3409)
Aggregation factor: 1.62

The number of "feature values" is the total number of key-value pairs found, including the small, unstable ones, regardless of whether they are classified as good or bad. Any article with a pv above the arbitrary value pv_threshold = 7.1 (see source code) is considered good. This corresponds to articles having about 1.3 times more traffic than average, since we use a log scale and the average pv is 6.81. The traffic for articles classified as good by the algorithm (pv = 8.23) is about 4.2 times the traffic that an average article receives.
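These figures can be checked directly from the reported output (a quick sanity check, not part of the algorithm):

import math

error_rate = (50 + 264) / 2617            # (false positives + false negatives) / articles
print(round(error_rate, 2))               # 0.12

threshold_lift = math.exp(7.1 - 6.81)     # pv_threshold relative to the average pv
print(round(threshold_lift, 2))           # 1.34, i.e. about 1.3 times the average traffic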

Two important metrics are:

  • Aggregation factor: it is an indicator of the average size of a node. The minimum is 1, corresponding to nodes that only have one observation. A value above 5 is highly desirable, but here, because we are dealing with a small data set and with niche articles, even a small value is OK.
  • The error rate is the proportion of articles wrongly classified. Here we care much more about bad articles classified as good.

Also note that we correctly identify the vast majority of good articles, but this is because we work with small nodes. Finally, an article is marked as good if it triggers at least one node marked as good (that is, satisfying the criterion defined in the next sub-section).

Parameters

Besides pv_threshold, the algorithm uses 12 parameters to identify a usable, stable node classified as good. These parameters are illustrated in the following piece of code (see source code):

  # n = number of observations (articles) in the node
  # min, avg = minimum and average pv in the node; larger nodes get looser thresholds
  if ( (($n > 3)&&($n < 8)&&($min > 6.9)&&($avg > 7.6)) ||
       (($n >= 8)&&($n < 16)&&($min > 6.7)&&($avg > 7.4)) ||
       (($n >= 16)&&($n < 200)&&($min > 6.1)&&($avg > 7.2)) ) {

Here, n represents the number of observations in a node, while avg and min are the average and minimum pv for the node in question. We tested many combinations of values for these parameters. Increasing the required size (denoted as n) of a usable node will do the following:

  • decrease the number of good articles correctly identified as good
  • increase the error rate
  • increase the stability of the system
  • decrease the predictive power
  • increase the aggregation factor (see previous sub-section)

Improving the methodology

Here we share some caveats and possible improvements to our technique.

For increased efficiency, you need to use a table of one-token keywords that look like two tokens, and treat these keywords as a single token. For instance, "San Francisco" is a one-token keyword, despite its appearance. Such tables are easy to build, as the two parts almost always appear together.
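A minimal sketch of that substitution step, applied to titles before keyword extraction; the phrase table below is purely illustrative:

ONE_TOKEN_PHRASES = {"san francisco", "new york"}     # illustrative entries

def merge_phrases(title):
    # rewrite known multi-word phrases as single tokens
    text = title.lower()
    for phrase in ONE_TOKEN_PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

print(merge_phrases("Data Science Jobs in San Francisco"))
# data science jobs in san_francisco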

Also, we only looked at nodes containing (keyword_1, keyword_2) where the two keywords are adjacent. If you allow the two keywords to be non-adjacent, the number of key-value pairs (the nodes) increases significantly, but you don't get much additional predictive power in return: there is even a risk of over-fitting.

Another improvement consists of favoring nodes containing observations spread over a long time period, to avoid any kind of concentration (which could otherwise result in over-fitting).

Finally, in our case, we cannot exclusively focus on articles with great potential. It is important to have many less popular articles as well: they constitute the long tail. Without these articles, we face problems such as excessive content concentration, which has negative impacts in the long term. The most obvious one is that we might miss nascent topics, and thus get stuck with a non-adaptive mix of articles at some point, slowing growth.

Interesting findings

Titles with the following features work well:

  • contains a number (10, 15 and so on) as we have many popular articles such as "10 great deep learning articles".
  • contains the current year (2017)
  • is a question  (how to)
  • not a blog, but a book category
  • a blog

Titles containing the following keywords work well:

  • everyone (as in "10 regression techniques everyone should know")
  • libraries
  • infographic
  • explained
  • algorithms
  • languages
  • amazing
  • must read
  • r python
  • job interview questions
  • should know (as in "10 regression techniques everyone should know")
  • nosql databases
  • versus
  • decision trees
  • logistic regression
  • correlations
  • tutorials
  • code
  • free

Recommended reading

You might also like my related articles about a data scientist sharing his secrets, and about turning unstructured data into structured data. To read my best data science and machine learning articles, click here.

4. Source Code

The source code is easy to read and has deliberately been made longer than needed, to provide enough detail, avoid complicated iterations, and facilitate maintenance and translation into Python or R. The output file hdt-out2.txt stores the key-value pairs (or nodes) that are usable, corresponding to popular articles. Here is the input data set: HDT-data3.txt.

