In this article, we discuss a general machine learning technique to make predictions or score transactional data, applicable to very big, streaming data. This hybrid technique combines different algorithms to boost accuracy, outperforming each algorithm taken separately, yet it is simple enough to be reliably automated. It is illustrated in the context of predicting the performance of articles published in media outlets or blogs, and has been used by the author to build an AI (artificial intelligence) system to detect articles worth curating, as well as to automatically schedule tweets and other postings in social networks for maximum impact, with a goal of eventually fully automating digital publishing. This application is broad enough that the methodology can be applied to most NLP (natural language processing) contexts with large amounts of unstructured data. The results obtained in our particular case study are also very interesting.

Figure 1: HDT 1.0 (this article describes HDT 2.0)

The algorithmic framework described here applies to any data set, text or not, with quantitative variables, non-quantitative variables (gender, race), or a mix of both. It consists of several components; we discuss in detail those that are new and original. The other, non-original components are briefly mentioned, with references provided for further reading. No deep technical expertise and no mathematical knowledge is required to understand the concepts and methodology described here. The methodology, though state-of-the-art, is simple enough that it can even be implemented in Excel for small data sets (one million observations).

1. The Problem

Rather than first presenting a general, abstract framework and then showing how it applies to a specific problem (case study), we decided to proceed the other way around, as we believe that it will help the reader better understand our methodology. We then generalize to any kind of data set.

In its simplest form, our particular problem consists of analyzing historical data about articles and blog posts, to identify features (also called metrics or variables) that, combined together, are good predictors of blog popularity, and to build a system that can predict the popularity of an article before it gets published. The goal is to select the right mix of relevant articles to publish, to increase web traffic, and thus advertising dollars, for a niche digital publisher.

As in any similar problem, the historical data is called the training set, and it is split into test data and control data for cross-validation purposes, to avoid over-fitting. The features are selected to maximize some measure of predictive power, as described here. All of this is (so far) standard practice; readers not familiar with it can Google the keywords introduced in this paragraph. In our particular case, we use our domain expertise to identify great features. These features are pretty generic and apply to numerous NLP contexts, so you can re-use them for your own data sets.

Feature Selection and Best Practices

One caveat is that some metrics are very sensitive to distortion. In our case, the response (that is, what we are trying to predict, also called the dependent variable by statisticians) is the traffic volume. It can be measured in page views, unique page views, or the number of users who read the article. Page views can easily be manipulated, and the count is inflated by web robots, especially for articles that have little traffic. So instead, we chose "unique page views", a more robust metric available through Google Analytics.

Older articles have also accumulated more page views over time, so we need to correct for this effect. Correcting for time is explained in this article. Here we used a very simple approach instead: focusing on articles from the most recent, big channel (the time window is about two years), and taking the logarithm of unique page views (denoted as pv in the source code in the last section). Taking the logarithm not only smooths out the effect of time and web robots, but also makes perfect sense because the page view distribution is highly skewed (well modeled by a Zipf distribution), with a few incredibly popular (viral) articles and a large number of articles with average traffic: it is a bit like the income distribution.
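As a quick illustration, here is a minimal sketch of this transformation in Python. The sample titles and page view counts are made up for the example; real numbers would come from a Google Analytics export.

```python
import math

# Hypothetical sample of (title, unique page views) pairs; in practice the
# counts would come from a Google Analytics export over the chosen time window.
articles = [
    ("A Simple Introduction to Data Science", 900),
    ("R versus Python: Which One Should You Learn First?", 12000),
    ("A Viral Post About Data Science Careers", 480000),
]

# pv = log of unique page views. The log tames the Zipf-like skew, so a
# viral article no longer dwarfs a typical one by orders of magnitude.
for title, upv in articles:
    pv = math.log(upv)
    print(f"pv = {pv:5.2f}   unique page views = {upv:7d}   {title}")
```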
As for selecting the features, we have two kinds of metrics to choose from as predictors:

1. Metrics based on the article title (easy to compute)
2. Metrics based on the article body (more difficult to compute)

Despite focusing only on a subset of features associated with the article title, we were able to get very interesting, actionable insights; we only used title keywords, and whether or not the posting is a blog. Keep in mind that the methodology used here takes into account all potential key-value combinations, where a key is a subset of features and a value is the corresponding set of feature values: for instance, key = (keyword_1, keyword_2, article category) and value = ("Python", "tutorial", "Blog"). So it is important to appropriately bin the variables when turning them into features, to prevent the number of key-value pairs from exploding. Another mechanism, described further down in this article, is also used to keep the key-value database, stored as a hash table or associative array, manageable. Finally, it can easily be implemented in a distributed environment (Hadoop). Due to the analogy with decision trees, a key-value pair is also called a node, and plays the same role as a node in a decision tree.

2. Methodology and Solution

As we have seen in the previous section, the problem consists of predicting pv, the logarithm of unique page views for an article (over some time period), as a function of the keywords found in its title, and whether the article in question is a blog or not. To do so, we created lists of all one-token and two-token keywords found in the titles, along with blog status, after cleaning the titles and eliminating stop words such as "that", "and" or "the", which have no impact on the predictions. We were also careful not to eliminate all keywords made up of one or two letters: the one-letter keyword "R", corresponding to the programming language R, has high predictive power.

For each element in our lists, we recorded the frequency and traffic popularity. More precisely, for each key-value pair, we recorded the number of articles (titles, actually) associated with it, as well as the average, minimum and maximum pv across these articles.

Example

For instance, the element or key-value pair (keyword_1 = "R", keyword_2 = "Python", article = "Blog") is associated with 6 articles, and has the following statistics: avg pv = 8.52, min pv = 7.41, and max pv = 10.45. Since the average pv across all articles is equal to 6.83, this specific key-value pair (also called a node) generates exp(8.52 - 6.83) = 5.42 times the traffic of an average article. It is thus a great node. Even the worst of the 6 articles belonging to this node, with a pv of 7.41, outperforms the average article across all nodes. So not only is this a great node, it is also a stable one. Some nodes have far larger volatility, for instance when one of the keywords has different meanings, such as the word "training" in "training deep learning" (training set) versus "deep learning training" (courses).
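To make the node table concrete, here is a minimal sketch in Python under a few assumptions: article data arrives as (title, is_blog, pv) tuples, two-token keywords are taken as unordered pairs of title tokens (the original may well use adjacent bigrams), and the stop word list is a tiny placeholder. The function and field names are invented for this illustration, not taken from the original source code.

```python
from itertools import combinations

STOP_WORDS = {"that", "and", "the", "a", "of", "to"}  # placeholder list

def title_keywords(title):
    """Return one-token and two-token keywords from a cleaned title."""
    tokens = [t for t in title.lower().split() if t not in STOP_WORDS]
    return [(t,) for t in tokens] + list(combinations(tokens, 2))

def build_node_table(articles):
    """articles: iterable of (title, is_blog, pv) tuples.
    Returns the node table: a hash table mapping each key-value pair (node)
    to its article count and the average, minimum and maximum pv."""
    nodes = {}
    for title, is_blog, pv in articles:
        for kw in title_keywords(title):
            key = kw + ("Blog" if is_blog else "Other",)
            s = nodes.setdefault(key, {"n": 0, "sum": 0.0,
                                       "min": float("inf"),
                                       "max": float("-inf")})
            s["n"] += 1
            s["sum"] += pv
            s["min"] = min(s["min"], pv)
            s["max"] = max(s["max"], pv)
    for s in nodes.values():  # finalize the average pv per node
        s["avg"] = s["sum"] / s["n"]
    return nodes
```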
Hidden decision trees (HDT) revisited

Note that here, the nodes are overlapping, allowing considerable flexibility. In particular, nodes with two keywords are sub-nodes of nodes with one keyword. A previous version of this technique, described here, did not consider overlapping nodes. Also, with highly granular features, the number of nodes explodes exponentially. The general idea behind this technique, which addresses that problem, is to group articles into buckets (nodes) that are large enough to provide sound predictions, without explicitly building decision trees. Not only are the nodes simple and easy to interpret, but unstable nodes are easy to detect and discard. There is no splitting/pruning involved as with classical decision trees, making this methodology simple and robust, and thus fit for artificial intelligence (automated processing).

General framework

Whether you are predicting the popularity of an article or the risk that a client defaults on a loan, the basic methodology is identical. It involves training sets, cross-validation, feature selection, binning, and populating hash tables of key-value pairs (referred to here as the nodes). When you process a new observation, you check which node(s) it belongs to. If the best node it belongs to is stable and not too small, you use that node to predict the future performance or value of your observation, or to score the transaction if you are dealing with transactional data such as credit card transactions. In our example, if the performance metric (the average pv in the node in question) is significantly above the global average, and other constraints are met (the node is not too small, and the minimum pv in the node is not too low, to guarantee stability), then we classify the observation as good, just like the node it belongs to. In our case, the observation is a potential article. Also, you need to update your training set and the node table (including automatically discovered new nodes) every six months or so.
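As a closing illustration, here is a hedged sketch of that scoring step, reusing the title_keywords helper and node table from the previous snippet. The thresholds (min_node_size, and requiring the node's minimum pv to beat the global average) are illustrative assumptions, not the values used by the author, and would normally be tuned by cross-validation.

```python
def score_article(title, is_blog, nodes, global_avg_pv, min_node_size=5):
    """Return the best stable node covering this article, or None.

    A node qualifies if it is not too small, its average pv is above the
    global average, and its minimum pv is not too low (stability).
    Thresholds are illustrative only.
    """
    best = None
    for kw in title_keywords(title):
        key = kw + ("Blog" if is_blog else "Other",)
        stats = nodes.get(key)
        if stats is None or stats["n"] < min_node_size:
            continue  # node missing or too small: unreliable
        if stats["avg"] > global_avg_pv and stats["min"] >= global_avg_pv:
            if best is None or stats["avg"] > best["avg"]:
                best = stats
    return best  # None means no stable node covers this article

# Usage: flag a candidate article as good if a qualifying node exists.
# node = score_article("R versus Python tutorial", True, nodes, global_avg_pv=6.83)
# if node is not None:
#     print("predicted pv:", node["avg"])
```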