Next Topic Model

OP: oliyiyi
Topic modeling is a great way to get a bird's eye view of a large document collection using machine learning. Here are 3 ways to use the open source Python tool Gensim to choose the best topic model.

By Lev Konstantinovskiy, RaRe Technologies.

"How to choose the best topic model?" is the #1 question on our community mailing list. At RaRe Technologies I manage the community for the Python open source topic modeling package gensim. Since so many people are looking for the answer, we’ve recently released an updated gensim 0.13.1 incorporating several exciting new features which evaluate whether your model is any good, helping you to select the best topic model.


Fig 1. Top: 15 most probable words for four selected topics. Bottom: a text document with words colored according to the topic they belong to. Taken from the Latent Dirichlet Allocation paper by David M. Blei.
What is Topic Modeling?


Topic modeling is a technique for taking some unstructured text and automatically extracting its common themes, using machine learning. It is a great way to get a bird's eye view of a large text collection.

A quick recap on what topic modeling does: a topic is a probability distribution over the vocabulary. For example, if we were to create three topics for the Harry Potter series of books manually, we might come up with something like this:

  • (the Muggle topic) 50% “Muggle”, 25% “Dursley”, 10% “Privet”, 5% “Mudblood”...
  • (the Voldemort topic) 65% “Voldemort”, 12% “Death”, 10% “Horcrux”, 5% “Snake”...
  • (the Harry topic) 42% “Harry Potter”, 15% “Scar”, 7% “Quidditch”, 7% “Gryffindor”...

In the same way, we can represent individual documents as a probability distribution over topics. For example, Chapter 1 of Harry Potter book 1 introduces the Dursley family and has Dumbledore discuss Harry’s parents’ death. If we take this chapter to be a single document, it could be broken up into topics like this: 40% Muggle topic, 30% Voldemort topic, and the remaining 30% is the Harry topic.

Of course, we don’t want to extract the topics and document probabilities by hand like this. We want the machine to do it automatically, using our unlabelled text collection as the only input. Because there is no document labeling and no human annotation, topic modeling is an example of an unsupervised machine learning technique.
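To make this concrete, here is a minimal sketch of asking gensim for a document's topic distribution. The three-document toy corpus and the topic count are invented for illustration; they are not from the paper or the books.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Invented toy corpus, one tokenized document per list
docs = [['harry', 'scar', 'quidditch'],
        ['muggle', 'dursley', 'privet'],
        ['voldemort', 'horcrux', 'snake']]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)

# The topic mixture of the first document, as (topic_id, probability) pairs,
# e.g. [(0, 0.4), (1, 0.3), (2, 0.3)]
print(lda.get_document_topics(corpus[0]))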

Another, more practical example would be breaking your internal company documents into topics, providing a bird's eye view of their contents for convenient visualization and browsing:


Fig 2. Use topic models to create bird’s eye view of internal company documents, with an option to drill down into individual documents by their topic (rather than just keywords).
Latent Dirichlet Allocation = LDA


The most popular topic model in use today is Latent Dirichlet Allocation. To understand how this works, Edwin Chen’s blog post is a very good resource. This link has a nice repository of explanations of LDA, which might require a little mathematical background. This paper by David Blei is a good go-to, as it sums up the various types of topic models which have been developed to date.

If you want to get your hands dirty with some nice LDA and vector space code, the gensim tutorial is always handy.

Choosing the Best Topic Model: Coloring words


Once you have your topics the next step is to determine if they are any good or not. If they are, then you would simply go ahead and plug them into your collection browser or classifier. If not, maybe you should train the model a bit more or with different parameters.

One of the ways to analyze the model is to color document words depending on the topic they belong to. This feature was recently added to gensim by our 2016 Google Summer of Code student Bhargav. You can take a look at the Python code in this notebook. Figure 1 above is an example of this functionality from the original LDA paper by David Blei.

An interesting example would be the word ‘bank’ which could mean ‘a financial institution’ or ‘a river bank’. A good topic modeling algorithm can tell the difference between these two meanings based on context. Coloring words is a quick way to assess if the model understands their meaning and if it is any good.

For example, we trained two topic models on a toy corpus of nine documents.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Nine tiny documents, already tokenized
texts = [['river', 'bank', 'nature'],
         ['money', 'finance', 'bank', 'currency', 'up', 'down'],
         ['trading', 'computer', 'interface', 'system'],
         ['forest', 'nature', 'water', 'tree'],
         ['option', 'derivative', 'latency', 'bank', 'trading'],
         ['tree'],
         ['exchange', 'rate', 'option'],
         ['interest', 'rate', 'up'],
         ['field', 'forest', 'river']]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Same data, very different amounts of training
goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=50, num_topics=2)
badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)

One LDA model is trained for 50 iterations and another is trained for just one iteration. We expect the model to get better the longer we train it.

You may notice that the texts above don’t look like the texts we are used to; instead, they are Python lists. That is because we converted them to a Bag of Words representation, which is how the LDA model sees text: the word order doesn’t matter and some very frequent words are removed. For example, 'A bank of a fast river.' becomes ['bank', 'river', 'fast'] in Bag of Words format.
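As a minimal sketch of that conversion (the tokenizer and stop-word list below are our own stand-ins; the article does not show its preprocessing):

from gensim.corpora import Dictionary

stopwords = {'a', 'of'}  # stand-in stop-word list
tokens = [w for w in 'A bank of a fast river.'.lower().rstrip('.').split()
          if w not in stopwords]
print(tokens)  # ['bank', 'fast', 'river'] -- order no longer matters

toy_dictionary = Dictionary([tokens])
print(toy_dictionary.doc2bow(tokens))  # sparse (word_id, count) pairs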

Let’s see how good the two models are at distinguishing between a ‘river bank’ and a ‘financial bank’. If all the words in a document are about nature, then our swing word ‘bank’ should become a ‘river bank’, colored in the Nature topic’s color, blue.

bow_water = ['bank', 'water', 'river', 'tree']
color_words(goodLdaModel, bow_water)

[colored output: bank river water tree, each word rendered in the color of its topic]

color_words(badLdaModel, bow_water)

[colored output: bank river water tree]

The good model successfully completes this task, while the bad model thinks ‘bank’ is a ‘financial bank’ and colors it red.
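The color_words helper comes from the notebook linked above. As a rough sketch of the mechanism behind it, one can query gensim's per-word topic assignments directly; the printing helper below is our own simplification, not the notebook's code:

def print_word_topics(model, words, dictionary):
    # Ask the model which topic each word most likely belongs to in context
    bow = dictionary.doc2bow(words)
    doc_topics, word_topics, phi_values = model.get_document_topics(
        bow, per_word_topics=True)
    for word_id, topics in word_topics:
        if topics:  # topic ids are sorted by relevance; the first dominates
            print(dictionary[word_id], '-> topic', topics[0])

print_word_topics(goodLdaModel, ['bank', 'water', 'river', 'tree'], dictionary)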




oliyiyi posted on 2016-7-18 07:10:01 (continued):
Choosing the Best Topic Model: pyLDAvis


We can also tell that the better-trained model fits quite well because it has clear Nature and Finance topics. The visualisation below is from pyLDAvis, a wonderful visualisation tool for qualitative assessment of topic models. You can play interactively with this particular visualization in this Jupyter notebook. There is also a great introduction to pyLDAvis from its creator Ben Mabey in his talk on YouTube.
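For reference, producing such an interactive view takes only a few lines. A minimal sketch, assuming the pyLDAvis releases current in 2016, which exposed the gensim adapter as pyLDAvis.gensim (renamed pyLDAvis.gensim_models in later releases):

import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in later releases

vis = pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)
pyLDAvis.display(vis)  # renders the interactive view inline in a notebook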


Fig 3. Good Topic Model in pyLDAvis. The most relevant words are shown on the right for the circle (topic) highlighted in red. The blue bar next to a word, for example ‘bank’, corresponds to the frequency of that word in the whole collection of documents. The red part of the bar is the frequency of the term in the selected topic. We can confidently name topic #1 a Finance topic because the words shown next to it are exactly what we would expect in Finance: ‘bank’, ‘trading’, ‘option’ and ‘rate’. Also, the word ‘bank’ appears most often in this topic, as shown by its large red bar.


Fig 4. Bad Topic Model in pyLDAvis. The words inside a topic don’t relate to each other. Compare the good model trained for 50 iterations (9*50 = 450 document passes) to this bad, barely-trained model that saw only 1 iteration (nine documents): it mixes very unrelated words within one topic. Both ‘tree’ and ‘trading’ appear in the list for the same topic #2, while we would expect to see them in different ones, since ‘tree’ relates to Nature and ‘trading’ to Finance. So this topic model doesn’t make sense.
Choosing the Best Topic Model: Quantitative approach


There is a new gensim feature to automatically choose the best model without manual visualisation in pyLDAvis or word coloring. It is called ‘topic coherence’. One of the students currently enrolled in our Incubator program, Devashish, has implemented it in Python based on the paper by Michael Röder et al.

There is an interesting twist here. Surprisingly, a mathematically rigorous calculation of model fit (data likelihood, perplexity) doesn't always agree with human opinion about the quality of the model, as shown in the aptly titled paper "Reading Tea Leaves: How Humans Interpret Topic Models". But another formula has been found to correlate well with human judgement. It is called 'C_v topic coherence'. It measures how often the topic words appear together in the corpus; of course, the trick is how to define ‘together’. Gensim supports several topic coherence measures, including C_v. You can explore them in this Jupyter notebook.

As expected from our manual inspections above, the model trained for 50 iterations has the higher coherence. Now you can automatically choose the best model using this number.

from gensim.models.coherencemodel import CoherenceModel

goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(goodcm.get_coherence())
# 0.552164532134

badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(badcm.get_coherence())
# 0.5269189184

Conclusion


We have covered three ways to evaluate a topic model: coloring words, pyLDAvis and topic coherence. The one you choose depends on the number of models and topics. If you have a handful of models and a small number of topics, you can run the manual inspections in a reasonable amount of time; coloring swing words in your specific domain is an important check to get right. In other situations manual inspection is not feasible, for example if you’ve run an LDA parameter grid search and have a lot of models, or if you have thousands of topics. In that case the only way is to use automated topic coherence to find the most coherent model, then quickly validate the winner manually with word coloring and pyLDAvis; a sketch of that automated selection follows below.
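A minimal sketch of such a coherence-driven selection, reusing the toy corpus from above; the parameter grid here is purely illustrative:

from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

best_model, best_coherence = None, float('-inf')
for num_topics in (2, 3, 5, 10):  # illustrative grid; sweep other parameters too
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     iterations=50, num_topics=num_topics)
    cm = CoherenceModel(model=model, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    if score > best_coherence:
        best_model, best_coherence = model, score

print('best C_v coherence:', best_coherence)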

I hope you find these model selection techniques useful in your NLP applications! Let us know if you have any questions about them on the gensim mailing list. We also offer NLP consulting services at RaRe Technologies.

Bio: Lev Konstantinovskiy, an expert in natural language processing, is a Python and Java developer. Lev has extensive experience working with financial institutions and is RaRe’s manager of open source communities including gensim, an open source machine learning toolkit for understanding human language. Lev holds the position of Open Source Evangelist, R&D at RaRe Technologies.


h2h2 replied on 2016-7-18 16:16:49:
Thanks for sharing.
