
Topic Models using Julia


OP Scalachen posted on 2015-3-31 21:26:44

# NB: this module is written in the Julia 0.3-era dialect (typealias, type,
# float64/int64, apply); later Julia versions renamed or removed these.
module TopicModels

import Base.length

# A ragged matrix: a vector of variable-length vectors.
typealias RaggedMatrix{T} Array{Array{T,1},1}

# A corpus: documents as vectors of word ids, with per-token weights.
type Corpus
  documents::RaggedMatrix{Int64}
  weights::RaggedMatrix{Float64}

  Corpus(documents::RaggedMatrix{Int64},
         weights::RaggedMatrix{Float64}) = begin
    return new(
      documents,
      weights
    )
  end

  Corpus(documents::RaggedMatrix{Int64}) = begin
    weights = map(documents) do doc
      ones(Float64, length(doc))
    end
    return new(
      documents,
      weights
    )
  end
end

# LDA model state: priors, topic-word counts, document-topic counts,
# and the current topic assignment of every token.
type Model
  alphaPrior::Vector{Float64}
  betaPrior::Float64
  topics::Array{Float64,2}
  topicSums::Vector{Float64}
  documentSums::Array{Float64,2}
  assignments::RaggedMatrix{Int64}
  frozen::Bool
  corpus::Corpus

  Model(alphaPrior::Vector{Float64},
        betaPrior::Float64,
        V::Int64,
        corpus::Corpus) = begin
    K = length(alphaPrior)
    m = new(
      alphaPrior,
      betaPrior,
      zeros(Float64, K, V), # topics
      zeros(Float64, K), # topicSums
      zeros(Float64, K, length(corpus.documents)), # documentSums
      fill(Array(Int64, 0), length(corpus.documents)), # assignments
      false,
      corpus
    )
    initializeAssignments(m)
    return m
  end

  # Wrap a trained model around a new corpus; `frozen` keeps the shared
  # topic-word counts fixed so only document-topic counts change.
  Model(trainedModel::Model, corpus::Corpus) = begin
    m = new(
      trainedModel.alphaPrior,
      trainedModel.betaPrior,
      trainedModel.topics,
      trainedModel.topicSums,
      trainedModel.documentSums,
      fill(Array(Int64, 0), length(corpus.documents)),
      true,
      corpus
    )
    initializeAssignments(m)
    return m
  end
end

function length(corpus::Corpus)
  return length(corpus.documents)
end

# Give every token a random initial topic drawn from the alpha prior.
function initializeAssignments(model::Model)
  for dd in 1:length(model.corpus)
    @inbounds words = model.corpus.documents[dd]
    @inbounds model.assignments[dd] = fill(0, length(words))
    for ww in 1:length(words)
      @inbounds word = words[ww]
      topic = sampleMultinomial(model.alphaPrior)
      @inbounds model.assignments[dd][ww] = topic
      updateSufficientStatistics(
        word, topic, dd, model.corpus.weights[dd][ww], model)
    end
  end
  return
end

# Draw an index from an unnormalized discrete distribution p.
function sampleMultinomial(p::Array{Float64,1})
  pSum = sum(p)
  r = rand() * pSum
  K = length(p)
  for k in 1:K
    if r < p[k]
      return k
    else
      r -= p[k]
    end
  end
  return 0
end

# Fill `out` with the (unnormalized) conditional distribution over topics
# for one token: (n_kd + alpha_k) * (n_kw + beta) / (n_k + V*beta).
function wordDistribution(word::Int,
                          document::Int,
                          model::Model,
                          out::Vector{Float64})
  V = size(model.topics, 2)
  for ii in 1:length(out)
    u = (model.documentSums[ii, document] + model.alphaPrior[ii]) *
        (model.topics[ii, word] + model.betaPrior) /
        (model.topicSums[ii] + V * model.betaPrior)
    @inbounds out[ii] = u
  end
  return
end

function sampleWord(word::Int,
                    document::Int,
                    model::Model,
                    p::Vector{Float64})
  wordDistribution(word, document, model, p)
  sampleMultinomial(p)
end

# Add (or, with a negative scale, remove) one token's contribution to the
# count matrices. When the model is frozen, the shared topic-word counts
# stay fixed and only the document-topic counts move.
function updateSufficientStatistics(word::Int64,
                                    topic::Int64,
                                    document::Int64,
                                    scale::Float64,
                                    model::Model)
  fr = float64(!model.frozen)
  @inbounds model.documentSums[topic, document] += scale
  @inbounds model.topicSums[topic] += scale * fr
  @inbounds model.topics[topic, word] += scale * fr
  return
end

# One collapsed-Gibbs sweep over a single document: for each token,
# subtract its counts, resample its topic, and add the counts back.
function sampleDocument(document::Int,
                        model::Model)
  @inbounds words = model.corpus.documents[document]
  Nw = length(words)
  @inbounds weights = model.corpus.weights[document]
  K = length(model.alphaPrior)
  p = Array(Float64, K)
  @inbounds assignments = model.assignments[document]
  for ii in 1:Nw
    @inbounds word = words[ii]
    @inbounds oldTopic = assignments[ii]
    updateSufficientStatistics(word, oldTopic, document, -weights[ii], model)
    newTopic = sampleWord(word, document, model, p)
    @inbounds assignments[ii] = newTopic
    updateSufficientStatistics(word, newTopic, document, weights[ii], model)
  end
  return
end

function sampleCorpus(model::Model)
  for ii in 1:length(model.corpus)
    sampleDocument(ii, model)
  end
  return
end

# Note: files are zero-indexed, but we are 1-indexed.
# Expand a "wordId:count" term into `count` copies of wordId + 1.
function termToWordSequence(term::String)
  parts = split(term, ":")
  fill(int64(parts[1]) + 1, int64(parts[2]))
end

# The functions below are designed for public consumption.
function trainModel(model::Model,
                    numIterations::Int64)
  for ii in 1:numIterations
    println(string("Iteration ", ii, "..."))
    sampleCorpus(model)
  end
  return
end

# For each topic, return the numWords highest-probability words.
function topTopicWords(model::Model,
                       lexicon::Array{ASCIIString,1},
                       numWords::Int64)
  [lexicon[reverse(sortperm(model.topics'[1:end, row]))[1:numWords]]
   for row in 1:size(model.topics,1)]
end

# Read documents in LDA-C style format: each line is
# "<numTerms> <wordId>:<count> <wordId>:<count> ...".
function readDocuments(stream)
  lines = readlines(stream)
  convert(
    RaggedMatrix{Int64},
    [apply(vcat, [termToWordSequence(term) for term in split(line, " ")[2:end]])
     for line in lines])
end

function readLexicon(stream)
  lines = readlines(stream)
  map(chomp, convert(Array{String,1}, lines))
end

export Corpus,
       Model,
       readDocuments,
       readLexicon,
       topTopicWords,
       trainModel

end
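To try the module out, here is a minimal usage sketch in the same Julia 0.3-era dialect. The file names documents.txt and lexicon.txt are hypothetical, as are the choices of K, the priors, and the iteration count; documents.txt is assumed to be in the LDA-C style format that readDocuments expects, and lexicon.txt to hold one word per line.

# Minimal usage sketch (hypothetical file names; Julia 0.3-era dialect).
include("TopicModels.jl")
using TopicModels

# Read an LDA-C style corpus and its lexicon.
corpus = open("documents.txt") do f
  Corpus(readDocuments(f))
end
lexicon = open(readLexicon, "lexicon.txt")

K = 10                               # number of topics (an assumption)
alpha = fill(0.1, K)                 # symmetric Dirichlet prior over topics
model = Model(alpha, 0.1, length(lexicon), corpus)

trainModel(model, 100)               # 100 Gibbs sweeps over the corpus

# Depending on the input, topTopicWords may need
# convert(Array{ASCIIString,1}, lexicon) to match its signature.
for (k, words) in enumerate(topTopicWords(model, lexicon, 10))
  println(string("Topic ", k, ": ", join(words, " ")))
end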

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
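To make that 10%-cats/90%-dogs arithmetic concrete, here is a tiny self-contained sketch (the vocabulary and per-topic word probabilities are invented for illustration): a document's word distribution is the mixture of the per-topic word distributions, weighted by the document's topic proportions.

# Toy illustration of the mixture intuition (all numbers invented).
# Columns of phi are P(word | topic) for a "cat" topic and a "dog" topic.
vocab = ["cat", "meow", "dog", "bone", "the"]
phi   = [0.4 0.0;
         0.3 0.0;
         0.0 0.4;
         0.0 0.3;
         0.3 0.3]
theta = [0.1, 0.9]        # document is 10% cat topic, 90% dog topic
p     = phi * theta       # P(word | document), the weighted mixture
for (w, pw) in zip(vocab, p)
  println(string(w, ": ", pw))
end
# The "dog"/"bone" mass (0.63) is 9x the "cat"/"meow" mass (0.07).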


In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum likelihood fit. A recent survey by Blei describes this suite of algorithms.[4] Several groups of researchers, starting with Papadimitriou et al.,[1] have attempted to design algorithms with provable guarantees. Assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition (SVD), the method of moments, and, most recently, an algorithm based on non-negative matrix factorization (NMF). This last algorithm also generalizes to topic models that allow correlations among topics.
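The code in this post takes the sampling route rather than SVD or NMF: trainModel runs collapsed Gibbs sampling for latent Dirichlet allocation. For reference, the per-token conditional that wordDistribution computes (up to normalization) is the standard collapsed-Gibbs update

$$P(z_{d,i} = k \mid z^{-(d,i)}, \mathbf{w}) \;\propto\; \bigl(n^{-}_{k,d} + \alpha_k\bigr)\,\frac{n^{-}_{k,w_{d,i}} + \beta}{n^{-}_{k,\cdot} + V\beta},$$

where $n_{k,d}$ is the number of tokens in document $d$ assigned to topic $k$ (documentSums), $n_{k,w}$ the number of times word $w$ is assigned to topic $k$ (topics), $n_{k,\cdot}$ the total count for topic $k$ (topicSums), and $V$ the vocabulary size. The superscript $-$ means the current token is excluded from the counts, which is exactly why sampleDocument subtracts a token's weight before resampling it.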

fantuanxiaot posted on 2015-3-31 22:44:21
Nice, nice.

20115326 (student verified) posted on 2017-9-28 10:52:49
Thanks for sharing.
