OP: ReneeBK

A Nonsensical Language Model using Theano LSTM


1. Today we will train a nonsensical language model!

2. We will first collect some language data, convert it to numbers, and then feed it to a recurrent neural network and ask it to predict upcoming words. When we are done we will have a machine that can generate sentences from our made-up language ad infinitum!

Attachment (hidden content in this post):

A Nonsensical Language Model using Theano LSTM.pdf (546.33 KB)



Keywords: Language Theano model Using lang

#2 ReneeBK, posted 2017-9-11 02:47:07
import random

# `samplers` maps parts of speech ("stop", "noun", "verb", "adverb",
# "punctuation") to functions that each return one random word; it is
# built earlier in the tutorial.
def generate_nonsense(word=""):
    if word.endswith("."):
        return word
    else:
        if len(word) > 0:
            word += " "
        word += samplers["stop"]()
        word += " " + samplers["noun"]()
        if random.random() > 0.7:
            word += " " + samplers["adverb"]()
            if random.random() > 0.7:
                word += " " + samplers["adverb"]()
        word += " " + samplers["verb"]()
        if random.random() > 0.8:
            word += " " + samplers["noun"]()
            if random.random() > 0.9:
                word += "-" + samplers["noun"]()
        if len(word) > 500:
            word += "."
        else:
            word += " " + samplers["punctuation"]()
        return generate_nonsense(word)

def generate_dataset(total_size):
    sentences = []
    for i in range(total_size):
        sentences.append(generate_nonsense())
    return sentences

# generate dataset
lines = generate_dataset(100)
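The code above assumes a `samplers` dictionary, built earlier in the tutorial, mapping parts of speech to word-sampling functions. A minimal sketch of such a dictionary, using made-up word lists (the names and words here are hypothetical, not the tutorial's own):

```python
import random

# hypothetical word lists standing in for the tutorial's made-up language:
words = {
    "stop": ["the", "a"],
    "noun": ["bork", "dop", "troop"],
    "verb": ["zaps", "chuffs"],
    "adverb": ["quite", "barely"],
    "punctuation": [";", ","],
}

# each sampler is a zero-argument function returning one random word
# of its part of speech (default arg pins `pos` inside the lambda):
samplers = {pos: (lambda pos=pos: random.choice(words[pos])) for pos in words}
```

With this in place, `samplers["noun"]()` returns a random noun each call, which is all `generate_nonsense` requires.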

#3 ReneeBK, posted 2017-9-11 02:47:26
### Utilities:
import numpy as np

class Vocab:
    __slots__ = ["word2index", "index2word", "unknown"]

    def __init__(self, index2word=None):
        self.word2index = {}
        self.index2word = []

        # add unknown word:
        self.add_words(["**UNKNOWN**"])
        self.unknown = 0

        if index2word is not None:
            self.add_words(index2word)

    def add_words(self, words):
        for word in words:
            if word not in self.word2index:
                self.word2index[word] = len(self.word2index)
                self.index2word.append(word)

    def __call__(self, line):
        """
        Convert from numerical representation to words
        and vice-versa.
        """
        if type(line) is np.ndarray:
            return " ".join([self.index2word[word] for word in line])
        if type(line) is list:
            if len(line) > 0:
                # was `line[0] is int`, which is always False:
                if isinstance(line[0], int):
                    return " ".join([self.index2word[word] for word in line])
            indices = np.zeros(len(line), dtype=np.int32)
        else:
            line = line.split(" ")
            indices = np.zeros(len(line), dtype=np.int32)

        for i, word in enumerate(line):
            indices[i] = self.word2index.get(word, self.unknown)

        return indices

    @property
    def size(self):
        return len(self.index2word)

    def __len__(self):
        return len(self.index2word)
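The heart of `Vocab` is the pair of mappings `word2index` and `index2word`, with unseen words falling back to index 0 (`**UNKNOWN**`). A minimal re-creation of that round trip, with hypothetical words, to show the encode/decode behavior:

```python
import numpy as np

# minimal re-creation of the Vocab mapping logic, slot 0 reserved for unknowns:
word2index, index2word = {}, []
for word in ["**UNKNOWN**"] + "the bork zaps".split():
    if word not in word2index:
        word2index[word] = len(word2index)
        index2word.append(word)

# encode: "dop" was never added, so it maps to the unknown index 0
indices = np.array([word2index.get(w, 0) for w in "the dop zaps".split()],
                   dtype=np.int32)
print(indices)  # [1 0 3]

# decode: indices back to words
print(" ".join(index2word[i] for i in indices))  # the **UNKNOWN** zaps
```

This is why the class adds `**UNKNOWN**` first in `__init__`: index 0 must be reserved before any real words are inserted.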

#4 ReneeBK, posted 2017-9-11 02:48:17
Create a Mapping from numbers to words
Now we can use the Vocab class to gather all the words and store an index:

vocab = Vocab()
for line in lines:
    vocab.add_words(line.split(" "))

To send our sentences in one big chunk to our neural network, we transform each sentence into a row vector and place each of these rows into a bigger matrix that holds all the rows. Not all sentences have the same length, so we pad those that are too short with 0s in pad_into_matrix:

def pad_into_matrix(rows, padding=0):
    if len(rows) == 0:
        # return an empty matrix and no lengths (keeps the return type consistent):
        return np.empty((0, 0), dtype=np.int32), []
    lengths = list(map(len, rows))  # materialize: `map` is a one-shot iterator in Python 3
    width = max(lengths)
    height = len(rows)
    mat = np.empty([height, width], dtype=rows[0].dtype)
    mat.fill(padding)
    for i, row in enumerate(rows):
        mat[i, 0:len(row)] = row
    return mat, lengths

# transform into big numerical matrix of sentences:
numerical_lines = []
for line in lines:
    numerical_lines.append(vocab(line))
numerical_lines, numerical_lengths = pad_into_matrix(numerical_lines)
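As a concrete check of what the padding step produces, two index rows of lengths 3 and 2 become a 2×3 matrix whose short row is filled out with the padding value. A sketch using the same logic as pad_into_matrix:

```python
import numpy as np

rows = [np.array([1, 2, 3], dtype=np.int32),
        np.array([4, 5], dtype=np.int32)]

# same logic as pad_into_matrix, with padding value 0:
lengths = list(map(len, rows))
mat = np.zeros((len(rows), max(lengths)), dtype=np.int32)
for i, row in enumerate(rows):
    mat[i, :len(row)] = row

print(mat)
# [[1 2 3]
#  [4 5 0]]
print(lengths)  # [3, 2]
```

Keeping the true lengths alongside the matrix matters later: the network must ignore the padded 0s when computing its loss.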

#5 ReneeBK, posted 2017-9-11 02:49:25
Prediction
We have now defined our network. At each timestep we can produce a probability distribution for each input index:

def create_prediction(self, greedy=False):
    def step(idx, *states):
        # new hiddens are the states we need to pass to the LSTMs
        # from the past. Because the StackedCells also include
        # the embeddings, and those have no state, we pass
        # a "None" instead:
        new_hiddens = [None] + list(states)

        new_states = self.model.forward(idx, prev_hiddens=new_hiddens)
        return new_states[1:]
    ...

Our inputs are an integer matrix Theano symbolic variable:

    ...
    # in a sequence-forecasting scenario we take everything
    # up to the step before last and predict the subsequent
    # steps, ergo 0 ... n - 1, hence:
    inputs = self.input_mat[:, 0:-1]
    num_examples = inputs.shape[0]
    # pass this to Theano's recurrence relation function:
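The recurrence function elided above is Theano's scan, which repeatedly applies `step` over the timesteps of the input while threading the hidden states through. A rough NumPy sketch of that idea, with a deliberately simplified stand-in for the model's forward pass (the `step` body here is hypothetical, not the tutorial's LSTM):

```python
import numpy as np

def step(idx, hidden):
    # hypothetical stand-in for model.forward: fold the input index
    # into the running hidden state
    return np.tanh(hidden + 0.1 * idx)

def scan_like(indices, initial_hidden):
    # what a scan does conceptually: apply `step` at each timestep,
    # feeding each output state back in as the next input state
    hiddens = []
    h = initial_hidden
    for idx in indices:
        h = step(idx, h)
        hiddens.append(h)
    return hiddens

hiddens = scan_like([1, 2, 3], np.zeros(4))
print(len(hiddens))  # 3: one hidden state per timestep
```

The real version builds this loop symbolically, so Theano can differentiate through every timestep for training.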

#6 MouJack007, posted 2017-9-11 06:46:45
Thanks for sharing, OP!


#8 yangbing1008, posted 2017-9-11 09:56:03
Thanks for sharing!

