Today we will train a nonsensical language model!
We will first collect some language data, convert it to numbers, and then feed it to a recurrent neural network, asking it to predict upcoming words. When we are done we will have a machine that can generate sentences from our made-up language ad infinitum!
Now we can use the Vocab class to collect every word in our data and assign each one an integer index:
In [ ]:
vocab = Vocab()
for line in lines:
    vocab.add_words(line.split(" "))
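The Vocab class itself is not shown above; a minimal sketch of what it might look like (the internal attribute names word2index and index2word are assumptions, only add_words is taken from the code above) is:

```python
class Vocab:
    """Minimal vocabulary: assigns each new word the next integer index."""
    def __init__(self):
        self.word2index = {}   # word -> integer index
        self.index2word = []   # integer index -> word

    def add_words(self, words):
        for word in words:
            if word not in self.word2index:
                self.word2index[word] = len(self.index2word)
                self.index2word.append(word)

    def __len__(self):
        return len(self.index2word)

vocab = Vocab()
vocab.add_words("a b b c".split(" "))
print(vocab.word2index)  # {'a': 0, 'b': 1, 'c': 2}
```

Duplicate words are ignored, so each distinct word gets exactly one index, and len(vocab) gives the vocabulary size.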
To send our sentences to our neural network in one big chunk, we transform each sentence into a row vector and stack these rows into a single matrix. Not all sentences have the same length, so we pad those that are too short with 0s in pad_into_matrix:
In [168]:
import numpy as np

def pad_into_matrix(rows, padding=0):
    if len(rows) == 0:
        # no sentences: return an empty matrix and no lengths
        return np.empty([0, 0], dtype=np.int32), []
    # materialize the lengths as a list (in Python 3, map returns a
    # one-shot iterator that max() would exhaust)
    lengths = [len(row) for row in rows]
    width = max(lengths)
    height = len(rows)
    mat = np.empty([height, width], dtype=rows[0].dtype)
    mat.fill(padding)
    for i, row in enumerate(rows):
        mat[i, 0:len(row)] = row
    return mat, lengths
# transform into big numerical matrix of sentences:
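To see the padding in action, here is a quick check with two short illustrative rows (the function is repeated so the cell runs on its own):

```python
import numpy as np

def pad_into_matrix(rows, padding=0):
    # same function as defined above, repeated so this cell is self-contained
    if len(rows) == 0:
        return np.empty([0, 0], dtype=np.int32), []
    lengths = [len(row) for row in rows]
    width = max(lengths)
    mat = np.empty([len(rows), width], dtype=rows[0].dtype)
    mat.fill(padding)
    for i, row in enumerate(rows):
        mat[i, 0:len(row)] = row
    return mat, lengths

rows = [np.array([1, 2, 3], dtype=np.int32),
        np.array([4, 5], dtype=np.int32)]
mat, lengths = pad_into_matrix(rows)
print(mat)
# [[1 2 3]
#  [4 5 0]]
print(lengths)  # [3, 2]
```

The shorter sentence is padded out to the width of the longest one, and the original lengths are returned alongside the matrix so the network can ignore the padded positions later.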