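What follows is a casual ramble: a plain-language look at four sequence decoding models used in neural networks, meant to explain the overall concepts and ideas in an accessible way. Fair warning: out of laziness I won't paste any formulas, and some details are skipped. If you are interested, go straight to the original papers [1][2][3].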
[1] Sequence to Sequence Learning with Neural Networks
[2] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
[3] Neural Machine Translation by Jointly Learning to Align and Translate
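The main neural model for encoding sequences is the RNN. Its currently popular variants, LSTM and GRU, differ only in the cell unit. Below, "RNN" stands for all of them.
The encoder is fairly simple. As the figure shows, the input text {X1-X6} is encoded step by step in a loop, producing a hidden state at each time step; once the sequence ends, these features are merged into a representation of the sentence. Note that a common approach is to take the encoder's hidden state at the last time step as the encoding of the whole sequence, but in practice this does not work very well, so our figure instead averages the hidden states over the whole sequence to obtain the encoding vector.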
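To make the two pooling choices concrete, here is a minimal NumPy sketch of a plain tanh RNN encoder. It is an illustration only: the function name `rnn_encode`, the weight names, and the toy sizes are assumptions, and the random weights stand in for trained ones.

```python
import numpy as np

def rnn_encode(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over `inputs` and return every hidden state."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:  # one step per input token
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)  # shape: (timesteps, hidden_dim)

rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 4, 3, 6  # six steps, like X1..X6
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
X = rng.normal(size=(timesteps, input_dim))  # stand-in for embedded tokens

H = rnn_encode(X, W_x, W_h, b)
last_state = H[-1]           # option 1: hidden state of the final time step
mean_state = H.mean(axis=0)  # option 2: average over all time steps
```

Both options give a vector of size `hidden_dim`; they differ only in how much of the early part of the sequence survives in it.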
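Early tasks were mostly classification, such as topic classification and sentiment detection, and adding a softmax on top of the encoding vector was enough. Problems such as machine translation and speech recognition, however, require decoding into a sequence.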
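For the classification case, "a softmax on top of the encoding vector" can be sketched as below. The vector `sentence_vec` is a random stand-in for an encoder output, and `W`, `b` and the two-class setup are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
hidden_dim, num_classes = 3, 2  # e.g. positive vs. negative sentiment
sentence_vec = rng.normal(size=hidden_dim)  # stand-in for the encoding vector
W = rng.normal(size=(num_classes, hidden_dim))
b = np.zeros(num_classes)

probs = softmax(W @ sentence_vec + b)  # one probability per class
predicted_class = int(np.argmax(probs))
```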
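Notice that during encoding, the RNN at each time step receives not only its own hidden state from the previous step but also the current input token, whereas during decoding there is no such input. One straightforward approach is to feed the encoding vector from the encoder to the decoder as its input at every time step, as shown in the figure.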
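A rough sketch of this simplest decoder, assuming a plain tanh RNN: the same context vector is presented at every step, which is what repeating the vector along the time axis achieves. `simple_decode` and the weight names are hypothetical, and the weights are random.

```python
import numpy as np

def simple_decode(context, W_c, W_h, b, steps):
    """Tanh RNN decoder whose input at every step is the same context vector."""
    h = np.zeros(W_h.shape[0])
    hidden_states = []
    for _ in range(steps):
        h = np.tanh(W_c @ context + W_h @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)  # shape: (steps, hidden_dim)

rng = np.random.default_rng(2)
hidden_dim, steps = 3, 5
context = rng.normal(size=hidden_dim)  # stand-in for the encoding vector
W_c = rng.normal(size=(hidden_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# The decoder's input sequence is just the context vector copied `steps` times.
decoder_inputs = np.tile(context, (steps, 1))
decoder_states = simple_decode(context, W_c, W_h, b, steps)
```

Each row of `decoder_states` would then go through a per-step softmax layer to pick an output token.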
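It is simple and intuitive, and the decoder is no different from the encoder. Researchers, however, felt this model was not elegant, so let's look at some more refined ones.
Let's use cheating on an exam as a down-to-earth analogy for these models.
First, suppose the input text is the textbook and the encoder output is the class notes we compiled from our understanding of it. The decoder's hidden layers are our brain, and the output at each time step is the answer we write on the exam paper. The simplest decoder above amounts to flipping through the class notes while writing each answer. If that is how an ordinary cheater works, the top student needs no notes at hand: with a powerful brain network, they have memorized their class notes. When decoding, they only look back at what they have already written and carefully write down the answers one after another. That is the model below [1]: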
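A simplified sketch of the top-student mode: the notes (context vector) are used only once, as the decoder's initial state, and every later step sees nothing but the previously written token. This uses a plain tanh cell with greedy argmax decoding instead of the LSTM and beam search of [1]; all names and the random weights are assumptions.

```python
import numpy as np

def decode_without_peek(context, E, W_x, W_h, W_o, steps, start_id=0):
    """Greedy decoding where the context only initialises the hidden state."""
    h = context.copy()  # the notes are memorised once
    prev_id = start_id  # hypothetical start-of-sequence token id
    output_ids = []
    for _ in range(steps):
        # Input at this step: embedding of the token written at the previous step.
        h = np.tanh(W_x @ E[prev_id] + W_h @ h)
        prev_id = int(np.argmax(W_o @ h))
        output_ids.append(prev_id)
    return output_ids

rng = np.random.default_rng(3)
vocab_size, hidden_dim, steps = 6, 4, 5
E = rng.normal(size=(vocab_size, hidden_dim))    # token embeddings
W_x = rng.normal(size=(hidden_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_o = rng.normal(size=(vocab_size, hidden_dim))  # hidden state -> token scores
context = rng.normal(size=hidden_dim)

output_ids = decode_without_peek(context, E, W_x, W_h, W_o, steps)
```

Because the context enters only through the initial state, everything the decoder needs has to survive in `h` across all steps, which is why this mode leans so heavily on model capacity.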
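Then there are the many weak students. Not only do they need to cheat, but while looking at the notes they also have to look back at the answer they wrote at the previous step (they are so weak that they cannot even remember what they just wrote). That is the answering mode below [2]: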
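The weak-student ("peek") mode can be sketched by letting every step see both the previous output token and the context vector. Again this is a simplified tanh cell rather than the gated unit of [2]; the zero initial state, the function name and the random weights are assumptions.

```python
import numpy as np

def decode_with_peek(context, E, W_x, W_h, W_c, W_o, steps, start_id=0):
    """Greedy decoding where the context is fed in again at every step."""
    h = np.zeros(W_h.shape[0])  # assumed zero start; [2] derives it from the context
    prev_id = start_id          # hypothetical start-of-sequence token id
    output_ids = []
    for _ in range(steps):
        # Previous token (own answer) plus a peek at the notes (context).
        h = np.tanh(W_x @ E[prev_id] + W_h @ h + W_c @ context)
        prev_id = int(np.argmax(W_o @ h))
        output_ids.append(prev_id)
    return output_ids

rng = np.random.default_rng(4)
vocab_size, hidden_dim, steps = 6, 4, 5
E = rng.normal(size=(vocab_size, hidden_dim))
W_x = rng.normal(size=(hidden_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_c = rng.normal(size=(hidden_dim, hidden_dim))  # weights for the peeked context
W_o = rng.normal(size=(vocab_size, hidden_dim))
context = rng.normal(size=hidden_dim)

output_ids = decode_with_peek(context, E, W_x, W_h, W_c, W_o, steps)
```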
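The hopeless students exist too. They need to cheat, they need to look back at the answer they wrote at the previous step, and they also need the teacher to highlight the key points in the textbook before they can put together their class notes (this is the attention mechanism: when taking notes, be sure to highlight the key points according to the exam question!). That is a lot of hand-holding. Their answering mode is as follows [3]: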
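The "highlighting" step can be sketched as additive attention in the spirit of [3]: at each decoding step, every encoder hidden state gets a score against the current decoder state, the scores are normalised with a softmax, and the notes for that step are the weighted average. The names `attention_context`, `W_s`, `W_h` and `v` and the random values are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H, W_s, W_h, v):
    """Return the attention-weighted context and the weights over encoder steps."""
    # One score per encoder time step: v . tanh(W_h h_j + W_s s_prev)
    scores = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v
    weights = softmax(scores)  # the "highlights", summing to 1
    return weights @ H, weights

rng = np.random.default_rng(5)
timesteps, hidden_dim, attn_dim = 6, 4, 3
H = rng.normal(size=(timesteps, hidden_dim))  # all encoder hidden states
s_prev = rng.normal(size=hidden_dim)          # previous decoder state
W_s = rng.normal(size=(attn_dim, hidden_dim))
W_h = rng.normal(size=(attn_dim, hidden_dim))
v = rng.normal(size=attn_dim)

context, weights = attention_context(s_prev, H, W_s, W_h, v)
```

This is also why the attention variant needs the encoder to return its whole sequence of hidden states instead of a single vector: the decoder recomputes `context` at every step.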
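So everyone except the top student cheats by looking at the class notes while answering (many papers call this decoder structure "peek"; doesn't it look like cheating?). The hopeless students have even had the teacher highlight the key points, and with clear highlights they no longer need to flip through the notes, a quick look is enough. The literature calls this a "glimpse". Quite similar, isn't it?
If we give all of them brains with the same structure (forcing their IQs to be equal), the cheaters will surely score highest, and the top-student mode is at a real disadvantage. Let's run a simple model test.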
Test data:
Input sequence texts = ['1 2 3 4 5'
, '6 7 8 9 10'
, '11 12 13 14 15'
, '16 17 18 19 20'
, '21 22 23 24 25']
Target sequence texts = ['one two three four five'
, 'six seven eight nine ten'
, 'eleven twelve thirteen fourteen fifteen'
, 'sixteen seventeen eighteen nineteen twenty'
, 'twenty_one twenty_two twenty_three twenty_four twenty_five']
The parameters are set as follows:
-
('Vocab size:', 51, 'unique words')
('Input max length:', 5, 'words')
('Target max length:', 5, 'words')
('Dimension of hidden vectors:', 20)
('Number of training stories:', 5)
('Number of test stories:', 5)
-
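As a quick check on where these numbers come from, the vocabulary can be rebuilt from the test data in a few lines of plain Python. This sketch uses whitespace splitting instead of the post's regex tokenizer, which gives the same tokens for this data; index 0 is left free for padding, hence the `+ 1`.

```python
input_text = ['1 2 3 4 5',
              '6 7 8 9 10',
              '11 12 13 14 15',
              '16 17 18 19 20',
              '21 22 23 24 25']
tar_text = ['one two three four five',
            'six seven eight nine ten',
            'eleven twelve thirteen fourteen fifteen',
            'sixteen seventeen eighteen nineteen twenty',
            'twenty_one twenty_two twenty_three twenty_four twenty_five']

# Collect every distinct token from both sides.
tokens = set()
for sentence in input_text + tar_text:
    tokens.update(sentence.split())
vocab = sorted(tokens)

vocab_size = len(vocab) + 1  # +1 because index 0 is reserved for padding
word_to_idx = {w: i + 1 for i, w in enumerate(vocab)}
input_maxlen = max(len(s.split()) for s in input_text)
tar_maxlen = max(len(s.split()) for s in tar_text)

print(vocab_size, input_maxlen, tar_maxlen)
```

The 25 number tokens and 25 word tokens plus the padding slot give the vocabulary size of 51 reported above.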
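Observing the training process:
Here the first decoder is ordinary cheating, the second is top-student mode, the third is weak-student cheating, and the fourth is hopeless-student cheating.
As we can see, with the same IQ (the same decoder network structure), the hopeless-student cheating mode answers fastest (training converges fastest), while top-student mode is the slowest.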
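Paper [1] already points out that top-student mode needs a large model to perform well: it uses a deep LSTM with 4 layers of 1000 cells each, roughly 4000 hidden nodes in total (the top student's IQ really is high; that is one powerful brain network).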
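Think about it: when the textbook is very long, even the top student gets tired. And weak students, are you sure you understood the lecture? The hopeless students just laugh, because the teacher highlighted the key points for them!
The example code used for the tests in this post is available at [GitHub link]: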
# -*- encoding:utf-8 -*-
"""
Test Encoder-Decoder 2016/03/22
"""
# The Keras and seq2seq imports below follow the library APIs as of the date above.
from functools import reduce  # a builtin on Python 2, must be imported on Python 3

from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.layers.core import RepeatVector, TimeDistributedDense, Activation
from seq2seq.layers.decoders import LSTMDecoder, LSTMDecoder2, AttentionDecoder
import time
import numpy as np
import re

__author__ = 'http://jacoxu.com'


def pad_sequences(sequences, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
    '''Pads each sequence to the same length:
    the length of the longest sequence.
    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.
    Supports post-padding and pre-padding (default).
    # Arguments
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either in the beginning or in the end of the sequence
        value: float, value to pad the sequences to the desired value.
    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen)
    '''
    lengths = [len(s) for s in sequences]
    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)
    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break
    # start from an array filled with the padding value
    x = (np.ones((nb_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)
        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))
        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x


def vectorize_stories(input_list, tar_list, word_idx, input_maxlen, tar_maxlen, vocab_size):
    x_set = []
    # One-hot targets: (sample, time step, vocabulary index).
    # The plain `bool` dtype is used because the `np.bool` alias no longer exists in recent NumPy.
    Y = np.zeros((len(tar_list), tar_maxlen, vocab_size), dtype=bool)
    for _sent in input_list:
        x = [word_idx[w] for w in _sent]
        x_set.append(x)
    for s_index, tar_tmp in enumerate(tar_list):
        for t_index, token in enumerate(tar_tmp):
            Y[s_index, t_index, word_idx[token]] = 1
    return pad_sequences(x_set, maxlen=input_maxlen), Y


def tokenize(sent):
    '''Return the tokens of a sentence including punctuation.
    >>> tokenize('Bob dropped the apple. Where is the apple?')
    ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
    '''
    # Split on runs of non-word characters and keep them as tokens.
    # The group is not optional, so the pattern never matches the empty string.
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]


def main():
    input_text = ['1 2 3 4 5'
                  , '6 7 8 9 10'
                  , '11 12 13 14 15'
                  , '16 17 18 19 20'
                  , '21 22 23 24 25']
    tar_text = ['one two three four five'
                , 'six seven eight nine ten'
                , 'eleven twelve thirteen fourteen fifteen'
                , 'sixteen seventeen eighteen nineteen twenty'
                , 'twenty_one twenty_two twenty_three twenty_four twenty_five']

    input_list = []
    tar_list = []
    for tmp_input in input_text:
        input_list.append(tokenize(tmp_input))
    for tmp_tar in tar_text:
        tar_list.append(tokenize(tmp_tar))

    vocab = sorted(reduce(lambda x, y: x | y, (set(tmp_list) for tmp_list in input_list + tar_list)))
    # Reserve 0 for masking via pad_sequences
    vocab_size = len(vocab) + 1  # the Keras Embedding layer needs len(vocab) + 1
    input_maxlen = max(map(len, (x for x in input_list)))
    tar_maxlen = max(map(len, (x for x in tar_list)))
    output_dim = vocab_size
    hidden_dim = 20

    print('-')
    print('Vocab size:', vocab_size, 'unique words')
    print('Input max length:', input_maxlen, 'words')
    print('Target max length:', tar_maxlen, 'words')
    print('Dimension of hidden vectors:', hidden_dim)
    print('Number of training stories:', len(input_list))
    print('Number of test stories:', len(input_list))
    print('-')
    print('Vectorizing the word sequences...')
    word_to_idx = dict((c, i + 1) for i, c in enumerate(vocab))  # encoding: map each token to an integer index
    idx_to_word = dict((i + 1, c) for i, c in enumerate(vocab))  # decoding: map each integer index back to a token
    inputs_train, tars_train = vectorize_stories(input_list, tar_list, word_to_idx, input_maxlen, tar_maxlen, vocab_size)

    decoder_mode = 1  # 0: simplest mode, 1: mode of [1], 2: peek mode of [2], 3: attention mode of [3]
    if decoder_mode == 3:
        # the attention decoder needs every encoder hidden state, not just one vector
        encoder_top_layer = LSTM(hidden_dim, return_sequences=True)
    else:
        encoder_top_layer = LSTM(hidden_dim)

    if decoder_mode == 0:
        decoder_top_layer = LSTM(hidden_dim, return_sequences=True)
        decoder_top_layer.get_weights()  # return value is unused
    elif decoder_mode == 1:
        decoder_top_layer = LSTMDecoder(hidden_dim=hidden_dim, output_dim=hidden_dim
                                        , output_length=tar_maxlen, state_input=False, return_sequences=True)
    elif decoder_mode == 2:
        decoder_top_layer = LSTMDecoder2(hidden_dim=hidden_dim, output_dim=hidden_dim
                                         , output_length=tar_maxlen, state_input=False, return_sequences=True)
    elif decoder_mode == 3:
        decoder_top_layer = AttentionDecoder(hidden_dim=hidden_dim, output_dim=hidden_dim
                                             , output_length=tar_maxlen, state_input=False, return_sequences=True)

    en_de_model = Sequential()
    en_de_model.add(Embedding(input_dim=vocab_size,
                              output_dim=hidden_dim,
                              input_length=input_maxlen))
    en_de_model.add(encoder_top_layer)
    if decoder_mode == 0:
        # simplest mode: repeat the encoding vector as the decoder input at every step
        en_de_model.add(RepeatVector(tar_maxlen))
    en_de_model.add(decoder_top_layer)

    en_de_model.add(TimeDistributedDense(output_dim))
    en_de_model.add(Activation('softmax'))
    print('Compiling...')
    time_start = time.time()
    en_de_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    time_end = time.time()
    print('Compiled, cost time:%fsecond!' % (time_end - time_start))
    for iter_num in range(5000):
        en_de_model.fit(inputs_train, tars_train, batch_size=3, nb_epoch=1, show_accuracy=True)
        out_predicts = en_de_model.predict(inputs_train)
        for i_idx, out_predict in enumerate(out_predicts):
            predict_sequence = []
            for predict_vector in out_predict:
                # greedy choice: the most probable token at this time step
                next_index = np.argmax(predict_vector)
                next_token = idx_to_word[next_index]
                predict_sequence.append(next_token)
            print('Target output:', tar_text[i_idx])
            print('Predict output:', predict_sequence)

        print('Current iter_num is:%d' % iter_num)


if __name__ == '__main__':
    main()
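
The listing builds one-hot targets in `vectorize_stories` and later turns each predicted distribution back into a token with `np.argmax`. The round trip can be tried on its own with the small sketch below; the three-word vocabulary and the skipping of all-zero (padding) rows are assumptions added for illustration and are not part of the listing.

```python
import numpy as np

word_to_idx = {'one': 1, 'two': 2, 'three': 3}  # toy vocabulary, index 0 = padding
idx_to_word = {i: w for w, i in word_to_idx.items()}
vocab_size = len(word_to_idx) + 1

targets = [['one', 'two', 'three'], ['three', 'one']]
max_len = 3

# Encode: Y[sample, time step, token index] is True for the token at that step.
Y = np.zeros((len(targets), max_len, vocab_size), dtype=bool)
for s_index, sentence in enumerate(targets):
    for t_index, token in enumerate(sentence):
        Y[s_index, t_index, word_to_idx[token]] = True

# Decode: take the argmax of each time step, skipping rows that are all padding.
decoded = []
for sentence_matrix in Y:
    words = []
    for step_vector in sentence_matrix:
        if step_vector.any():
            words.append(idx_to_word[int(np.argmax(step_vector))])
    decoded.append(words)
```

With a trained model, `Y` would be replaced by the softmax output of shape (samples, time steps, vocabulary size), and the same argmax loop would read off the predicted tokens.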