


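Token Masking is a strategy widely used to train language models, both classification-oriented variants and generative models. It was first used by the BERT language model and has since been adopted by many of its variants (RoBERTa, ALBERT, DeBERTa…).

Text Corruption, by contrast, is a broader family of strategies of which Token Masking is one case. The BART research paper ran extensive experiments on training encoder-decoder generative models with different corruption strategies.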






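In Text Corruption (specifically in Token Masking, Token Deletion and Text Infilling), each word may be masked with a fixed probability, usually around 15-20%. The probability is kept low so that the model can still learn the context of each sentence even though the sequence is corrupted.

Other techniques, such as Sentence Permutation and Document Rotation, do not mask words with a given probability; they are covered later.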
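As a rough illustration of the fixed-probability selection described above, the sketch below makes an independent decision for every word. The names `choose_mask_positions` and `apply_mask`, the 0.15 default and the whitespace splitting are assumptions made for this example; a real pipeline would operate on tokenizer output instead.

```python
import random

def choose_mask_positions(words, probability=0.15, seed=None):
    """Return one boolean per word; True means the word is selected for corruption."""
    rng = random.Random(seed)
    return [rng.random() < probability for _ in words]

def apply_mask(words, selected, mask_token="<mask>"):
    """Replace every selected word with the mask token (hypothetical helper)."""
    return [mask_token if flag else word for word, flag in zip(words, selected)]

words = "Consequences of mutant huntington gene transcription are not well known".split()
selected = choose_mask_positions(words, probability=0.15, seed=0)
print(apply_mask(words, selected))
```

Because each word is decided independently, the number of masked words varies from one sentence to the next; only the expected fraction is fixed by `probability`.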



With that brief background on training language models with Text Corruption, the following sections walk through the different Text Corruption techniques using example code.


import stanza

# Text used in our examples
text = (
    "Huntington's disease is a neurodegenerative autosomal disease "
    "results due to expansion of polymorphic CAG repeats in the huntingtin gene. "
    "Phosphorylation of the translation initiation factor 4E-BP results in the "
    "alteration of the translation control leading to unwanted protein synthesis "
    "and neuronal function. Consequences of mutant huntington (mhtt) gene "
    "transcription are not well known. Variability of age of onset is an "
    "important factor of Huntington's disease separating adult and juvenile types. "
    "The factors which are taken into account are-genetic modifiers, maternal "
    "protection i.e excessive paternal transmission, superior ageing genes "
    "and environmental threshold. A major focus has been given to the molecular "
    "pathogenesis which includes-motor disturbance, cognitive disturbance and "
    "neuropsychiatric disturbance. The diagnosis part has also been taken care of. "
    "This includes genetic testing and both primary and secondary symptoms. "
    "The present review also focuses on the genetics and pathology of Huntington's"
)

# We will use a stanza model for getting each different sentence
# as an element of the list
stanza.download('en')  # fetch the English models on first use
nlp = stanza.Pipeline('en', use_gpu=False)
doc = nlp(text)
sentences = [sentence.text for sentence in doc.sentences]

Token Masking




from transformers import AutoTokenizer, DataCollatorForLanguageModeling
import torch

def load_dataset_mlm(sentences, tokenizer_class=AutoTokenizer,
                     collator_class=DataCollatorForLanguageModeling,
                     mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('google-bert/bert-base-uncased')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True,
                       truncation=True)
    # Random masking configuration
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,  # True for Masked Language Modelling
        mlm_probability=mlm_probability  # Chance for every token to get masked
    )
    # The collator expects a sequence of 1-D tensors, so split the batch
    # into one tensor per sentence (this also drops the first dimension).
    tuple_ids = tuple(inputs['input_ids'].unbind(dim=0))
    # Get input_ids, attention_masks and labels for each sentence.
    batch = data_collator(tuple_ids)
    return batch['input_ids'], inputs['attention_mask'], batch['labels']

input_ids, attention_mask, labels = load_dataset_mlm(sentences)

# Output for the first sentence, in order: input_ids, attention_mask, labels
tensor([ 101, 16364, 1005, 1055,   103, 2003, 1037,   103, 10976, 3207,
         103, 25284,   103, 25426, 16870, 4295, 3463, 2349, 2000,   103,
         1997, 26572, 18078, 6187, 2290, 17993, 1999, 1996, 5933, 7629,
         103,   103,   102,     0,     0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
tensor([ -100, -100, -100, -100, 4295, -100, -100, 11265, -100, -100,
         6914, -100, 8285, -100, 2389, -100, -100, -100, -100, 4935,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         4962, 1012, -100, -100, -100])




from transformers import BartTokenizer, DataCollatorForLanguageModeling
import torch

def load_dataset_mlm(sentences, tokenizer_class=BartTokenizer,
                     collator_class=DataCollatorForLanguageModeling,
                     mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True,
                       truncation=True)
    # Random masking configuration
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,  # True for Masked Language Modelling
        mlm_probability=mlm_probability  # Chance for every token to get masked
    )
    # The collator expects a sequence of 1-D tensors, so split the batch
    # into one tensor per sentence (this also drops the first dimension).
    tuple_ids = tuple(inputs['input_ids'].unbind(dim=0))
    # Get input_ids, attention_masks and labels for each sentence.
    batch = data_collator(tuple_ids)
    # For an encoder-decoder model the labels are the full, uncorrupted inputs.
    batch['labels'] = inputs['input_ids']
    return batch['input_ids'], inputs['attention_mask'], batch['labels']

input_ids, attention_mask, labels = load_dataset_mlm(sentences)

# Output for the first sentence, in order: input_ids, attention_mask, labels
tensor([   0, 38831, 2577, 1054,   18, 2199,   16,   10, 14913, 28904,
         5777, 3693, 32226, 38868, 2199,   775,   528,     7, 2919,     9,
        48052,   636,   230, 3450, 35315,   11,     5, 50264, 50264, 50264,
            4,     2])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
tensor([   0, 38831, 2577, 1054,   18, 2199,   16,   10, 14913, 28904,
         5777, 3693, 32226, 38868, 2199,   775,   528,     7, 2919,     9,
        48052,   636,   230, 3450, 35315,   11,     5, 8217, 24276, 10596,
            4,     2])


Token Deletion

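With Token Deletion, the model has to learn both the exact positions of the missing words and what those words are, so it must learn more features than with Token Masking alone.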
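The idea above can be sketched at word level without any tokenizer. In this minimal example, `delete_tokens`, its 0.2 default and the seed are illustrative assumptions; the point is only that the corrupted sequence carries no marker where a word was removed, while the target keeps the full text.

```python
import random

def delete_tokens(tokens, probability=0.2, seed=None):
    """Drop each token independently with the given probability.

    Returns the corrupted sequence and an untouched copy used as the target.
    """
    rng = random.Random(seed)
    kept = [tok for tok in tokens if rng.random() >= probability]
    return kept, list(tokens)

tokens = "Variability of age of onset is an important factor".split()
corrupted, target = delete_tokens(tokens, probability=0.2, seed=1)
print(corrupted)  # never longer than the target, and contains no mask marker
print(target)
```

Compared with masking, the model receives no hint about where the gaps are, which is what makes the task harder.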


import torch.nn.functional as F

def token_deletion(sentences, tokenizer_class=BartTokenizer,
                   collator_class=DataCollatorForLanguageModeling,
                   mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,
        mlm_probability=mlm_probability
    )
    tuple_ids = tuple(inputs['input_ids'].unbind(dim=0))
    batch = data_collator(tuple_ids)
    # We use the initial (uncorrupted) inputs as labels
    batch['labels'] = inputs['input_ids'].clone()
    # Removing every position that holds the mask token turns masking into
    # token deletion. The mask identifier depends on the model
    # (50264 for facebook/bart-base), so we read it from the tokenizer.
    mask = batch['input_ids'] != tokenizer.mask_token_id
    initial_size = batch['input_ids'].size(1)
    total_sentences = batch['input_ids'].size(0)
    # When we remove the masked tokens, we must fill with the padding
    # token (1 for BART), otherwise the tensor size is not respected.
    for i in range(total_sentences):
        new_tensor = batch['input_ids'][i][mask[i]]
        new_tensor = F.pad(new_tensor, (0, initial_size - new_tensor.size(0)),
                           value=tokenizer.pad_token_id)
        batch['input_ids'][i] = new_tensor
        # The new padding positions must not be attended to
        attention_mask = batch['input_ids'][i] == tokenizer.pad_token_id
        inputs['attention_mask'][i][attention_mask] = 0
    return batch['input_ids'], inputs['attention_mask'], batch['labels']

input_ids, attention_mask, labels = token_deletion(sentences)

# Output for the first sentence, in order: input_ids, attention_mask, labels
tensor([   0, 38831, 2577, 1054, 2199, 14913, 28904, 3693, 32226, 38868,
         2199,   775,   528,     7, 2919,     9, 23404,   636,   230, 35315,
           11,     5, 24276, 10596,     4,     2,     1,     1,     1,     1,
            1,     1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 0, 0, 0, 0, 0])
tensor([   0, 38831, 2577, 1054,   18, 2199,   16,   10, 14913, 28904,
         5777, 3693, 32226, 38868, 2199,   775,   528,     7, 2919,     9,
        48052,   636,   230, 3450, 35315,   11,     5, 8217, 24276, 10596,
            4,     2])

When BART is trained with Token Deletion, results improve somewhat on long-sequence question answering, summarization and conversational tasks.

Text Infilling

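Text Infilling lets the model learn how many words each masked position can stand for, whereas the previous methods assume exactly one word per masked position.

Text Infilling is similar to Token Masking in that a mask is applied to the original text with some probability. The difference is that a single mask can cover several words. In BART, span lengths are drawn from a Poisson distribution with lambda = 3, which means that on average each masked span replaces three words with a single <mask> token; because this is a probability distribution, a given span may cover more or fewer words.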
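To make the "three words on average" statement concrete, the sketch below draws span lengths from a Poisson distribution with lambda = 3 using only the standard library. The helper `sample_poisson` (Knuth's multiplication method), the seed and the sample size are assumptions chosen for illustration; NumPy's `np.random.poisson` would serve equally well.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(0)
span_lengths = [sample_poisson(3, rng) for _ in range(10_000)]
mean_length = sum(span_lengths) / len(span_lengths)
print(round(mean_length, 2))                 # close to 3; the exact value depends on the seed
print(min(span_lengths), max(span_lengths))  # individual spans vary widely around the mean
```

Note that a span length of zero is possible. In the BART paper a zero-length span corresponds to inserting a <mask> token without removing any words.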


import numpy as np
from transformers import BartTokenizer

def text_infilling(sentence, probability=0.2, poisson_lambda=3):
    # "sentence" is a list of words.
    # We'll use a binary mask to determine which words to replace
    mask = np.random.choice([0, 1], size=len(sentence), p=[1-probability, probability])
    # Now we'll replace the chosen words with a mask token
    # We'll also use a Poisson distribution to determine the length of the spans to mask
    for i in range(len(mask)):
        if mask[i] == 1:
            span_length = np.random.poisson(poisson_lambda)
            for j in range(span_length):
                if i + j < len(sentence):
                    sentence[i + j] = "<mask>"
    # Collapse each run of consecutive "<mask>" words into a single "<mask>"
    infilled_sentence = []
    for token in range(len(sentence)):
        if sentence[token] == "<mask>":
            if token < len(sentence)-1 and sentence[token+1] == "<mask>":
                continue
        infilled_sentence.append(sentence[token])
    return " ".join(infilled_sentence)

def text_infilling_input(masked_sentences, sentences, tokenizer_class=BartTokenizer):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(masked_sentences, return_tensors='pt', padding=True, truncation=True)
    labels = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    return inputs['input_ids'], inputs['attention_mask'], labels['input_ids']

# Split every sentence into words and corrupt it
masked_sentences = [text_infilling(sentence.split()) for sentence in sentences]
input_ids, attention_mask, labels = text_infilling_input(masked_sentences, sentences)

# Output for the first sentence, in order: input_ids, attention_mask, labels
tensor([   0, 50264,   16, 50264, 2199,   775,   528, 50264, 48052,   636,
        50264, 8217, 24276, 10596,     4,     2,     1,     1,     1,     1,
            1,     1,     1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
tensor([   0, 38831, 2577, 1054,   18, 2199,   16,   10, 14913, 28904,
         5777, 3693, 32226, 38868, 2199,   775,   528,     7, 2919,     9,
        48052,   636,   230, 3450, 35315,   11,     5, 8217, 24276, 10596,
            4,     2])

Text Infilling improves BART's results more than Token Deletion does, giving better generation on question answering, text summarization and conversational tasks.

Sentence Permutation


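In Sentence Permutation it is essential to consider how many sentences fit in the model's input sequence (for small models, the input sequence length is between 512 and 1024 tokens). Once the number of sentences that fit has been determined, they are separated into a list or array and drawn at random without repeating any of them.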
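The two steps described above (count how many sentences fit, then draw them in random order without repetition) can be sketched as follows. The helpers `sentences_that_fit` and `permute_without_repetition` are hypothetical, and counting whitespace-separated words is only a stand-in for the model's real token count, which would come from its tokenizer.

```python
import random

def sentences_that_fit(sentences, max_tokens):
    """Count how many leading sentences fit within max_tokens.

    Whitespace-separated words are used as an approximate token count.
    """
    used = 0
    count = 0
    for sentence in sentences:
        length = len(sentence.split())
        if used + length > max_tokens:
            break
        used += length
        count += 1
    return count

def permute_without_repetition(sentences, count, seed=None):
    """Return the first `count` sentences in random order, each exactly once."""
    rng = random.Random(seed)
    chosen = sentences[:count]
    return rng.sample(chosen, k=len(chosen))

example = ["A b c.", "D e.", "F g h i.", "J."]
n = sentences_that_fit(example, max_tokens=6)
print(n)  # 2: the third sentence would exceed the budget of 6
print(permute_without_repetition(example, n, seed=0))
```

Sampling without replacement guarantees that every selected sentence appears exactly once, so the original text can always be recovered by reordering.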

import random

# Helper that joins a list of sentences into one string
# (also used in the Document Rotation section).
def sentence_joiner(sentences: list):
    return ' '.join(sentences)

# It selects the first "number_sentences" within a given set of "sentences"
# and returns those sentences in a random order.
def sentence_permutation(sentences, number_sentences):
    new_sentences = sentences[:number_sentences]
    random.shuffle(new_sentences)
    new_sentences = sentence_joiner(new_sentences)
    return new_sentences

def permuted_data_generation(sentences: list, total_sentences: int):
    training_sentences = []
    training_labels = []
    sentences_copy = sentences.copy()
    # We slide a window of "total_sentences" over the text: each iteration
    # yields one example, then drops the oldest sentence so that the next
    # example takes in one new sentence.
    for _ in range(len(sentences)-total_sentences+1):
        new_sentences = sentence_permutation(sentences_copy, total_sentences)
        joined_sentences = sentence_joiner(sentences_copy[:total_sentences])
        sentences_copy = sentences_copy[1:]
        training_sentences.append(new_sentences)
        training_labels.append(joined_sentences)
    return training_sentences, training_labels

def permutation_training(sentences: list, sentences_labels: list,
                         tokenizer_class=BartTokenizer,
                         collator_class=DataCollatorForLanguageModeling,
                         mlm=True, mlm_probability=0.0):
    # We get input_ids and attention mask from the permuted sentences
    input_ids, attention_mask, _ = load_dataset_mlm(sentences, tokenizer_class, collator_class, mlm, mlm_probability)
    # Labels from the original sentences
    labels, _, _ = load_dataset_mlm(sentences_labels, tokenizer_class, collator_class, mlm, mlm_probability)
    return input_ids.squeeze(0), attention_mask.squeeze(0), labels.squeeze(0)

training_sentences, training_labels_sentences = permuted_data_generation(sentences, 3)
input_ids, attention_mask, labels = permutation_training(training_sentences, training_labels_sentences)

# Output for the first example, in order: input_ids, attention_mask, labels
tensor([   0, 38831, 2577, 1054,   18, 2199,   16,   10, 14913, 28904,
         5777, 3693, 32226, 38868, 2199,   775,   528,     7, 2919,     9,
        48052,   636,   230, 3450, 35315,   11,     5, 8217, 24276, 10596,
            4, 2585, 33430, 8457,     9, 41419, 8217, 1054,   36,   119,
        49491,   43, 10596, 37118,   32,   45,   157,   684,     4, 4129,
        33839, 4405, 35019,     9,     5, 19850, 34939, 3724,   204,   717,
           12, 21792,   775,   11,     5, 39752,     9,     5, 19850,   797,
          981,     7, 15067, 8276, 37423,     8, 46282, 5043,     4,     2])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1])
tensor([   0, 38831, 2577, 1054,   18, 2199,   16,   10, 14913, 28904,
         5777, 3693, 32226, 38868, 2199,   775,   528,     7, 2919,     9,
        48052,   636,   230, 3450, 35315,   11,     5, 8217, 24276, 10596,
            4, 4129, 33839, 4405, 35019,     9,     5, 19850, 34939, 3724,
          204,   717,   12, 21792,   775,   11,     5, 39752,     9,     5,
        19850,   797,   981,     7, 15067, 8276, 37423,     8, 46282, 5043,
            4, 2585, 33430, 8457,     9, 41419, 8217, 1054,   36,   119,
        49491,   43, 10596, 37118,   32,   45,   157,   684,     4,     2])


Document Rotation


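To apply Document Rotation, the dimensions used for each batch have to be taken into account. When padding is applied, the padding must not be rotated with the rest of the document; it has to stay in its original position while the document itself rotates.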
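The padding constraint described above can be illustrated on a plain list of token ids. In this sketch the function `rotate_keeping_padding` is a hypothetical helper, the ids follow the BART convention (0 = start, 2 = end, 1 = padding) and padding is assumed to appear only at the end of the sequence.

```python
import random

def rotate_keeping_padding(token_ids, pad_id, seed=None):
    """Rotate the tokens between the start and end markers, leaving padding in place."""
    rng = random.Random(seed)
    # Find where the trailing padding begins.
    content_length = len(token_ids)
    while content_length > 0 and token_ids[content_length - 1] == pad_id:
        content_length -= 1
    content = token_ids[:content_length]
    padding = token_ids[content_length:]
    if len(content) < 4:
        return list(token_ids)  # too short to rotate anything
    start, body, end = content[0], content[1:-1], content[-1]
    pivot = rng.randrange(1, len(body))  # index of the token that becomes first
    rotated = body[pivot:] + body[:pivot]
    return [start] + rotated + [end] + padding

ids = [0, 11, 12, 13, 14, 2, 1, 1]
print(rotate_keeping_padding(ids, pad_id=1, seed=0))
```

Only the tokens between the start and end markers move, so the attention mask computed before rotation remains valid afterwards.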

import torch
from transformers import BartTokenizer

def sentence_joiner(sentences: list):
    return ' '.join(sentences)

# With this function we gather as many sentences as we want to form the input data to the tokenizer.
def rotated_data_generation(sentences: list, total_sentences: int):
    training_sentences = []
    sentences_copy = sentences.copy()
    for _ in range(len(sentences)-total_sentences+1):
        new_sentences = sentences_copy[:total_sentences]
        new_sentences = sentence_joiner(new_sentences)
        sentences_copy = sentences_copy[1:]
        training_sentences.append(new_sentences)
    return training_sentences

# Apply this function to the sentence groups produced by the previous function
def document_rotation_training(sentences, tokenizer_class=BartTokenizer):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    tokens = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    # The labels are the documents before rotation
    tokens['labels'] = tokens['input_ids'].clone()
    iterations = tokens['input_ids'].size(0)
    for i in range(iterations):
        # Get the attention mask and convert to list
        attention_mask = tokens['attention_mask'][i].tolist()
        # Calculate the position where padding starts
        if 0 in attention_mask:
            padding_start_position = attention_mask.index(0)
        else:
            padding_start_position = False
        # We take into account the position of the padding so as not to rotate it along with the rest of the document.
        if padding_start_position:
            random_token = torch.randint(1, padding_start_position-1, (1,))
            tokens['input_ids'][i] = torch.cat((
                tokens['input_ids'][i][0].unsqueeze(0),  # initial token
                tokens['input_ids'][i][random_token.item():padding_start_position-1],  # from random to end token
                tokens['input_ids'][i][1:random_token.item()],  # from 1 to random
                tokens['input_ids'][i][padding_start_position-1:]), 0)  # end token and padding stay in place
        # If there is no padding, we rotate the document without taking the padding into account.
        else:
            random_token = torch.randint(1, tokens['input_ids'].size(1)-1, (1,))
            tokens['input_ids'][i] = torch.cat((
                tokens['input_ids'][i][0].unsqueeze(0),  # initial token
                tokens['input_ids'][i][random_token.item():-1],  # from random to end
                tokens['input_ids'][i][1:random_token.item()],  # from 1 to random
                tokens['input_ids'][i][-1].unsqueeze(0)), 0)  # end token
    return tokens['input_ids'], tokens['attention_mask'], tokens['labels']

data = rotated_data_generation(sentences, 3)
input_ids, attention_mask, labels = document_rotation_training(data)

# Output for one of the examples, in order: input_ids, attention_mask, labels
tensor([   0, 2433,   61,   32,   551,   88, 1316,   32,   12, 4138,
        15557, 47605,     6, 22835, 2591,   939,     4,   242, 10079, 38422,
         9235,     6, 10295, 22540, 14819,     8, 3039, 11543,     4,   347,
        37347, 8457,     9, 41419, 8217, 1054,   36,   119, 49491,   43,
        10596, 37118,   32,   45,   157,   684,     4, 41058, 4484,     9,
         1046,     9, 23808,   16,   41,   505, 3724,     9, 18073,   18,
         2199, 18246, 4194,     8, 13430, 3505,     4,   20,     2,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
tensor([   0,   347, 37347, 8457,     9, 41419, 8217, 1054,   36,   119,
        49491,   43, 10596, 37118,   32,   45,   157,   684,     4, 41058,
         4484,     9, 1046,     9, 23808,   16,   41,   505, 3724,     9,
        18073,   18, 2199, 18246, 4194,     8, 13430, 3505,     4,   20,
         2433,   61,   32,   551,   88, 1316,   32,   12, 4138, 15557,
        47605,     6, 22835, 2591,   939,     4,   242, 10079, 38422, 9235,
            6, 10295, 22540, 14819,     8, 3039, 11543,     4,     2,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1])



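This article covered the different token masking and text corruption strategies used to train language models. Although all of these are fairly common methods, most models use only Token Masking.

For short text sequences, Sentence Permutation and Document Rotation may not help and can even reduce accuracy, whereas Token Masking, Token Deletion and Text Infilling can be used with both short and long text sequences.

Editor: Hua Xuan. Source: DeepHub IMBA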

