The Annotated GPT2, Annotation-Enhanced Edition: You Only Really Understand GPT When You Can Read the Code


The Annotated Transformer re-implements the 2017 Transformer paper from scratch, and the annotation-enhanced edition of The Annotated Transformer adds further comments and tensor-shape information on top of it to expose more of the Transformer's details. The original Transformer is an encoder-decoder model. In this article I use the same approach to study GPT-2, the simplest model in the GPT family. GPT is a decoder-only architecture, so we only need to pay attention to the right-hand side of the Transformer diagram.

[Figure: the original Transformer architecture; GPT-2 uses only the decoder (right-hand) stack.]

Because the full code is quite long, it is not all copied here; it is recommended to open the notebook below and read it alongside this article.

Runtime environment: Google Colab

https://github.com/AIDajiangtang/annotated-transformer/blob/master/gpt_model_from_scratch.ipynb

-1. Hyperparameters

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # vocabulary size
    "context_length": 256,   # context length
    "emb_dim": 768,          # embedding dimension
    "n_heads": 12,           # number of attention heads
    "n_layers": 12,          # number of TransformerBlocks (N=12)
    "drop_rate": 0.1,        # dropout rate
    "qkv_bias": False        # no bias in the Q/K/V projections
}

0. Downloading the training data

The training data is Edith Wharton's short story "The Verdict".

import requests
import tiktoken

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

# Download the data: send a GET request to the URL
response = requests.get(url)
text_data = response.text

# GPT-2 BPE tokenizer (introduced in section 1 below)
tokenizer = tiktoken.get_encoding("gpt2")

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print(f"Total characters: {total_characters}")
print(f"Total tokens: {total_tokens}")

Total characters: 20479
Total tokens: 5145

The training data contains 20,479 characters and 5,145 tokens; part of it is printed below.

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)
"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its rs
Well!--even through the prism of Hermia's tears I felt able to face the fact with equanimity. Poor Jack Gisburn! The women had made him--it was fitting that they should mourn him. Among his own sex fewer regrets were heard, and in his own trade hardly a murmur. Professional jealousy? Perhaps. If it were, the honour of the craft was vindicated by little Claude Nutley, who, in all good faith, brought out in the Burlington a very handsome "obituary" on Jack--one of those showy articles stocked with random technicalities that I have heard (I won't say by whom) compared to Gisburn's painting. And so--his resolve being apparently irrevocable--the discussion gradually died out, and, as Mrs. Thwing had predicted, the price of "Gisburns" went up.
It was not till three years later that, in the course of a few weeks' idling on the Riviera, it suddenly occurred to me to wonder why Gisburn had given up his painting. On reflection, it really was a tempting problem. To accuse his wife would have been too easy--his fair sitters had been denied the solace of saying that Mrs. Gisburn had "dragged him down." For Mrs. Gisburn--as such--had not existed till nearly a year after Jack's resolve had been taken. It might be that he had married her--since he liked his ease--because he didn't want to go on painting; but it would have been hard to prove that he had given up his painting because he had married her.
Of course, if she had not dragged him down, she had equally, as Miss Croft contended, failed to "lift him up"--she had not led him back to the easel. To put the brush into his hand again--what a vocation for a wife! But Mrs. Gisburn appeared to have disdained it--and I felt it might be interesting to find out why.

1. Tokenization

The GPT-2 we are going to implement uses a subword-based tokenization method, byte-pair encoding (BPE).

Tokenization converts the text above into integer indices. According to the hyperparameter "vocab_size": 50257, the vocabulary contains 50,257 words (subwords), so the indices range from 0 to 50256.

import torch
import tiktoken

# Load the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Convert text to token ids
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # .unsqueeze(0) adds the batch dimension
    return encoded_tensor

# Convert token ids back to text
def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # Remove the batch dimension
    return tokenizer.decode(flat.tolist())

If you want a more detailed look at tokenization methods, see this article:

1. 图解tokenization (Illustrated Tokenization)

To see the text behind the token ids in the output logs later on, you can use text_to_token_ids and token_ids_to_text to convert back and forth between text and ids.

You can also use the online tool The Tokenizer Playground. There is no GPT-2 option in the Playground, but you can select GPT-3 instead.

https://huggingface.co/spaces/Xenova/the-tokenizer-playground

[Figure: The Tokenizer Playground splitting "lowest" into the subwords "low" and "est".]

As the figure above shows, the BPE tokenizer splits "lowest" into two subwords, "low" and "est".
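You can reproduce this locally with tiktoken (a small sketch; the printed ids depend on the GPT-2 vocabulary):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode("lowest")
print(ids)                                    # two subword ids
print([tokenizer.decode([i]) for i in ids])   # ['low', 'est']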

2. Constructing the inputs X and labels Y

The training data contains 5,145 tokens in total; 90% of them are used for training and 10% for validation.

train_ratio = 0.90 # 90% of data will be training, 10% will be validation
split_index = int(train_ratio * len(text_data))
train_data = text_data[:split_index]
val_data = text_data[split_index:]

According to the hyperparameter "context_length": 256, each sample fed to the model during training is 256 tokens long. With stride=256, we start at the beginning of the training data and take 256 tokens every 256 tokens as the input X; the label Y is the same window shifted one position to the right.

import torch
from torch.utils.data import Dataset, DataLoader

 # Create a data loader
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        for i in range(0, len(token_ids) - max_length, stride):
            # The input chunk
            input_chunk = token_ids[i:i + max_length]
            # The target chunk is the input chunk, offset by 1 token
            target_chunk = token_ids[i + 1: i + max_length + 1]

            # Append chunk to list of chunks
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

Because batch_size=2, each batch's inputs and labels both have shape torch.Size([2, 256]).

torch.manual_seed(123)
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],  # 256
    stride=GPT_CONFIG_124M["context_length"],      # 256
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0,
)

# Build the data loader
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
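A quick sanity check of the batch shapes (a sketch that assumes the loaders built above):

for x, y in train_loader:
    print(x.shape, y.shape)   # torch.Size([2, 256]) torch.Size([2, 256])
    break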

To make this easier to understand, here is a simple example. Suppose the training data has token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], with context_length=4, stride=4, and batch_size=2.

Input IDs: [tensor([1, 2, 3, 4]), tensor([5, 6, 7, 8])]
Target IDs: [tensor([2, 3, 4, 5]),tensor([6, 7, 8, 9])]
X=[tensor([1, 2, 3, 4]), tensor([5, 6, 7, 8])]
Y=[tensor([2, 3, 4, 5]),tensor([6, 7, 8, 9])]
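The same windowing can be reproduced in a few lines (a minimal sketch of the loop inside GPTDatasetV1, using plain Python lists instead of tensors):

token_ids = list(range(1, 11))   # [1, 2, ..., 10]
max_length, stride = 4, 4
inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i:i + max_length])            # the input window
    targets.append(token_ids[i + 1:i + max_length + 1])   # the same window shifted right by one
print(inputs)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(targets)   # [[2, 3, 4, 5], [6, 7, 8, 9]]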

3. Word embeddings


Next, the [2, 256] token ids are converted into word embeddings. According to the hyperparameter "emb_dim": 768, each token id is mapped to a 768-dimensional vector.

Put simply, each of these 768 numbers can encode one attribute of the token; the higher the dimension, the richer the attributes that can be represented. If you want to understand word embeddings in more detail, see the related article.

Also, the attention computation itself does not take the relative positions of tokens into account, so a positional encoding is added to the word embedding. The positional-encoding vector has the same dimension as the word embedding, 768.

import torch.nn as nn

class GPTModel(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.token_embedding = nn.Embedding(config["vocab_size"], config["emb_dim"])          # [50257, 768]
    self.positional_embedding = nn.Embedding(config["context_length"], config["emb_dim"]) # [256, 768]
    # randomly zeroes some elements; does not change the shape
    self.drop_embedding = nn.Dropout(config["drop_rate"])

    self.transformer_blocks = nn.Sequential(
        *[TransformerBlock(config) for _ in range(config["n_layers"])]
    )
    # normalizes the activations; does not change the shape
    self.final_norm = LayerNorm(config["emb_dim"])
    self.out_head = nn.Linear(config["emb_dim"], config["vocab_size"], bias=False)

  def forward(self, in_idx):
    batch_size, sequence_length = in_idx.shape       # [2, 256] during training, [1, N] during inference
    token_embeddings = self.token_embedding(in_idx)  # [2, 256, 768] during training, [1, N, 768] during inference
    positional_embeddings = self.positional_embedding(
        torch.arange(sequence_length, device=in_idx.device)
    )                                                # [256, 768]
    x = token_embeddings + positional_embeddings     # [2, 256, 768] during training, [1, N, 768] during inference
    x = self.drop_embedding(x)                       # [2, 256, 768]

    x = self.transformer_blocks(x)                   # [2, 256, 768] during training, [1, N, 768] during inference
    x = self.final_norm(x)                           # [2, 256, 768] during training, [1, N, 768] during inference
    logits = self.out_head(x)                        # [2, 256, 50257] during training, [1, N, 50257] during inference
    return logits

Word embedding and positional encoding are performed by the two learnable embedding layers below. An embedding layer can be thought of simply as a lookup table (map) that maps a token id or a position index to a vector.

self.token_embedding = nn.Embedding(config["vocab_size"], config["emb_dim"])
self.positional_embedding = nn.Embedding(config["context_length"], config["emb_dim"])

During training, the token_embedding layer maps a batch of inputs of [2, 256] token ids to word embeddings of shape [2, 256, 768].

During inference, the token_embedding layer maps the input [1, N] to word embeddings of shape [1, N, 768], where N keeps growing as the next token is predicted, until the end-of-text token is produced.

positional_embeddings = self.positional_embedding(
    torch.arange(sequence_length, device=in_idx.device)
)

The positional_embedding layer maps the 256 positions to a [256, 768] positional-encoding matrix; the positions are generated with torch.arange and contain 0, 1, ..., 255.

The positional encoding is [256, 768] rather than [2, 256, 768] because all samples in a batch share the same positional encoding (it is broadcast across the batch dimension).

Unlike the fixed, formula-based positional encoding in the previous article, GPT-2 uses a learnable one: the contents of positional_embedding are updated by backpropagation during training.

The final output is a [2, 256, 768] tensor of word embeddings.
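A minimal sketch of the two lookup tables and the broadcast add (shapes only, random ids, untrained weights):

import torch
import torch.nn as nn

token_embedding = nn.Embedding(50257, 768)       # vocab_size x emb_dim
positional_embedding = nn.Embedding(256, 768)    # context_length x emb_dim

batch = torch.randint(0, 50257, (2, 256))        # a fake batch of token ids
tok = token_embedding(batch)                     # [2, 256, 768]
pos = positional_embedding(torch.arange(256))    # [256, 768]
x = tok + pos                                    # pos broadcasts over the batch -> [2, 256, 768]
print(tok.shape, pos.shape, x.shape)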

4. TransformerBlock


According to the hyperparameter "n_layers": 12, the data then passes through 12 TransformerBlock modules that share the same structure but have independent parameters.

class TransformerBlock(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.attention = MultiHeadAttention(
        d_in=config["emb_dim"],
        d_out=config["emb_dim"],
        context_length=config["context_length"],
        dropout=config["drop_rate"],
        num_heads=config["n_heads"],
        qkv_bias=config["qkv_bias"]
    )

    self.ff = FeedForward(config)
    self.norm1 = LayerNorm(config["emb_dim"])
    self.norm2 = LayerNorm(config["emb_dim"])
    self.drop_shortcut = nn.Dropout(config["drop_rate"])

  def forward(self, x):
    shortcut = x

    # Attention layer
    x = self.norm1(x)
    x = self.attention(x)
    x = self.drop_shortcut(x)
    x = x + shortcut         # Add the original input back

    # Feedforward layer
    shortcut = x
    x = self.norm2(x)
    x = self.ff(x)
    x = self.drop_shortcut(x)
    x = x + shortcut         # Add the original input back
    return x

A TransformerBlock is composed of MultiHeadAttention, FeedForward, and LayerNorm.
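Before walking through the internals, a quick check that a block preserves the tensor shape (a sketch assuming the config and classes defined in this article):

torch.manual_seed(123)
block = TransformerBlock(GPT_CONFIG_124M)
x = torch.randn(2, 256, 768)      # [batch, seq_len, emb_dim]
print(block(x).shape)             # torch.Size([2, 256, 768])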

Next, let's look at how data flows through these layers.

5. MultiHeadAttention

class MultiHeadAttention(nn.Module):
  def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
    super().__init__()

    assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

    self.d_out = d_out                  # 768
    self.num_heads = num_heads          # 12
    self.head_dim = d_out // num_heads  # 64
    self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
    self._out_proj = nn.Linear(d_out, d_out)
    self.dropout = nn.Dropout(dropout)
    self.register_buffer(
        'mask',
        torch.triu(torch.ones(
            context_length,             # 256
            context_length,             # 256
          ), diagonal=1)
    )

  def forward(self, x):
    batch_size, num_tokens, embedding_length = x.shape
    keys = self.W_key(x)
    queries = self.W_query(x)
    values = self.W_value(x)

    # Add the num_heads and head_dim dimensions
    keys = keys.view(batch_size, num_tokens, self.num_heads, self.head_dim)       # Transform to a tensor of dimensions: 2 x 256 x 12 x 64
    queries = queries.view(batch_size, num_tokens, self.num_heads, self.head_dim) # Transform to a tensor of dimensions: 2 x 256 x 12 x 64
    values = values.view(batch_size, num_tokens, self.num_heads, self.head_dim)   # Transform to a tensor of dimensions: 2 x 256 x 12 x 64

    # Transpose from (batch_size, num_tokens, num_heads, head_dim) to (batch_size, num_heads, num_tokens, head_dim)
    queries = queries.transpose(1, 2)
    keys = keys.transpose(1, 2)
    values = values.transpose(1, 2)

    # Calculate attention scores
    attention_scores = queries @ keys.transpose(2, 3)
    mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

    # Mask the attention scores
    attention_scores.masked_fill_(mask_bool, -torch.inf)

    # Calculate attention weights
    attention_weights = torch.softmax(attention_scores / keys.shape[-1]**0.5, dim=-1)

    # Apply dropout to attention weights
    attention_weights = self.dropout(attention_weights)

    # Calculate context vectors
    context_vectors = (attention_weights @ values).transpose(1, 2)

    # Concatenate the context vectors
    context_vectors = context_vectors.contiguous().view(batch_size, num_tokens, self.d_out)
    return self._out_proj(context_vectors)

The input word embeddings [2, 256, 768] are first transformed by three [768, 768] matrices to produce q, k, and v, each of shape [2, 256, 768].

According to the hyperparameter "n_heads": 12, q, k, and v are reshaped to [2, 256, 12, 64] and then transposed to [2, 12, 256, 64]: the original 768-dimensional embedding is split across 12 heads of 64 dimensions each, which is what implements multi-head attention.


Then the attention for each head is computed; the attention-score matrix has shape [2, 12, 256, 256].

To prevent attending to future positions, an upper-triangular mask matrix [256, 256] is constructed whose entries above the diagonal are True. The entries of the attention-score matrix where the mask is True are set to negative infinity, so that after softmax they become zero, masking out attention to future positions.

self.register_buffer(
    'mask',
    torch.triu(torch.ones(
        context_length,             # 256
        context_length,             # 256
      ), diagonal=1)
)
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

# Mask the attention scores
attention_scores.masked_fill_(mask_bool, -torch.inf)



Then the attention weights [2, 12, 256, 256] are multiplied by v [2, 12, 256, 64], producing [2, 12, 256, 64].

Finally, the outputs of the heads are concatenated back together via reshape into [2, 256, 768], passed through a [768, 768] output projection to give [2, 256, 768], and then added back to the block input through a residual connection, again producing [2, 256, 768].
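The whole attention pass can be traced with random tensors (a shape-only sketch, no trained weights):

import torch

b, T, d, h = 2, 256, 768, 12                      # batch, seq_len, emb_dim, heads
q = torch.randn(b, h, T, d // h)                  # queries after view + transpose
k = torch.randn(b, h, T, d // h)
v = torch.randn(b, h, T, d // h)

scores = q @ k.transpose(2, 3)                    # [2, 12, 256, 256]
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores.masked_fill_(mask, -torch.inf)             # hide future positions
weights = torch.softmax(scores / (d // h) ** 0.5, dim=-1)
context = (weights @ v).transpose(1, 2).reshape(b, T, d)   # [2, 256, 768]
print(scores.shape, context.shape)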

Unlike the original Transformer's encoder-decoder architecture, GPT-2 is decoder-only, so q, k, and v all come from the input (or the previous layer's output) rather than from an encoder.

6. LayerNorm

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normalized_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * normalized_x + self.shift

The purpose of LayerNorm is numerical stability during training. It does not change the shape: the input and output of a LayerNorm layer are both [2, 256, 768].
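A quick check of the normalization, using the LayerNorm class above: before scale and shift are learned, the last dimension of the output has mean ≈ 0 and variance ≈ 1.

torch.manual_seed(123)
ln = LayerNorm(768)
x = torch.randn(2, 256, 768)
out = ln(x)                                           # torch.Size([2, 256, 768])
print(out.mean(dim=-1)[0, 0].item())                  # ~0
print(out.var(dim=-1, unbiased=False)[0, 0].item())   # ~1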

7. FeedForward

FeedForward is an MLP. The output of the preceding LayerNorm is [2, 256, 768]; the 2*256 token embeddings pass through the MLP in parallel, are first projected up to 4*768 dimensions and then back down to 768, with a GELU nonlinearity in between.

# Implement feed-forward neural network
class FeedForward(nn.Module):
  def __init__(self, config):
    super().__init__()
    self.layers = nn.Sequential(
        nn.Linear(config["emb_dim"], 4 * config["emb_dim"]),
        GELU(),
        nn.Linear(4 * config["emb_dim"], config["emb_dim"]),
    )

  def forward(self, x):
    return self.layers(x)

The MLP does not change the input shape [2, 256, 768], but its nonlinear transformation further refines the values of the embeddings, improving the model's representational capacity and producing higher-level, more abstract features.
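The GELU() used by FeedForward is defined elsewhere in the notebook and not shown in this excerpt; a common implementation is the tanh approximation sketched below (PyTorch's built-in nn.GELU() would also work).

import torch
import torch.nn as nn

class GELU(nn.Module):
    def forward(self, x):
        # tanh approximation of the Gaussian Error Linear Unit
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * x ** 3)
        ))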

8. Output


class GPTModel(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.token_embedding = nn.Embedding(config["vocab_size"], config["emb_dim"])
    self.positional_embedding = nn.Embedding(config["context_length"], config["emb_dim"])
    self.drop_embedding = nn.Dropout(config["drop_rate"])

    self.transformer_blocks = nn.Sequential(
        *[TransformerBlock(config) for _ in range(config["n_layers"])]
    )

    self.final_norm = LayerNorm(config["emb_dim"])
    self.out_head = nn.Linear(config["emb_dim"], config["vocab_size"], bias=False)

  def forward(self, in_idx):
    batch_size, sequence_length = in_idx.shape
    token_embeddings = self.token_embedding(in_idx)
    positional_embeddings = self.positional_embedding(
        torch.arange(sequence_length, device=in_idx.device)
    )
    x = token_embeddings + positional_embeddings
    x = self.drop_embedding(x)

    x = self.transformer_blocks(x)
    x = self.final_norm(x)
    logits = self.out_head(x)
    return logits

The MLP output [2, 256, 768] first passes through a final LayerNorm.

Then the 2*256 tokens pass in parallel through an output linear layer [768, 50257], mapping [2, 256, 768] to [2, 256, 50257].

In other words, every token position outputs a distribution over the vocabulary: after softmax, the 50,257 values give the probability that the next token is each of the 50,257 words in the vocabulary.
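Putting it all together, instantiating the model and running one batch through it confirms the output shape (a sketch with random token ids and untrained weights):

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
x = torch.randint(0, 50257, (2, 256))    # a fake batch of token ids
logits = model(x)
print(logits.shape)                      # torch.Size([2, 256, 50257])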

9. Computing the loss

During training, the loss has to be computed in order to update the parameters. How is the loss computed from the [2, 256, 50257] output?

The labels were already constructed when the training data was prepared; their shape matches the input X, i.e. [2, 256].

def calc_loss_batch(input_batch, target_batch, model, device):
  """
  Calculates the loss for a single batch.
  """
  input_batch = input_batch.to(device)
  target_batch = target_batch.to(device)


  # Run the model
  logits = model(input_batch)
  print("target_batch loss")
  print(target_batch.flatten().shape)
  print("logits.flatten(0, 1)")
  print(logits.flatten(0, 1).shape)
  # Calculate the loss
  loss = torch.nn.functional.cross_entropy(
      logits.flatten(0, 1),
      target_batch.flatten(),
  )
  return loss

input_batch is the input X with shape [2, 256] and target_batch is the label with shape [2, 256]. The input passes through the model to produce [2, 256, 50257]; flattening gives [512, 50257] for the logits and [512] for the labels, where each label element is an index into the vocabulary.

cross_entropy takes these class indices directly: for each of the 512 positions it applies log-softmax to the logits and picks out the log-probability at the target index (equivalent to a dot product with a one-hot vector), then averages the negative log-likelihoods.
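A tiny worked example of cross_entropy over a 3-word "vocabulary" with two token positions (indices, not one-hot vectors, are passed in):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1],    # logits for position 1
                       [0.2, 1.5, 0.3]])   # logits for position 2
targets = torch.tensor([0, 1])             # correct vocabulary index per position
loss = F.cross_entropy(logits, targets)    # mean of -log softmax(logits)[i, targets[i]]
print(loss.item())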

Inference

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)

        logits = logits[:, -1, :]                              # take the distribution of the last token
        probas = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # greedy: pick the most probable token
        idx = torch.cat((idx, idx_next), dim=1)                # append it to the input

    return idx

GPT predicts the next token autoregressively, but during training the matrix operations and the causal mask make the computation parallel across all positions.

During inference, however, each forward pass predicts only one token; the new token is appended to the end of the input, which is fed back in as the next input, until the end-of-text token is produced.

Here we generate 50 tokens, kicked off as sketched below.
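A sketch of the call, using the helpers defined earlier; model is assumed to be the trained GPTModel instance from the notebook, kept on the CPU (otherwise move the ids with .to(device)):

model.eval()
start_context = "Oh Juliet, where is"
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=50,
    context_size=GPT_CONFIG_124M["context_length"],
)
print(token_ids_to_text(token_ids, tokenizer))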

Round 1:

Initial context: "Oh Juliet, where is"

After tokenization:

torch.Size([1, 5])
tensor([[ 5812, 38201,    11,   810,   318]])

Only the distribution at the position of "is" is used. The model predicts that the token after "is" is ",", which is appended to the input to form the next round's input.

Round 2:

Input: "Oh Juliet, where is,"

After tokenization:

torch.Size([1, 6])
tensor([[ 5812, 38201,    11,   810,   318,    13]])

Only the distribution at the position of "," is used. The model predicts that the token after "," is "and", which is appended to the input to form the next round's input.

And so on, until 50 tokens have been generated:

Oh Juliet, where is, and, and,, and,,,, and, and,,,,,,,,, and,,,, and,,,, and,, and,,,,, and,,,,,, and


This article is reposted from the WeChat official account 人工智能大讲堂.

Original link: https://mp.weixin.qq.com/s/YZU9rPPyYZTbSPhCZtbGZg

