A Roundup of the Four Most Commonly Used Language Model Compression Techniques

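Can you make a large language model (LLM) smaller without sacrificing performance? Although interest in ever-larger language models remains strong, Mistral AI has shown that the importance of size is relative, and the growing interest in edge computing pushes us to get good results out of small language models. Another route to a smaller model is compression. In this article I explain the main compression techniques and provide some simple code snippets as examples.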

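Model compression is the practice of minimizing the size of a machine learning model without harming its effectiveness. It works well on large neural networks because they are often over-parameterized and therefore contain redundant computational units.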

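Compression means reducing the number of parameters or the overall memory footprint, which yields a smaller model (for example, from 10GB down to 9GB). This makes models more efficient in both storage and inference speed, and easier to deploy in resource-constrained environments. Common model compression techniques include: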

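Quantization: reduces the memory footprint by changing the precision of the model weights (for example, from 32-bit floating point to 8-bit integers).
Pruning: removes less important weights or neurons, reducing the number of parameters.
Knowledge distillation: trains a smaller model (the student) to imitate a larger model (the teacher), distilling its knowledge into a compressed version with similar performance.
Weight sharing: uses the same weights across different layers to reduce storage requirements, either by design or applied after training.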
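To make the effect of precision on memory footprint concrete, here is a minimal back-of-the-envelope sketch. It counts only the storage for the weights, and the 7-billion-parameter model size is a hypothetical figure chosen for illustration, not one taken from this article.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Hypothetical 7-billion-parameter model, used only for illustration
num_params = 7_000_000_000
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(num_params, precision):.1f} GB")
```

Real deployments also need memory for activations and any quantization metadata (such as scales), so actual numbers will be somewhat higher.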

Model quantization

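Model quantization compresses an LLM by converting its weights or activations from a high-precision representation (typically 32-bit or 16-bit) to a lower-precision one (for example 8-bit, 4-bit, or even binary). We can quantize the weights, quantize the activations, or choose when quantization is applied relative to training: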

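Weight quantization: the weights of a neural network are usually stored as 32-bit or 16-bit floating-point numbers. Quantization reduces them to a lower bit width, such as 8-bit integers (INT8) or 4-bit integers (INT4). This is done by mapping the original range of weight values onto a smaller range that can be represented with fewer bits, which significantly reduces memory usage.
Activation quantization: like weights, activations (the outputs of layers during inference) can be quantized to lower precision. Representing activations with fewer bits reduces the model's memory footprint during inference.
Quantization-aware training (QAT): in QAT the model is trained while quantization is simulated, which lets it adapt to the lower precision. This helps preserve accuracy, because the model learns to be more robust to quantization effects (see the work of Tailor et al. on arXiv).
Post-training quantization (PTQ): this approach trains the model normally at full precision and applies quantization afterwards. PTQ is simpler and faster, but it can cause a larger drop in accuracy than QAT (see the work of Wang et al. at NIPS 2021).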
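The range mapping described under weight quantization can be sketched in a few lines. This is a minimal illustration of one common scheme (affine quantization with a scale and a zero point), written in plain Python; it is not what any particular library does internally, and the example weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-128, 127] with an affine (scale + zero-point) scheme."""
    w_min, w_max = min(weights), max(weights)
    qmin, qmax = -128, 127
    # Step size between adjacent integer levels; fall back to 1.0 for a constant tensor
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    # Integer that the real value 0.0 is shifted by, so that w_min lands on qmin
    zero_point = round(qmin - w_min / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the stored integers."""
    return [(v - zero_point) * scale for v in q]

weights = [-1.2, -0.4, 0.0, 0.3, 0.9, 2.1]  # made-up example values
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point)
print(f"max round-trip error: {max_error:.5f}")
```

For these values the round-trip error stays within about half a quantization step (`scale / 2`), which is the price paid for storing each weight in 8 bits instead of 32.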

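Weight quantization is easy to implement with bitsandbytes. Install the libraries (accelerate is needed for the device_map argument used below):

pip install torch transformers accelerate bitsandbytes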

For example, run the following code for GPT2:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Specify the model you want to use
model_name = "gpt2"  # You can replace this with any other LLM model
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with 8-bit quantization using bitsandbytes
# (the bitsandbytes 8-bit kernels generally require a CUDA GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # Enable 8-bit quantization
    device_map="auto"   # Let accelerate place the weights on the available device
)
# Example text for inference
input_text = "Weight Quantization is an efficient technique for compressing language models."
# Move the inputs to whichever device the model was placed on
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
# Generate text
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50)
# Decode and print the generated text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)

Pruning

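Pruning removes unnecessary or less important weights, neurons, or entire layers, much like cutting superfluous branches off a tree. This shrinks the model, speeds up inference, and lowers memory and compute requirements, making the model more efficient while keeping as much of the original performance as possible.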

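Pruning is less straightforward than quantization, because we first have to find the redundant parts. For example, we need to identify the redundant parameters and then fine-tune the model without them.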

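Most commonly we remove weights, neurons, or layers, but there is growing interest in attention head pruning (specific to Transformer-based models) as a form of structured pruning (see the work of Wang et al. on arXiv). Each attention layer has multiple heads, and some heads contribute more to model performance than others, so attention head pruning removes the less important ones.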
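As a rough illustration of the selection step in attention head pruning, the sketch below ranks heads by a per-head importance score and picks the lowest-scoring fraction. The scores, the function name `select_heads_to_prune`, and the 25% fraction are all hypothetical; how importance is actually measured depends on the method used.

```python
def select_heads_to_prune(importance, prune_fraction=0.25):
    """Return {layer: [head indices]} for the globally least important heads.

    importance[layer][head] is a hypothetical per-head score (higher = more useful).
    """
    scored = [
        (score, layer, head)
        for layer, heads in enumerate(importance)
        for head, score in enumerate(heads)
    ]
    scored.sort()  # lowest scores first
    num_to_prune = int(len(scored) * prune_fraction)
    plan = {}
    for _, layer, head in scored[:num_to_prune]:
        plan.setdefault(layer, []).append(head)
    return {layer: sorted(plan[layer]) for layer in sorted(plan)}

# Made-up scores for a toy model with 2 layers of 4 heads each
importance = [
    [0.90, 0.05, 0.40, 0.70],
    [0.10, 0.80, 0.60, 0.02],
]
print(select_heads_to_prune(importance, prune_fraction=0.25))
```

Because whole heads are removed, this is structured pruning: the remaining computation stays dense, which is usually easier to speed up on standard hardware than scattered zero weights.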

Example code for pruning might look like the following, where we remove a given percentage of the weights from a GPT2 model:


import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.pytorch_utils import Conv1D

# Load the pretrained model and tokenizer
model_name = "gpt2"  # You can replace this with any other LLM model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 implements its attention and MLP projections as Conv1D modules rather than
# torch.nn.Linear, so both types must be matched or nothing in the blocks gets pruned
PRUNABLE_TYPES = (torch.nn.Linear, Conv1D)
# Define a pruning method (here we use L1 unstructured pruning)
def prune_model_layer(layer, amount=0.3):
    # Zero out the fraction `amount` of weights with the smallest absolute value in each projection
    for name, module in layer.named_modules():
        if isinstance(module, PRUNABLE_TYPES):
            prune.l1_unstructured(module, name="weight", amount=amount)
            print(f"Pruned layer {name} with amount {amount}")
# Apply pruning to all transformer layers in the model
for layer in model.transformer.h:
    prune_model_layer(layer, amount=0.3)  # Prune 30% of the weights
# Check the sparsity of the model (the unpruned lm_head is counted too,
# so the overall figure is lower than 30%)
total_params = 0
pruned_params = 0
for name, module in model.named_modules():
    if isinstance(module, PRUNABLE_TYPES):
        total_params += module.weight.nelement()
        pruned_params += torch.sum(module.weight == 0).item()
print(f"Total parameters: {total_params}")
print(f"Pruned parameters: {pruned_params}")
print(f"Sparsity: {pruned_params / total_params:.2%}")
# Test the pruned model on a sample input
input_text = "Pruning is an effective way to compress language models."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
# Generate text using the pruned model
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50)
# Decode and print the generated text
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)

Model distillation

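Model distillation is a technique for transferring "knowledge" from a large, more complex model (the teacher) to a small, simpler model with fewer parameters (the student). It lets the student reach performance close to the teacher's while being smaller or faster, which is the goal set out at the start of this article.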

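The process starts with a large pretrained LLM acting as the teacher, such as GPT2 or Llama. This model is usually very accurate but needs substantial compute for inference.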

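A smaller, more efficient model (the "student") is then trained to imitate the teacher's behavior, for example miniGPT2 or TinyLlama (although TinyLlama was built in a different way). The student learns both from the original training data and from the outputs generated by the teacher (soft labels).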
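Learning from both sources usually means combining two loss terms. The sketch below computes such a combined loss for a single prediction in plain Python: a KL-divergence term between temperature-softened teacher and student distributions, and a cross-entropy term against the true label. The temperature of 2.0 and the weighting factor `alpha` of 0.5 are illustrative defaults, and the logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, temperature=2.0, alpha=0.5):
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions, scaled by T^2
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft) if t > 0)
    soft_loss = kl * temperature ** 2
    # Cross-entropy of the student's unsoftened prediction against the hard label
    hard_loss = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

student = [1.0, 0.5, 0.2]  # made-up logits over a 3-token vocabulary
teacher = [3.0, 1.0, 0.1]
print(f"student differs from teacher: {distillation_loss(student, teacher, true_label=0):.4f}")
print(f"student matches teacher:      {distillation_loss(teacher, teacher, true_label=0):.4f}")
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy term remains.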

Below is an example of the teacher-student interaction in Python, with GPT2 as the teacher:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch.nn.functional as F

# Load the teacher (large) and student (smaller) models
teacher_model_name = "gpt2"  # You can replace this with any large LLM
# A smaller variant to act as the student. "sshleifer/tiny-gpt2" is an assumption here:
# a tiny Hub checkpoint that shares GPT-2's vocabulary (a bare "tiny-gpt2" lacks a namespace)
student_model_name = "sshleifer/tiny-gpt2"
# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the teacher model and tokenizer (the teacher is only used for inference)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name).to(device).eval()
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
# Load the student model and tokenizer
student_model = AutoModelForCausalLM.from_pretrained(student_model_name).to(device)
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name)
# Load a dataset for training (e.g., Wikitext for language modeling)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# Set training parameters
learning_rate = 5e-5
epochs = 3
optimizer = torch.optim.AdamW(student_model.parameters(), lr=learning_rate)
# Set temperature for softening probabilities
temperature = 2.0
alpha = 0.5  # Weighting factor for combining loss functions
# Training loop for knowledge distillation
for epoch in range(epochs):
    for i, example in enumerate(dataset):
        # Get the input text
        input_text = example["text"]
        
        # Skip empty lines
        if not input_text.strip():
            continue
        
        # Tokenize the input text for the teacher and student models
        # (no padding: each step handles a single sequence, and GPT-2 defines no pad token)
        teacher_inputs = teacher_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=32).to(teacher_model.device)
        student_inputs = student_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=32).to(student_model.device)
        # Next-token prediction needs at least two tokens
        if student_inputs["input_ids"].size(1) < 2:
            continue
        
        # Get teacher predictions (soft labels)
        with torch.no_grad():
            teacher_outputs = teacher_model(**teacher_inputs)
            teacher_logits = teacher_outputs.logits / temperature
            teacher_probs = F.softmax(teacher_logits, dim=-1)
        
        # Get student predictions
        student_outputs = student_model(**student_inputs)
        student_logits = student_outputs.logits
        vocab_size = student_logits.size(-1)
        
        # Calculate distillation loss (Kullback-Leibler divergence), flattened to
        # (tokens, vocab) so that "batchmean" averages over token positions
        distillation_loss = F.kl_div(
            input=F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab_size),
            target=teacher_probs.reshape(-1, vocab_size),
            reduction="batchmean",
            log_target=False
        ) * (temperature ** 2)
        
        # Calculate student task loss (cross-entropy with true labels);
        # the logits at position t are scored against the token at position t + 1
        shift_logits = student_logits[:, :-1, :]
        target_labels = student_inputs["input_ids"][:, 1:]
        task_loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), target_labels.reshape(-1))
        
        # Combined loss
        loss = alpha * distillation_loss + (1 - alpha) * task_loss
        
        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Print training progress
        if i % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Step [{i}], Loss: {loss.item():.4f}")
print("Knowledge distillation completed!")

Weight sharing

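By sharing parameters across several model components, we can reduce a neural network's memory footprint. When some or all layers share one set of weights instead of each layer or component having its own, the number of parameters the model has to keep drops considerably. The architecture can be defined with shared weights up front, or weight sharing can be applied after training as a compression technique. One possibility, for example, is to cluster the weights as in the code below: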


import torch
import numpy as np
from sklearn.cluster import KMeans

def apply_weight_sharing(model, num_clusters=16):
    # Note: running k-means over every tensor of a full LLM is slow; this is a demonstration
    # Iterate through each parameter in the model
    for name, param in model.named_parameters():
        # Only consider trainable tensors with enough values to form the clusters
        if param.requires_grad and param.numel() >= num_clusters:
            # Flatten the weights into a column vector for clustering
            weights = param.data.cpu().numpy().reshape(-1, 1)
            # Apply k-means clustering (fixed seed for reproducibility)
            kmeans = KMeans(n_clusters=num_clusters, random_state=0)
            kmeans.fit(weights)
            # Replace weights with their corresponding cluster centroids
            cluster_centroids = kmeans.cluster_centers_
            labels = kmeans.labels_
            # Map the original weights to their shared values with one vectorized lookup
            shared_weights = cluster_centroids[labels].reshape(param.data.shape)
            # Update the model's parameters with the shared weights
            param.data = torch.tensor(shared_weights, dtype=param.data.dtype).to(param.device)
    return model
# Example usage with a pre-trained model
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
model = apply_weight_sharing(model, num_clusters=16)  # Apply weight sharing with 16 clusters
print("Weight sharing applied to the model!")
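Note that after clustering, each tensor still holds one full-precision float per weight, so the model is not smaller yet. The saving appears when a tensor is stored as a small codebook of centroids plus one short index per weight. The sketch below illustrates that storage scheme and the resulting arithmetic in plain Python; the codebook values, example weights, and function names are hypothetical.

```python
import math

def encode_with_codebook(weights, codebook):
    """Replace each weight by the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(w - codebook[i])) for w in weights]

def storage_bits(num_weights, num_clusters, float_bits=32):
    """Return (dense_bits, shared_bits) for one tensor."""
    # Each index needs enough bits to address every cluster
    index_bits = max(1, math.ceil(math.log2(num_clusters)))
    dense = num_weights * float_bits
    shared = num_weights * index_bits + num_clusters * float_bits
    return dense, shared

codebook = [-0.5, 0.0, 0.5, 1.0]  # hypothetical centroids
weights = [-0.45, 0.02, 0.48, 0.97, 0.51, -0.6]
indices = encode_with_codebook(weights, codebook)
decoded = [codebook[i] for i in indices]  # what the layer would compute with
dense, shared = storage_bits(num_weights=1_000_000, num_clusters=16)
print(indices, decoded)
print(f"compression ratio: {dense / shared:.2f}x")
```

With 16 clusters each index fits in 4 bits, so a 32-bit tensor shrinks by close to a factor of eight, at the cost of a lookup when the weights are used.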