Chatting with Images Using the Llama 3.2-Vision Multimodal LLM

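I. Introduction

Combining vision capabilities with large language models (LLMs) is reshaping computer vision through multimodal LLMs (MLLMs). These models accept both text and visual input and show strong image understanding and reasoning. They were previously accessible only through APIs, but recent open-source releases can be run locally, which makes them more attractive for production environments.

In this tutorial, we will learn how to chat with images using the open-source Llama 3.2-Vision model and explore its OCR, image understanding, and reasoning capabilities. All of the code is provided in a Colab notebook.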

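II. Background

Llama, short for "Large Language Model Meta AI", is a family of advanced LLMs developed by Meta. The latest release, Llama 3.2, introduces vision capabilities. The vision variants come in two sizes, 11B and 90B parameters. (The edge-device models in the Llama 3.2 release are the lightweight 1B and 3B text-only variants; the vision models need a capable GPU, as described in the preparation section.) With a context window of up to 128k tokens and support for high-resolution images of up to 1120x1120 pixels, Llama 3.2 can process complex visual and textual information.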

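III. Architecture

The Llama family of models are decoder-only Transformers. Llama 3.2-Vision is built on top of the pretrained Llama 3.1 text-only model. It uses a standard dense autoregressive Transformer architecture that does not deviate significantly from its predecessors, Llama and Llama 2.

To support vision tasks, Llama 3.2 uses a pretrained vision encoder (ViT-H/14) to extract image representation vectors and integrates these representations into the frozen language model through a vision adapter. The adapter consists of a series of cross-attention layers that let the model focus on the parts of the image relevant to the text being processed [1].

The adapter is trained on text-image pairs to align image representations with language representations. During adapter training, the image encoder's parameters are updated while the language model's parameters stay frozen, which preserves the existing language capabilities.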
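To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy. It is not Llama's actual implementation: the toy dimensions, the random weights, and the function name `cross_attention` are all assumptions chosen for illustration. Queries come from the text states, while keys and values come from the image states, so each text position produces a weighting over image patches.

```python
import numpy as np


def cross_attention(text_states, image_states, w_q, w_k, w_v):
    """Single-head cross-attention: text positions attend over image patches."""
    q = text_states @ w_q    # queries from text,  shape (T, d)
    k = image_states @ w_k   # keys from image,    shape (P, d)
    v = image_states @ w_v   # values from image,  shape (P, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])               # shape (T, P)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over patches
    return weights @ v, weights


# Toy sizes (hypothetical): 4 text tokens, 6 image patches, hidden size 8
rng = np.random.default_rng(0)
T, P, d = 4, 6, 8
text = rng.normal(size=(T, d))
image = rng.normal(size=(P, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_attention(text, image, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1 and can be read as how strongly one text position attends to each image patch.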

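Figure: Llama 3.2-Vision architecture. The vision module (green) is integrated into the frozen language model (pink).

This design lets Llama 3.2 perform well on multimodal tasks while retaining strong text-only performance. The resulting model shows impressive capabilities on tasks that require both image and language understanding, and it lets users interact conversationally with their visual inputs. With the architecture covered, we can move on to the practical implementation. First, though, some preparation is needed.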

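IV. Preparation

Before running Llama 3.2-Vision 11B on Google Colab, we need to complete the following steps:

(1) GPU setup:

A high-end GPU with at least 22GB of VRAM is recommended for efficient inference [2].
For Google Colab users: navigate to "Runtime" > "Change runtime type" > select "A100 GPU". Note that high-end GPUs may not be available to free-tier Colab users.

(2) Model permissions: request access to the Llama 3.2 model.

(3) Hugging Face setup:

If you do not have a Hugging Face account yet, create one.
If you do not have an access token yet, generate one from your Hugging Face account.
For Google Colab users, store the Hugging Face token in Google Colab Secrets as a secret environment variable named "HF_TOKEN".

(4) Install the required libraries.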
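As a quick sanity check after the steps above, the following sketch reports which packages are installed and looks up the access token. The package list and the helper names (`installed_versions`, `get_hf_token`) are assumptions for illustration and are not taken from the original notebook. It also assumes that Mllama support needs a recent `transformers` release (4.45 or later, to the best of my knowledge).

```python
import os
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def get_hf_token(name="HF_TOKEN"):
    """Read the token from Colab Secrets when available, else from the environment."""
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        return os.environ.get(name)
    return userdata.get(name)


if __name__ == "__main__":
    # Assumed package list for this tutorial; adjust to your environment
    print(installed_versions(["torch", "transformers", "accelerate", "tqdm"]))
    print("token found:", get_hf_token() is not None)
```

Any package reported as `None` still needs to be installed. The retrieved token can then be passed to `huggingface_hub.login(token=...)` before downloading the gated model.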

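V. Loading the model

With the environment set up and the necessary permissions obtained, we will use the Hugging Face Transformers library to instantiate the model and its associated processor. The processor is responsible for preparing inputs for the model and formatting its outputs.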

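import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the weights in bfloat16 and let device_map="auto" place them on the available device(s)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto")

# The processor prepares image and text inputs and decodes model outputs
processor = AutoProcessor.from_pretrained(model_id)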
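A rough back-of-the-envelope calculation shows why bfloat16 matters for fitting the model on a single GPU. The sketch below counts weight memory only and ignores activations and the KV cache. The parameter count is the nominal "11B" figure rather than the exact number, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the model weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 11e9  # assumption: nominal "11B" size, not the exact parameter count

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # bfloat16 uses 2 bytes per parameter
fp32_gb = weight_memory_gb(NOMINAL_PARAMS, 4)  # float32 uses 4 bytes per parameter
print(f"bfloat16: ~{bf16_gb:.0f} GB, float32: ~{fp32_gb:.0f} GB")
```

Under these assumptions the bfloat16 weights alone take about 22 GB, which is consistent with the VRAM recommendation given earlier.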

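1. Expected chat template

The chat template maintains context by storing the conversation history between the "user" (us) and the "assistant" (the AI model). The history is structured as a list named messages, in which each dictionary represents one conversation turn, either the user's or the model's. User turns can carry image-text or text-only input, with {"type": "image"} marking an image input. For example, after a few chat iterations, the messages list might look like this: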

messages = [
    {"role": "user",      "content": [{"type": "image"}, {"type": "text", "text": prompt1}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts1}]},
    {"role": "user",      "content": [{"type": "text", "text": prompt2}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts2}]},
    {"role": "user",      "content": [{"type": "text", "text": prompt3}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts3}]}
]

This messages list is later passed to the apply_chat_template() method, which converts the conversation into a single tokenizable string in the format the model expects.
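The structure above can be built with a couple of small helpers. This is a sketch only: `user_turn`, `assistant_turn`, and `count_image_placeholders` are hypothetical names, not part of Transformers. Counting the placeholders is a useful check under the assumption that the processor expects one `{"type": "image"}` entry per image it receives.

```python
def user_turn(prompt, with_image=False):
    """Build a user turn, optionally preceded by one image placeholder."""
    content = [{"type": "image"}] if with_image else []
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Build an assistant turn holding the model's text reply."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


def count_image_placeholders(messages):
    """Count {"type": "image"} entries across the whole conversation."""
    return sum(item["type"] == "image" for turn in messages for item in turn["content"])


history = [
    user_turn("Describe the image.", with_image=True),
    assistant_turn("A butterfly emerging from its chrysalis."),
    user_turn("What happens next?"),
]
print(count_image_placeholders(history))
```

Only the first user turn carries the image placeholder; follow-up turns are text-only and rely on the history for context.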

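2. Main function

In this tutorial, I provide a chat_with_mllm function that enables dynamic conversations with the Llama 3.2 MLLM. The function handles image loading, preprocesses the image and text inputs, generates the model's response, and manages the conversation history to enable chat-style interaction.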

from tqdm import tqdm
from IPython.display import display
from transformers.image_utils import load_image

def chat_with_mllm(model, processor, prompt, images_path=None, do_sample=False, temperature=0.1,
                   show_image=False, max_new_tokens=512, messages=None, images=None):

    # Use None defaults: mutable defaults ([]) would be shared across calls
    images_path = [] if images_path is None else images_path
    messages = [] if messages is None else messages
    images = [] if images is None else images

    # Ensure images_path is a list
    if not isinstance(images_path, list):
        images_path = [images_path]

    # Load images only if none were passed in from a previous turn
    if len(images) == 0 and len(images_path) > 0:
        for image_path in tqdm(images_path):
            image = load_image(image_path)
            images.append(image)
            if show_image:
                display(image)

    # If starting a new conversation about an image
    if len(messages) == 0:
        messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]

    # If continuing the conversation on the image
    else:
        messages.append({"role": "user", "content": [{"type": "text", "text": prompt}]})

    # Process input data
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=images, text=text, return_tensors="pt").to(model.device)

    # Generate response; temperature is only used when sampling is enabled
    generation_args = {"max_new_tokens": max_new_tokens, "do_sample": do_sample}
    if do_sample:
        generation_args["temperature"] = temperature
    generate_ids = model.generate(**inputs, **generation_args)

    # Keep only the newly generated tokens, then decode without special tokens
    # (this drops the end-of-turn token without cutting a real token when max_new_tokens is hit)
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    generated_texts = processor.decode(generate_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Append the model's response to the conversation history
    messages.append({"role": "assistant", "content": [{"type": "text", "text": generated_texts}]})

    return generated_texts, messages, images
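The slicing step before decoding deserves a short illustration. For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the function drops the first `input_ids.shape[1]` positions. The toy sketch below uses plain Python lists and made-up token ids; `strip_prompt` is a hypothetical helper, not a Transformers API.

```python
def strip_prompt(sequences, prompt_len):
    """Drop the first prompt_len tokens from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


prompt_ids = [101, 7, 8, 9]              # hypothetical prompt token ids
generated = [prompt_ids + [42, 43, 2]]   # generate() output: prompt followed by new tokens

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[42, 43, 2]]
```

Without this step, the decoded response would repeat the entire prompt, including the chat template markup, before the model's answer.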

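VI. Chatting with Llama

1. Butterfly image example

In our first example, we will chat with Llama 3.2 about an image of a butterfly emerging from its chrysalis. Because Llama 3.2-Vision does not support system prompts when images are used, we prepend the instructions directly to the user prompt to guide the model's responses. Setting do_sample=True and temperature=0.2 allows slight randomness while keeping the responses consistent. For deterministic answers, set do_sample=False. The messages parameter, which holds the chat history, is initially empty, and so is the images parameter.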

# img_path should point to the butterfly image (local path or URL)
instructions = "Respond concisely in one sentence."
prompt = instructions + " Describe the image."  # leading space separates the two sentences

response, messages, images = chat_with_mllm(model, processor, prompt,
                                            images_path=[img_path],
                                            do_sample=True,
                                            temperature=0.2,
                                            show_image=True,
                                            messages=[],
                                            images=[])

# Output:  "The image depicts a butterfly emerging from its chrysalis, 
#           with a row of chrysalises hanging from a branch above it."
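To see why a low temperature such as 0.2 keeps responses nearly deterministic, consider how temperature rescales the model's scores before sampling. The sketch below applies a temperature-scaled softmax to three made-up logits. It illustrates the general technique and is not the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the highest-scoring token, so sampled outputs vary less from run to run.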

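As we can see, the output is accurate and concise, which indicates that the model understood the image. In the next chat iteration, we pass a new prompt together with the chat history (messages) and the loaded images (images). The new prompt is designed to probe Llama 3.2's reasoning ability: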

prompt = instructions + " What would happen to the chrysalis in the near future?"
response, messages, images = chat_with_mllm(model, processor, prompt,
                                            images_path=[img_path],
                                            do_sample=True,
                                            temperature=0.2,
                                            show_image=False,
                                            messages=messages,
                                            images=images)

# Output: "The chrysalis will eventually hatch into a butterfly."
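The pattern of feeding messages and images back into each call can be wrapped in a small loop. The sketch below is an assumption-laden illustration: `run_conversation` is a hypothetical helper, and `echo_chat` is a stand-in for the model so the example runs without a GPU. With the real function, something like `functools.partial(chat_with_mllm, model, processor)` could be passed as `chat_fn`.

```python
def run_conversation(chat_fn, prompts, **chat_kwargs):
    """Send prompts in order, threading messages and images between turns."""
    messages, images, responses = [], [], []
    for prompt in prompts:
        response, messages, images = chat_fn(
            prompt, messages=messages, images=images, **chat_kwargs)
        responses.append(response)
    return responses, messages


def echo_chat(prompt, messages, images, **_):
    """Stand-in for the model: replies with the prompt in upper case."""
    messages = messages + [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
    reply = prompt.upper()
    messages.append({"role": "assistant", "content": [{"type": "text", "text": reply}]})
    return reply, messages, images


responses, history = run_conversation(echo_chat, ["hello", "and then?"])
print(responses)
print(len(history))
```

Each turn adds one user entry and one assistant entry to the history, which is what lets later prompts refer back to earlier answers.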

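We continued this conversation in the provided Colab notebook and obtained the following exchange:

[Image: the continued conversation about the butterfly image]

The conversation highlights the model's ability to understand the image by describing the scene accurately. It also shows the model's reasoning: by connecting pieces of information logically, it correctly inferred what would happen to the chrysalis and explained why some chrysalises are brown while others are green.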

2. Meme image example

In this example, I show the model a meme I created myself, to assess Llama's OCR capabilities and to find out whether it gets my sense of humor.

# img_path should point to the meme image (local path or URL)
instructions = "You are a computer vision engineer with sense of humor."
prompt = instructions + " Can you explain this meme to me?"  # leading space separates the two sentences

response, messages, images = chat_with_mllm(model, processor, prompt,
                                            images_path=[img_path],
                                            do_sample=True,
                                            temperature=0.5,
                                            show_image=True,
                                            messages=[],
                                            images=[])

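This is the input meme:

[Image: the input meme]

This is the model's response:

[Image: the model's response]

As we can see, the model shows strong OCR capabilities and understands the meaning of the text in the image. As for its sense of humor: what do you think, did it get it? Did you?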