本地构建Llama 3.2-Vision多模态LLM聊天应用实战-51CTO.COM

译者 | 朱先忠

审校 | 重楼

本文将以实战案例探讨如何在类似聊天的模式下从本地构建Llama3.2-Vision模型，并在Colab笔记本上探索其多模态技能。

简介

视觉功能与大型语言模型（LLM）的集成正在通过多模态LLM（MLLM）彻底改变计算机视觉领域。这些模型结合了文本和视觉输入，在图像理解和推理方面表现出令人印象深刻的能力。虽然这些模型以前只能通过API访问，但是最近发布的一些开源项目已经支持在本地执行，这使得它们对生产环境中一线应用更具吸引力。

在本文中，我们将学习如何使用开源Llama3.2-Vision模型与我们提供的图像聊天，其间你会惊叹于该模型的OCR、图像理解和推理能力。示例工程的所有代码都将方便地提供在一个Colab笔记本文件中。

Llama 3.2-Vision模型

背景

Llama是“大型语言模型MetaAI”的缩写，是Meta公司开发的一系列高级大语言模型。他们的产品Llama 3.2推出了先进的视觉功能。视觉变体有两种大小：11B和90B参数，可在边缘设备上进行推理。Llama 3.2具有高达128k个标记的上下文窗口，支持高达1120x1120像素的高分辨率图像，可以处理复杂的视觉和文本信息。

架构

Llama系列模型是仅使用解码器的转换器。Llama3.2-Vision模型建立在预训练的Llama 3.1纯文本模型之上。它采用标准的密集自回归转换器架构，与其前身Llama和Llama 2并无太大差异。

为了支持视觉任务，Llama 3.2使用预训练的视觉编码器（ViT-H/14）提取图像表示向量，并使用视觉适配器将这些表示集成到冻结语言模型中。适配器由一系列交叉注意层组成，允许模型专注于与正在处理的文本相对应的图像的特定部分（参考文献【1】）。

适配器基于“文本-图像”对进行训练，以使图像表示与语言表示对齐。在适配器训练期间，图像编码器的参数会更新，而语言模型参数保持冻结以保留现有的语言能力。

Llama 3.2-Vision模型架构：视觉模块（绿色）集成到固定语言模型（粉红色）中

这种设计使Llama 3.2在多模态任务中表现出色，同时保持其强大的纯文本性能。生成的模型在需要图像和语言理解的任务中展示了令人印象深刻的能力，并允许用户与他们的视觉输入进行交互式交流。

编码实战

有了对Llama 3.2架构的基本了解后，让我们深入研究其实际实现。但首先，我们需要做一些准备工作。

准备

在Google Colab上运行Llama3.2—Vision11B之前，我们需要做一些准备：

1.GPU设置

建议使用至少具有22GB VRAM的高端GPU进行高效推理（参考文献【2】）。
对于Google Colab用户来说：需要导航至“运行时”>“更改运行时类型”>“A100 GPU”。请注意，高端GPU可能不适用于免费的Colab用户。

2.模型权限

请求访问Llama 3.2模型在链接https://www.llama.com/llama-downloads/处提供。

3.HuggingFace设置

如果你还没有Hugging Face账户，请在链接https://huggingface.co/join处创建一个。

如果你没有Hugging Face账户，请在链接https://huggingface.co/join处生成访问令牌。

对于Google Colab用户，请在谷歌Colab Secrets中将Hugging Face令牌设置为名为“HF_TOKEN”的秘密环境变量。

4.安装所需的库

加载模型

设置环境并获得必要的权限后，我们将使用Hugging Face转换库来实例化模型及其相关的处理器。处理器负责为模型准备输入并格式化其输出。

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto")

processor = AutoProcessor.from_pretrained(model_id)1.
2.
3.
4.
5.
6.
7.
8.

预期的聊天模板

聊天模板通过存储“用户”（我们）和“助手”（AI模型）之间的交流，通过对话历史记录来维护上下文。对话历史记录的结构为一个称为消息的字典列表，其中每个字典代表一个对话轮次，包括用户和模型响应。用户轮次可以包括图像文本或纯文本输入，其中{"type": "image"}表示图像输入。

例如，经过几次聊天迭代后，消息列表可能如下所示：

messages = [
{"role": "user",      "content": [{"type": "image"}, {"type": "text", "text": prompt1}]},
{"role": "assistant", "content": [{"type": "text", "text": generated_texts1}]},
{"role": "user",      "content": [{"type": "text", "text": prompt2}]},
{"role": "assistant", "content": [{"type": "text", "text": generated_texts2}]},
{"role": "user",      "content": [{"type": "text", "text": prompt3}]},
{"role": "assistant", "content": [{"type": "text", "text": generated_texts3}]}
]1.
2.
3.
4.
5.
6.
7.
8.

此消息列表随后会传递给apply_chat_template()方法，以便将对话转换为模型期望格式的单个可标记字符串。

主函数

在本教程中，我提供了一个chat_with_mllm函数，该函数可实现与Llama 3.2 MLLM的动态对话。此函数能够处理图像加载、预处理图像和文本输入、生成模型响应并管理对话历史记录以启用聊天模式交互。

def chat_with_mllm (model, processor, prompt, images_path=[],do_sample=False, temperature=0.1, show_image=False, max_new_tokens=512, messages=[], images=[]):

# 确保列表形式：
if not isinstance(images_path, list):
images_path =  [images_path]

#加载图像
if len (images)==0 and len (images_path)>0:
for image_path in tqdm (images_path):
image = load_image(image_path)
images.append (image)
if show_image:
display ( image )

#如果开始了一个关于一个图像的新的对话
if len (messages)==0:
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]

# 如果继续对图像进行对话
else:
messages.append ({"role": "user", "content": [{"type": "text", "text": prompt}]})

# 处理输入数据
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=images, text=text, return_tensors="pt", ).to(model.device)

    生成相应
generation_args = {"max_new_tokens": max_new_tokens, "do_sample": True}
if do_sample:
generation_args["temperature"] = temperature
generate_ids = model.generate(**inputs,**generation_args)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:-1]
generated_texts = processor.decode(generate_ids[0], clean_up_tokenization_spaces=False)

# 附加该模型对对话历史记录的响应
messages.append ({"role": "assistant", "content": [  {"type": "text", "text": generated_texts}]})

return generated_texts, messages, images1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.

与Llama聊天

蝴蝶图像示例

在我们的第一个示例中，我们将与Llama3.2进行聊天，讨论一张孵化蝴蝶的图像。由于Llama3.2-Vision在使用图像时不支持使用系统提示进行提示，因此我们将直接在用户提示中附加说明，以指导模型的响应。通过设置do_sample=True和temperature=0.2，我们可以在保持响应一致性的同时实现轻微的随机性。对于固定答案，你可以设置do_sample==False。保存聊天历史记录的messages参数最初为空，如images参数中所示：

instructions = "Respond concisely in one sentence."
prompt = instructions + "Describe the image."

response, messages,images= chat_with_mllm ( model, processor, prompt,
images_path=[img_path],
do_sample=True,
temperature=0.2,
show_image=True,
messages=[],
images=[])

# 输出："The image depicts a butterfly emerging from its chrysalis, 
#           with a row of chrysalises hanging from a branch above it."1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.

图片来自Pixabay（https://www.pexels.com/photo/brown-and-white-swallowtail-butterfly-under-white-green-and-brown-cocoon-in-shallow-focus-lens-63643/）。

我们可以看到，输出准确而简洁，表明模型有效地理解了图像。

对于下一次聊天迭代，我们将传递一个新提示以及聊天历史记录和图像文件。新提示旨在评估Llama3.2的推理能力：

prompt = instructions + "What would happen to the chrysalis in the near future?"
response, messages, images= chat_with_mllm ( model, processor, prompt,
images_path=[img_path,],
do_sample=True,
temperature=0.2,
show_image=False,
messages=messages,
images=images)

# 输出: "The chrysalis will eventually hatch into a butterfly."1.
2.
3.
4.
5.
6.
7.
8.
9.
10.

我们在提供的Colab笔记本中继续此聊天，并得到了以下对话：

对话通过准确描述场景，突出了模型的图像理解能力。它还展示了它的推理能力，通过逻辑地连接信息来正确推断蛹会发生什么，并解释为什么有些蛹是棕色的，而有些蛹是绿色的。

模因图像示例

在这个例子中，我将向模型展示我自己创建的模因，以评估Llama的OCR能力并确定它是否理解我的幽默感。

instructions = "You are a computer vision engineer with sense of humor."
prompt = instructions + "Can you explain this meme to me?"


response, messages,images= chat_with_mllm ( model, processor, prompt,
images_path=[img_path,],
do_sample=True,
temperature=0.5,
show_image=True,
messages=[],
images=[])1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.

这是输入模因：