基于BLIP-2和Gemini开发多模态搜索引擎代理原创

发布于 2025-3-6 08:37

浏览

0收藏

本文将利用基于文本和图像的联合搜索功能来开发一个多模态时装辅助代理应用程序。

简介

传统模型只能处理单一类型的数据，例如文本、图像或表格数据。多模态是人工智能研究界的一个流行概念，指的是模型能够同时从多种类型的数据中学习。这项新技术（并不是很新，但在过去几个月里有了显著的改进）有许多潜在的应用，它将改变许多产品的用户体验。

这方面一个很好的例子是未来搜索引擎的新工作方式：用户可以使用多种方式输入查询，例如文本、图像、音频等。另一个例子是改进人工智能驱动的客户支持系统，以实现语音和文本输入。在电子商务中，他们通过允许用户使用图像和文本进行搜索来增强产品发现。我们将在本文中使用后者作为案例研究。

前沿的一些人工智能研究实验室每月都会推出几种支持多模态的模型。例如，OpenAI公司的CLIP和DALL-E；Salesforce公司的BLIP-2将图像和文本结合在一起；Meta的ImageBind将多模态概念扩展到六种模态（文本、音频、深度、温度、图像和惯性测量单元）。

在本文中，我们将通过解释BLIP-2的架构、损失函数的工作方式及其训练过程来对它展开详细探索。我们还提供了一个实际用例，该用例结合了BLIP-2和Gemini两种模型，以创建一个多模态时尚搜索代理，该代理可以帮助客户根据文本或文本和图像组合提示找到最佳服装。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图1：多模态搜索代理（图片由作者使用Gemini提供）

与往常一样，本文对应的示例代码可在我们的GitHub代码仓库上获取。

BLIP-2：多模态模型

BLIP-2（引导式语言图像预训练）（【引文1】）是一种视觉语言模型，旨在解决诸如视觉问答或基于两种模态输入（图像和文本）的多模态推理等任务。正如我们将在下面看到的，该模型是为了解决视觉语言领域的两个主要挑战而开发的：

使用冻结的预训练视觉编码器和LLM降低计算成本，与视觉和语言网络的联合训练相比，大幅减少所需的训练资源。
通过引入Q-Former来改善视觉语言对齐。Q-Former使视觉和文本嵌入更加接近，从而提高了推理任务的性能和执行多模态检索的能力。

架构

BLIP-2的架构采用模块化设计，集成了三个模块：

Visual Encoder：一种冻结的视觉模型，例如ViT，它从输入图像中提取视觉嵌入（然后用于下游任务）。
查询转换器（Q-Former）：是此架构的关键。它由一个可训练的轻量级转换器组成，充当视觉模型和语言模型之间的中间层。它负责从视觉嵌入生成上下文化查询，以便语言模型能够有效地处理它们。
LLM：一种冻结的预训练LLM，可处理精炼的视觉嵌入以生成文本描述或答案。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图2：BLIP-2架构（图片来自作者本人）

损失函数

BLIP-2有三个损失函数来训练Q-Former模块：

图像-文本对比损失（【引文2】）：通过最大化成对的图像-文本表示的相似性，同时推开不相似的图像-文本对，来强制视觉和文本嵌入之间的对齐。
图像-文本匹配损失（【引文3】）：一种二元分类损失，旨在通过预测文本描述是否与图像匹配（正，即目标=1）或不匹配（负，即目标=0）来使模型学习细粒度对齐。
基于图像的文本生成损失（【引文4】）：是LLM中使用的交叉熵损失，用于预测序列中下一个标记的概率。Q-Former架构不允许图像嵌入和文本标记之间进行交互；因此，必须仅基于视觉信息生成文本，从而迫使模型提取相关的视觉特征。

对于图像文本对比损失和图像文本匹配损失，作者使用了批量负采样技术。这意味着，如果我们的批量大小为512，则每个图像文本对都有一个正样本和511个负样本。这种方法提高了效率，因为负样本是从批次中抽取的，不需要搜索整个数据集。它还提供了一组更加多样化的比较，从而实现更好的梯度估计和更快的收敛。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图3：训练损失解释（图片来自作者本人）

训练过程

BLIP-2的训练包含两个阶段：

第1阶段——引导视觉语言表征：

该模型接收图像作为输入，然后使用冻结的视觉编码器将其转换为嵌入。
除了这些图像，模型还会接收它们的文本描述，并将其转换为嵌入。
Q-Former使用图像文本对比损失进行训练，确保视觉嵌入与其对应的文本嵌入紧密对齐，并远离不匹配的文本描述。同时，图像文本匹配损失通过学习对给定文本是否正确描述图像进行分类，帮助模型开发细粒度表示。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图4：第一阶段训练过程（图片来自作者本人）

第2阶段——引导视觉到语言的生成：

预训练语言模型被集成到架构中，以根据先前学习的表示生成文本。
通过使用基于图像的文本生成损失，将重点从对齐转移到文本生成，从而提高模型的推理和文本生成能力。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图5：第二阶段训练过程（图片由作者提供）

使用BLIP-2和Gemini创建多模态时尚搜索代理

在本节中，我们将利用BLIP-2的多模态功能构建一个时尚代理搜索代理，该代理可以接收输入的文本和/或图像并返回建议。对于代理的对话功能，我们将使用VertexAI中托管的Gemini 1.5 Pro；对于界面，我们将构建一个Streamlit应用实现。

本实例中使用的时尚数据集是根据MIT许可证授权的，可以通过以下链接访问：时尚产品图像数据集，它包含超过44,000张时尚产品图像。

实现此目的的第一步是设置一个向量数据库。这使代理能够根据商店中可用商品的图像嵌入以及输入中的文本或图像嵌入执行向量化搜索。我们使用Docker和docker-compose来帮助我们设置环境：

Docker-Compose与Postgres（数据库）和允许向量化搜索的PGVector扩展一起使用。

services:
  postgres:
    container_name: container-pg
    image: ankane/pgvector
    hostname: localhost
    ports:
      - "5432:5432"
    env_file:
      - ./env/postgres.env
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  pgadmin:
    container_name: container-pgadmin
    image: dpage/pgadmin4
    depends_on:
      - postgres
    ports:
      - "5050:80"
    env_file:
      - ./env/pgadmin.env
    restart: unless-stopped

volumes:
  postgres-data:

Postgres对应的.env文件定义部分，其中包含用于登录数据库的变量。

POSTGRES_DB=postgres
POSTGRES_USER=admin
POSTGRES_PASSWORD=root

Pgadmin对应的.env文件定义部分，其中包含用于登录UI以手动查询数据库的变量（可选）。

PGADMIN_DEFAULT_EMAIL=admin@admin.com 
PGADMIN_DEFAULT_PASSWORD=root

连接功能对应的.env文件部分，包含使用Langchain连接到PGVector所需的所有组件。

DRIVER=psycopg
HOST=localhost
PORT=5432
DATABASE=postgres
USERNAME=admin
PASSWORD=root

一旦设置并运行Vector DB（docker-compose up -d），就该创建代理和工具来执行多模态搜索了。我们构建了两个代理来解决此场景应用：一个用于了解用户的请求，另一个用于提供建议：

分类器：负责接收来自客户的输入消息并提取用户正在寻找的衣服类别，例如T恤、裤子、鞋子、运动衫或衬衫。它还将返回客户想要的商品数量，以便我们可以从Vector DB中检索准确的数量。

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field

class ClassifierOutput(BaseModel):
    """
    模型输出的数据结构。
    """

    category: list = Field(
        description="A list of clothes category to search for ('t-shirt', 'pants', 'shoes', 'jersey', 'shirt')."
    )
    number_of_items: int = Field(description="The number of items we should retrieve.")

class Classifier:
    """
    用于输入文本分类的分类器类。
    """

    def __init__(self, model: ChatVertexAI) -> None:
        """
        通过创建链来初始化 Chain 类。
        参数:
            model (ChatVertexAI): 大型语言模型 (LLM)。
        """
        super().__init__()

        parser = PydanticOutputParser(pydantic_object=ClassifierOutput)

        text_prompt = """
        You are a fashion assistant expert on understanding what a customer needs and on extracting the category or categories of clothes a customer wants from the given text.
        Text:
        {text}

        Instructions:
        1. Read carefully the text.
        2. Extract the category or categories of clothes the customer is looking for, it can be:
            - t-shirt if the custimer is looking for a t-shirt.
            - pants if the customer is looking for pants.
            - jacket if the customer is looking for a jacket.
            - shoes if the customer is looking for shoes.
            - jersey if the customer is looking for a jersey.
            - shirt if the customer is looking for a shirt.
        3. If the customer is looking for multiple items of the same category, return the number of items we should retrieve. If not specfied but the user asked for more than 1, return 2.
        4. If the customer is looking for multiple category, the number of items should be 1.
        5. Return a valid JSON with the categories found, the key must be 'category' and the value must be a list with the categories found and 'number_of_items' with the number of items we should retrieve.

        Provide the output as a valid JSON object without any additional formatting, such as backticks or extra text. Ensure the JSON is correctly structured according to the schema provided below.
        {format_instructions}

        Answer:
        """

        prompt = PromptTemplate.from_template(
            text_prompt, partial_variables={"format_instructions": parser.get_format_instructions()}
        )
        self.chain = prompt | model | parser

    def classify(self, text: str) -> ClassifierOutput:
        """
        根据文本上下文从模型获取类别。
        参数:
            text (str): 用户消息。
        返回值:
            ClassifierOutput:模型的答案。
        """
        try:
            return self.chain.invoke({"text": text})
        except Exception as e:
            raise RuntimeError(f"Error invoking the chain: {e}")

助手：负责使用从Vector DB中检索到的个性化建议进行回答。在这种情况下，我们还利用Gemini的多模态功能来分析检索到的图像并给出更好的答案。

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field

class AssistantOutput(BaseModel):
    """
    模型输出的数据结构。
    """

    answer: str = Field(description="A string with the fashion advice for the customer.")

class Assistant:
    """
    提供时尚建议的代理类。
    """

    def __init__(self, model: ChatVertexAI) -> None:
        """
        通过创建链来初始化链类。
        参数:
            model (ChatVertexAI): LLM模型.
        """
        super().__init__()

        parser = PydanticOutputParser(pydantic_object=AssistantOutput)

        text_prompt = """
        You work for a fashion store and you are a fashion assistant expert on understanding what a customer needs.
        Based on the items that are available in the store and the customer message below, provide a fashion advice for the customer.
        Number of items: {number_of_items}

        Images of items:
        {items}

        Customer message:
        {customer_message}

        Instructions:
        1. Check carefully the images provided.
        2. Read carefully the customer needs.
        3. Provide a fashion advice for the customer based on the items and customer message.
        4. Return a valid JSON with the advice, the key must be 'answer' and the value must be a string with your advice.

        Provide the output as a valid JSON object without any additional formatting, such as backticks or extra text. Ensure the JSON is correctly structured according to the schema provided below.
        {format_instructions}

        Answer:
        """

        prompt = PromptTemplate.from_template(
            text_prompt, partial_variables={"format_instructions": parser.get_format_instructions()}
        )
        self.chain = prompt | model | parser

    def get_advice(self, text: str, items: list, number_of_items: int) -> AssistantOutput:
        """
        根据文本和项上下文从模型中获取建议。
        参数:
            text (str): 用户消息。
            items (list): 为客户找到的项。
            number_of_items (int): 要检索的项数。
        Returns:
            AssistantOutput: 模型的答案。
        """
        try:
            return self.chain.invoke({"customer_message": text, "items": items, "number_of_items": number_of_items})
        except Exception as e:
            raise RuntimeError(f"Error invoking the chain: {e}")

在工具方面，我们基于BLIP-2定义了一个工具。它由一个函数组成，该函数接收文本或图像作为输入并返回规范化的嵌入。根据输入，嵌入是使用BLIP-2的文本嵌入模型或图像嵌入模型生成的。

from typing import Optional

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from PIL.JpegImagePlugin import JpegImageFile
from transformers import AutoProcessor, Blip2TextModelWithProjection, Blip2VisionModelWithProjection

PROCESSOR = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
TEXT_MODEL = Blip2TextModelWithProjection.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float32).to(
    "cpu"
)
IMAGE_MODEL = Blip2VisionModelWithProjection.from_pretrained(
    "Salesforce/blip2-itm-vit-g", torch_dtype=torch.float32
).to("cpu")

def generate_embeddings(text: Optional[str] = None, image: Optional[JpegImageFile] = None) -> np.ndarray:
    """
    使用Blip2模型从文本或图像中生成嵌入。
    参数:
        text (Optional[str]): 客户输入文本
        image (Optional[Image]): 客户输入图像
    返回值:
        np.ndarray: 嵌入向量
    """
    if text:
        inputs = PROCESSOR(text=text, return_tensors="pt").to("cpu")
        outputs = TEXT_MODEL(**inputs)
        embedding = F.normalize(outputs.text_embeds, p=2, dim=1)[:, 0, :].detach().numpy().flatten()
    else:
        inputs = PROCESSOR(images=image, return_tensors="pt").to("cpu", torch.float16)
        outputs = IMAGE_MODEL(**inputs)
        embedding = F.normalize(outputs.image_embeds, p=2, dim=1).mean(dim=1).detach().numpy().flatten()

    return embedding

请注意，我们使用不同的嵌入模型创建与PGVector的连接，因为它是强制性的，但由于我们将直接存储由BLIP-2生成的嵌入，因此不会使用它。

在下面的循环中，我们遍历所有服装类别，加载图像，并创建要存储在向量数据库中的嵌入并将其附加到列表中。此外，我们将图像的路径存储为文本，以便我们可以在Streamlit应用中展示它。最后，我们存储起类别，以便根据分类器代理预测的类别过滤结果。

import glob
import os

from dotenv import load_dotenv
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector
from PIL import Image

from blip2 import generate_embeddings

load_dotenv("env/connection.env")

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("DRIVER"),
    host=os.getenv("HOST"),
    port=os.getenv("PORT"),
    database=os.getenv("DATABASE"),
    user=os.getenv("USERNAME"),
    password=os.getenv("PASSWORD"),
)

vector_db = PGVector(
    embeddings=HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base"),  # 这对我们的情况来说并不重要
    collection_name="fashion",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

if __name__ == "__main__":

    # 生成图像嵌入
    # 以文本形式保存图像的路径
    # 在元数据中保存类别
    texts = []
    embeddings = []
    metadatas = []

    for category in glob.glob("images/*"):
        cat = category.split("/")[-1]
        for img in glob.glob(f"{category}/*"):
            texts.append(img)
            embeddings.append(generate_embeddings(image=Image.open(img)).tolist())
            metadatas.append({"category": cat})

    vector_db.add_embeddings(texts, embeddings, metadatas)

现在，我们可以构建Streamlit应用程序，以便与我们的代理聊天并征求建议了。聊天从代理询问它可以提供什么帮助开始，并为客户提供一个组件框来编写消息和/或上传文件。

一旦客户回复，工作流程如下：

分类代理可以识别顾客正在寻找哪些类别的衣服以及他们想要多少件。
如果客户上传文件，该文件将被转换为嵌入，我们将根据客户想要的衣服类别和单位数量在向量数据库中寻找类似的项目。
然后，检索到的项目和客户的输入信息被发送给代理代理，以产生与检索到的图像一起呈现的推荐信息。
如果客户没有上传文件，流程是相同的，但我们不是生成用于检索的图像嵌入，而是创建文本嵌入。

import os

import streamlit as st
from dotenv import load_dotenv
from langchain_google_vertexai import ChatVertexAI
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector
from PIL import Image

import utils
from assistant import Assistant
from blip2 import generate_embeddings
from classifier import Classifier

load_dotenv("env/connection.env")
load_dotenv("env/llm.env")

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("DRIVER"),
    host=os.getenv("HOST"),
    port=os.getenv("PORT"),
    database=os.getenv("DATABASE"),
    user=os.getenv("USERNAME"),
    password=os.getenv("PASSWORD"),
)

vector_db = PGVector(
    embeddings=HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base"),  #这对我们的情况来说并不重要
    collection_name="fashion",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

model = ChatVertexAI(model_name=os.getenv("MODEL_NAME"), project=os.getenv("PROJECT_ID"), temperarture=0.0)
classifier = Classifier(model)
assistant = Assistant(model)

st.title("Welcome to ZAAI's Fashion Assistant")

user_input = st.text_input("Hi, I'm ZAAI's Fashion Assistant. How can I help you today?")

uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

if st.button("Submit"):

    #了解用户的要求
    classification = classifier.classify(user_input)

    if uploaded_file:

        image = Image.open(uploaded_file)
        image.save("input_image.jpg")
        embedding = generate_embeddings(image=image)

    else:

        # 在用户不上传图像时创建文本嵌入
        embedding = generate_embeddings(text=user_input)

    # 创建要检索的项目和路径的列表
    retrieved_items = []
    retrieved_items_path = []
    for item in classification.category:
        clothes = vector_db.similarity_search_by_vector(
            embedding, k=classification.number_of_items, filter={"category": {"$in": [item]}}
        )
        for clothe in clothes:
            retrieved_items.append({"bytesBase64Encoded": utils.encode_image_to_base64(clothe.page_content)})
            retrieved_items_path.append(clothe.page_content)

    #得到助理的建议
    assistant_output = assistant.get_advice(user_input, retrieved_items, len(retrieved_items))
    st.write(assistant_output.answer)

    cols = st.columns(len(retrieved_items)+1)
    for col, retrieved_item in zip(cols, ["input_image.jpg"]+retrieved_items_path):
        col.image(retrieved_item)

    user_input = st.text_input("")

else:
    st.warning("Please provide text.")

上面这两个例子运行结果如下所示：

图6显示了一个例子，其中客户上传了一张红色T恤的图片并要求代理商完成服装制作。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图6：文本和图像输入的示例（图片来自作者本人）

图7显示了一个更直接的例子，客户要求代理向他们展示黑色T恤。

基于BLIP-2和Gemini开发多模态搜索引擎代理-AI.x社区

图7：文本输入示例（图片来自作者本人）

结论

多模态AI已不再仅仅是一个研究课题。它正在业界用于重塑客户与公司产品目录的互动方式。在本文中，我们探讨了如何结合使用BLIP-2和Gemini等多模态模型来解决实际问题，并以可扩展的方式为客户提供更加个性化的体验。

其中，我们深入探索了BLIP-2的架构，展示了它如何弥合文本和图像模态之间的差距。为了扩展其功能，我们开发了一个代理系统，每个代理专门负责不同的任务。该系统集成了LLM（Gemini）和向量数据库，可以使用文本和图像嵌入检索产品目录。我们还利用Gemini的多模态推理来改进销售辅助代理的响应，使其更像真实的人类。

总之，借助BLIP-2、Gemini和PG Vector等工具，多模态搜索和检索的未来已经实现，未来的搜索引擎将与我们今天使用的搜索引擎大不相同。

参考文献

【1】Junnan Li、Dongxu Li、Silvio Savarese、Steven Hoi，2023年。BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models（BLIP-2：使用冻结图像编码器和大型语言模型进行引导语言图像预训练），arXiv:2301.12597。

【2】Prannay Khosla、Piotr Teterwak、Chen Wang、Aaron Sarna、Yonglong Tian、Phillip Isola、Aaron Maschinot、Ce Liu、Dilip Krishnan，2020年。Supervised Contrastive Learning（监督对比学习），arXiv:2004.11362。

【3】Junnan Li、Ramprasaath R. Selvaraju、Akhilesh Deepak Gotmare、Shafiq Joty、Caiming Xiong、Steven Hoi，2021年。Align before Fuse: Vision and Language Representation Learning with Momentum Distillation（融合前对齐：使用动量蒸馏进行视觉和语言表征学习），arXiv:2107.07651。

【4】李东，南阳，王文辉，魏福如，刘晓东，王宇，高剑锋，周明，Hsiao-Wen Hon。2019。Unified Language Model Pre-training for Natural Language Understanding and Generation（自然语言理解和生成的统一语言模型预训练），arXiv:1905.03197。

译者介绍

朱先忠，51CTO社区编辑，51CTO专家博客、讲师，潍坊一所高校计算机教师，自由编程界老兵一枚。

原文标题：Multimodal Search Engine Agents Powered by BLIP-2 and Gemini，作者：Luís Roque，Rafael Guedes

标签

BLIP-2

Gemini

人工智能

已于2025-3-6 08:46:01修改

社区头条

51CTO

51CTO博客

51CTO学堂

基于BLIP-2和Gemini开发多模态搜索引擎代理原创

简介

BLIP-2：多模态模型

架构

损失函数

训练过程

第1阶段——引导视觉语言表征：

第2阶段——引导视觉到语言的生成：

使用BLIP-2和Gemini创建多模态时尚搜索代理

结论

参考文献

译者介绍

目录

51CTO

51CTO博客

51CTO学堂

基于BLIP-2和Gemini开发多模态搜索引擎代理 原创

简介

BLIP-2：多模态模型

架构

损失函数

训练过程

第1阶段——引导视觉语言表征：

第2阶段——引导视觉到语言的生成：

使用BLIP-2和Gemini创建多模态时尚搜索代理

结论

参考文献

译者介绍

目录

基于BLIP-2和Gemini开发多模态搜索引擎代理原创