Creating an LLM (Large Language Model) Bot on Any Dataset - 51CTO.COM

I recently came across a repository called Embedchain that looks quite practical, so I'm sharing it here. The repository is at:

https://github.com/embedchain/embedchain

It is built on OpenAI, but you can add your own datasets and generate a chatbot from them. It is simple to use and easy to get started with.

Introduction to Embedchain

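Embedchain is a framework for easily creating LLM (Large Language Model) bots on any dataset. It abstracts away the whole process of loading a dataset, chunking it, creating embeddings, and storing them in a vector database. You add one or more datasets with the .add and .add_local functions, then use the .query function to find answers from the added datasets.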
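To make the load, chunk, embed, store, and query steps concrete, here is a toy sketch in plain Python. It is not Embedchain's implementation: the word-count "embedding", the in-memory list standing in for a vector database, and the names chunk_text, embed, cosine, and TinyStore are all assumptions chosen for illustration. A real setup would use a learned embedding model and a proper vector store.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyStore:
    """In-memory stand-in for a vector database (hypothetical, for illustration)."""

    def __init__(self):
        self.rows = []  # list of (chunk, vector) pairs

    def add(self, text, max_words=40):
        """Chunk the text, embed each chunk, and store both."""
        for chunk in chunk_text(text, max_words):
            self.rows.append((chunk, embed(chunk)))

    def search(self, question, k=1):
        """Return the k stored chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]
```

In a real bot, the retrieved chunks would then be handed to the LLM as context for answering the question; this sketch stops at retrieval.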

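Suppose you admire someone impressive, say Naval Ravikant, and want to turn his knowledge into a chatbot. You can add his YouTube videos, PDF books, blog posts, and a question-and-answer pair you supply yourself, and Embedchain will create the bot for you. Here is an example: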

from embedchain import App

naval_chat_bot = App()

# Embed online resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed local resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))

naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?")
# Answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.

Using Embedchain

To get started with Embedchain, first make sure the package is installed. If it is not, install it with pip:

pip install embedchain

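Embedchain uses OpenAI's embedding model to create embeddings for the chunks, and the ChatGPT API as the LLM that answers from the relevant documents. Make sure you have an OpenAI account and an API key. If you don't have an API key, you can create one via this link [1].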

Once you have an API key, set it in an environment variable named OPENAI_API_KEY:

import os
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
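Hard-coding a key in source code makes it easy to leak through version control. As a hedged alternative, the sketch below assumes the variable has already been set outside the script (for example in the shell) and fails early with a clear message if it has not. The helper name require_openai_key is hypothetical and not part of Embedchain or OpenAI's tooling.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in your shell before creating the app."
        )
    return key
```

Calling require_openai_key() at the top of a script surfaces a missing key before any dataset is loaded.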

Next, import the App class from embedchain and use the .add function to add any dataset.

from embedchain import App

naval_chat_bot = App()

# Embed online resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed local resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))
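Since online sources go through .add and local content goes through .add_local, a small helper can pick the right call from the data type. This is a sketch under assumptions: the helpers add_source and add_sources and the two sets of type names are hypothetical, and they cover only the data types mentioned in this article. Here app can be any object exposing the .add and .add_local methods.

```python
# Data types from this article, grouped by the method they are used with.
ONLINE_TYPES = {"youtube_video", "pdf_file", "web_page"}
LOCAL_TYPES = {"qna_pair", "text"}


def add_source(app, data_type, content):
    """Route a source to app.add or app.add_local based on its data type."""
    if data_type in ONLINE_TYPES:
        app.add(data_type, content)
    elif data_type in LOCAL_TYPES:
        app.add_local(data_type, content)
    else:
        raise ValueError(f"Unsupported data type: {data_type!r}")


def add_sources(app, sources):
    """Add several (data_type, content) pairs in order."""
    for data_type, content in sources:
        add_source(app, data_type, content)
```

For instance, add_sources(naval_chat_bot, [("web_page", "https://nav.al/agi")]) would make the same .add call as the line written by hand.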

If your script or application already has other app instances of its own, you can import the class under a different name:

from embedchain import App as EmbedChainApp

# or

from embedchain import App as ECApp

Your app is now ready. Use the .query function to get an answer to any query.

print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
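To ask several questions in one go, the bot can be wrapped in a small helper. The sketch below assumes only that the bot exposes a .query method that takes a question string and returns the answer; the function name ask_all is hypothetical.

```python
def ask_all(bot, questions):
    """Query the bot once per question and return (question, answer) pairs.

    Blank questions are skipped so they do not trigger an API call.
    """
    results = []
    for question in questions:
        question = question.strip()
        if not question:
            continue
        results.append((question, bot.query(question)))
    return results
```

Each pair returned by ask_all(naval_chat_bot, [...]) can then be printed or logged as needed.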

Supported Formats

The following formats are supported:

YouTube videos

To add any YouTube video to your app, pass youtube_video as the data type (the first argument to .add). For example:

app.add('youtube_video', 'a_valid_youtube_url_here')

PDF files

To add any PDF file, use the data type pdf_file. For example:

app.add('pdf_file', 'a_valid_url_where_pdf_file_can_be_accessed')

Note that password-protected PDFs are not supported.

Web pages

To add any web page, use the data type web_page. For example:

app.add('web_page', 'a_valid_web_page_url')

Text

To supply your own text, use the data type text and pass a string. The text is not processed further, so it can be used very flexibly. For example:

app.add_local('text', 'Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.')

Note: this is not used in the example above because, in most cases, you will provide a whole paragraph or file.
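When the text lives in a file rather than a single string, one option is to split the file into paragraphs and add each one with the text data type. The sketch below is an assumption about how one might do that, not a feature of Embedchain: split_paragraphs and add_text_file are hypothetical helpers, and app is any object with an .add_local method that accepts a data type and a string.

```python
import re
from pathlib import Path


def split_paragraphs(text):
    """Split text on blank lines, collapse inner whitespace, drop empty paragraphs."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def add_text_file(app, path, encoding="utf-8"):
    """Add every paragraph of a local text file via app.add_local('text', ...).

    Returns the number of paragraphs added.
    """
    paragraphs = split_paragraphs(Path(path).read_text(encoding=encoding))
    for paragraph in paragraphs:
        app.add_local("text", paragraph)
    return len(paragraphs)
```

Splitting on blank lines keeps each paragraph as one unit; how the library chunks the text afterwards is left to Embedchain itself.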