LanceDB: An Efficient Embedded Vector Database Built for AI Applications

Published 2024-12-24 11:41

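Vector databases have become a crowded, red-ocean market, with both new entrants and traditional database vendors building products. Embedded, on-device vector databases are still rare, however. ChromaDB is one of them, but it is not a purely native, deeply optimized embedded vector database: it still relies on the Parquet format (reading a single row means reading and decompressing an entire block, which is slow, and the extra copies take up space), and its feature set is limited. Is there a better option? Many people naturally think of SQLite, the king of embedded relational databases, but its vector extension, sqlite-vec, is still under development. So is there an alternative with good documentation and good performance? LanceDB is one candidate.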

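LanceDB is an open-source vector database designed for building AI applications. It uses an embedded architecture, so there is no separate server to deploy, and it can be integrated easily into a wide range of applications.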

Its core features and strengths are:

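Embedded architecture. Unlike products such as Qdrant that require a deployed server, LanceDB is embedded: it runs as part of the application, which makes it easy to integrate and removes the need to manage extra infrastructure.
The Lance data format, designed for AI (the biggest highlight). LanceDB uses the Lance columnar storage format, which is optimized for this workload and scans faster than the traditional Parquet format. It supports data fragments and loads only the fragments a query needs, which greatly reduces I/O overhead. It also provides the automatic data versioning that machine learning workflows need: each version is linked to the metadata for its files, schema, and blobs, so updating data does not require a full rewrite (zero-copy).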
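
The versioning idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lance's actual on-disk format or API: the names `Manifest` and `ToyVersionedTable` are hypothetical, and data files are modeled as entries in a dictionary. The point it shows is that each version is just a manifest listing immutable fragments, so appending data creates a new manifest that reuses the old fragments instead of rewriting them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """One table version: a number plus the fragments that belong to it."""
    version: int
    fragments: tuple  # names of immutable data files (hypothetical naming)


class ToyVersionedTable:
    """Toy model of manifest-based versioning; not the real Lance format."""

    def __init__(self):
        self._files = {}  # fragment name -> list of rows
        self._manifests = [Manifest(version=0, fragments=())]

    @property
    def latest(self):
        return self._manifests[-1]

    def append(self, rows):
        """Write one new fragment and publish a manifest that references it."""
        name = f"fragment-{len(self._files)}.data"
        self._files[name] = list(rows)
        # Existing fragments are referenced again, never copied or rewritten.
        new = Manifest(self.latest.version + 1, self.latest.fragments + (name,))
        self._manifests.append(new)
        return new.version

    def read(self, version=None):
        """Read the rows visible at a given version (latest by default)."""
        manifest = self.latest if version is None else self._manifests[version]
        return [row for name in manifest.fragments for row in self._files[name]]


table = ToyVersionedTable()
v1 = table.append([{"item": "foo"}])
v2 = table.append([{"item": "bar"}])
print(len(table.read(v1)), len(table.read(v2)))
```

Reading at `v1` still sees only the first row after the second append, which is the property that makes older versions cheap to keep around.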

Compared with other common formats, its advantages in machine learning scenarios are clear:

[Figure: the data CAP theorem]

Use case             Lance  Parquet & ORC  JSON & XML  TFRecord  Database  Warehouse
Analytics            Fast   Fast           Slow        Slow      Decent    Fast
Feature Engineering  Fast   Fast           Decent      Slow      Decent    Good
Training             Fast   Decent         Slow        Fast      N/A       N/A
Exploration          Fast   Slow           Fast        Slow      Fast      Decent
Infra Support        Rich   Rich           Decent      Limited   Rich      Rich

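High-performance vector search. LanceDB is written in Rust and performs well. According to the official benchmarks, on the same hardware, query latency on a dataset of one billion 128-dimensional vectors can be kept under 100 ms. GPU acceleration is also supported.
Rich ecosystem integration. LanceDB natively supports Python and JavaScript/TypeScript and integrates with mainstream AI frameworks such as LangChain and LlamaIndex. It also works with data tools such as Apache Arrow, Pandas, Polars, and DuckDB.
Multimodal data support. Beyond vectors, LanceDB can efficiently store and retrieve unstructured data such as text, images, and audio, with no additional storage solution required.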

LanceDB is simple to use. Here are some examples:

Python version:

import lancedb

# Connect to (or create) a local database directory
db = lancedb.connect("data/sample-lancedb")

# Create a table and insert data
table = db.create_table("my_table",
    data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
          {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])

# Run a vector search and return the two nearest rows as a DataFrame
result = table.search([100, 100]).limit(2).to_pandas()
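
To make clear what the search call above returns, here is a minimal pure-Python sketch of the same query as an exhaustive nearest-neighbor scan. It assumes the Euclidean (L2) metric and reuses the two rows from the example; the helper names `l2_distance` and `brute_force_search` are hypothetical and are not part of the LanceDB API, which may use an index rather than a full scan.

```python
import math

# Same toy rows as in the LanceDB example.
rows = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(rows, query, limit):
    """Score every row against the query and return the `limit` closest."""
    scored = [dict(row, distance=l2_distance(row["vector"], query)) for row in rows]
    scored.sort(key=lambda r: r["distance"])
    return scored[:limit]


hits = brute_force_search(rows, [100, 100], limit=2)
for hit in hits:
    print(hit["item"], round(hit["distance"], 2))
```

Under this metric "bar" ranks ahead of "foo", because [5.9, 26.5] lies closer to [100, 100] than [3.1, 4.1] does.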

JavaScript version, used together with transformers:

async function example() {

    const lancedb = require('vectordb')

    // Import transformers and the all-MiniLM-L6-v2 model (https://huggingface.co/Xenova/all-MiniLM-L6-v2)
    const { pipeline } = await import('@xenova/transformers')
    const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');


    // Create embedding function from pipeline which returns a list of vectors from batch
    // sourceColumn is the name of the column in the data to be embedded
    //
    // Output of pipe is a Tensor { data: Float32Array(384) }, so filter for the vector
    const embed_fun = {}
    embed_fun.sourceColumn = 'text'
    embed_fun.embed = async function (batch) {
        let result = []
        for (let text of batch) {
            const res = await pipe(text, { pooling: 'mean', normalize: true })
            result.push(Array.from(res['data']))
        }
        return (result)
    }

    // Link a folder and create a table with data
    const db = await lancedb.connect('data/sample-lancedb')

    const data = [
        { id: 1, text: 'Cherry', type: 'fruit' },
        { id: 2, text: 'Carrot', type: 'vegetable' },
        { id: 3, text: 'Potato', type: 'vegetable' },
        { id: 4, text: 'Apple', type: 'fruit' },
        { id: 5, text: 'Banana', type: 'fruit' }
    ]

    const table = await db.createTable('food_table', data, embed_fun)


    // Query the table
    const results = await table
        .search("a sweet fruit to eat")
        .metricType("cosine")
        .limit(2)
        .execute()
    console.log(results.map(r => r.text))

}

example().then(_ => { console.log("Done!") })
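
The embedding step and the cosine metric in the JavaScript example can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the Transformers.js implementation: the token vectors below are made-up 3-dimensional values rather than real model output, and the function names are hypothetical. It shows mean pooling over token vectors, L2 normalization, and cosine similarity.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical token embeddings for one short sentence.
tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
sentence = l2_normalize(mean_pool(tokens))

same_direction = cosine_similarity(sentence, [1.0, 0.0, 1.0])
orthogonal = cosine_similarity(sentence, [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because the vectors are normalized to unit length, cosine similarity reduces to a plain dot product, which is one reason embeddings are often normalized before being stored.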

More resources: https://github.com/lancedb/vectordb-recipes

Compared with vector databases that require a deployed server, LanceDB's embedded architecture is particularly well suited to:

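Desktop applications that need to run locally
Resource-constrained edge computing environments
Scenarios with strict data privacy requirements
Rapid prototyping and testing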

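Although LanceDB shows significant performance advantages on massive datasets, for most small and medium-sized AI applications, development efficiency and ease of use are probably the more important considerations. LanceDB's simple, intuitive API and mature ecosystem support make it an ideal choice for building all kinds of AI applications.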

Summary

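In fact, many applications already use LanceDB as part of their implementation, including Microsoft's GraphRAG, Character AI, and Midjourney, and LanceDB has raised an $8 million seed round with backing from YC. In 2025 we expect an explosion of multimodal LLM applications, which will bring a new wave of interest in vector databases. As a leading embedded vector database, LanceDB is worth considering for both prototyping and production deployment, and in some cases it may be the only sensible choice.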

References:

https://blog.lancedb.com/new-funding-and-a-new-foundation-for-multimodal-ai-data/

https://lancedb.github.io/

https://github.com/lancedb/lancedb

This article is reposted from AI工程化. Author: ully
