RAG Document Parsers: A Look at the Core Techniques

Published 2024-9-20 11:08

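RAG has been gaining traction lately, yet document parsing, a crucial step in the pipeline, gets little attention. In the end, however advanced the retrieval and generation techniques, the final result depends on the quality of the documents themselves. If the document content is incomplete or the formatting is garbled, no amount of tuning the retrieval strategy, the embedding model, or the large language model (LLM) will help.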

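This article covers three popular document extraction strategies and shows each one in practice by parsing a table from Amazon's Q1 2024 earnings release.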

1 Text parsers: the basic tool

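Text parsers have been around for years; they read a document and extract its text. Common tools include PyPDF, PyMuPDF, and PDFMiner. The rest of this section focuses on PyMuPDF, using it alongside LlamaIndex to parse a specific page. The code is as follows: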

import fitz  # PyMuPDF
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode

file_path = "/content/AMZN-Q1-2024-Earnings-Release.pdf"
doc = fitz.open(file_path)

# Splitter that breaks long page text into chunks of at most 2048 tokens.
text_parser = SentenceSplitter(
    chunk_size=2048,
)

# Extract the plain text of every page and split it into chunks.
text_chunks = []
for page in doc:
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)

# Wrap each chunk in a LlamaIndex TextNode.
nodes = []
for text_chunk in text_chunks:
    node = TextNode(
        text=text_chunk,
    )
    nodes.append(node)

# Print the chunk at index 10 (its output is shown below).
print(nodes[10].text)

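PyMuPDF extracts the text itself well, but it handles layout poorly. This can cause problems in the generation step, especially when the LLM cannot recover the document's structure.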

Below is the extracted excerpt from Amazon's financial statements:

AMAZON.COM, INC.
Consolidated Statements of Comprehensive Income
(in millions)
(unaudited)
  
Three Months Ended
March 31,
 
2023
2024
Net income
$ 
3,172 $ 
10,431 
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30
 
386  
(1,096) 
Available-for-sale debt securities:
Change in net unrealized gains (losses), net of tax of $(29) and $(158)
 
95  
536 
Less: reclassification adjustment for losses (gains) included in “Other income 
(expense), net,” net of tax of $(10) and $0
 
33  
1 
Net change
 
128  
537 
Other, net of tax of $0 and $(1)
 
—  
1 
Total other comprehensive income (loss)
 
514  
(558) 
Comprehensive income
$ 
3,686 $ 
9,873
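
To make the structural problem concrete, here is a minimal sketch (not part of the original walkthrough) that tries to regroup flattened lines like the ones above into rows. It assumes a simple heuristic: a line holding only a number, a dash, or currency symbols is a value cell, a line ending in ":" is a section header, and anything else is label text. The name `regroup_rows` and the regular expression are illustrative assumptions, and the heuristic would need adjusting for other layouts (for example, the year headers "2023" and "2024" would be treated as values).

```python
import re

# Assumption: a "value" line contains only an optional "$", then a number such
# as 3,172 or (1,096) or a lone dash, optionally followed by a trailing "$".
VALUE_RE = re.compile(r"^\$?\s*(\(?[\d,]+\)?|—)?\s*\$?$")


def regroup_rows(lines):
    """Group flattened text lines into (label, [values]) rows."""
    rows, label_parts, values = [], [], []

    def flush():
        # Emit the pending row, if any, and reset the buffers.
        if label_parts or values:
            rows.append((" ".join(label_parts), list(values)))
            label_parts.clear()
            values.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = VALUE_RE.match(line)
        if match:
            # Keep real values; skip lines that are only a currency symbol.
            if match.group(1):
                values.append(match.group(1))
            continue
        if values:
            # A new label after collected values starts a new row.
            flush()
        label_parts.append(line)
        if line.endswith(":"):
            # Section headers carry no values of their own.
            flush()
    flush()
    return rows


sample = [
    "Net income", "$ ", "3,172 $ ", "10,431 ",
    "Other comprehensive income (loss):",
    "Foreign currency translation adjustments, net of tax of $(10) and $30",
    " ", "386  ", "(1,096) ",
    "Less: reclassification adjustment for losses (gains) included in “Other income ",
    "(expense), net,” net of tax of $(10) and $0",
    " ", "33  ", "1 ",
]
rows = regroup_rows(sample)
for label, vals in rows:
    print(label, vals)
```

Even on this small sample the rules are fragile: they depend on labels never looking like numbers, which is exactly the structural knowledge a plain text parser throws away.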

Next, let's see how OCR performs at document parsing.

2 OCR: image recognition

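from pdf2image import convert_from_path
from PIL import Image
import pytesseract

file_path = "/content/AMZN-Q1-2024-Earnings-Release.pdf"

# Render every PDF page to a PIL image (pdf2image needs poppler installed).
pages = convert_from_path(file_path)

# Save the page at index 10 as a JPEG.
i = 10
filename = "page" + str(i) + ".jpg"
pages[i].save(filename, "JPEG")

# Run Tesseract OCR on the image, then re-join words hyphenated across lines.
text = pytesseract.image_to_string(Image.open(filename))
text = text.replace("-\n", "")

# Write the recognized text to a file, overwriting any earlier run.
outfile = "page" + str(i) + "_text.txt"
with open(outfile, "w", encoding="utf-8") as f:
    f.write(text)

print(text)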

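OCR (output below) does a better job of capturing the document's text and structure: each row's label and values stay on one line. It is not error-free, though: note the stray "§" and "_" characters, and the "Net change" value for 2024 reads 231 where the other two parsers' outputs show 537.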

AMAZON.COM, INC.
Consolidated Statements of Comprehensive Income
(in millions)

(unaudited)
Three Months Ended
March 31,
2023 2024
Net income $ 3,172 §$ 10,431
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30 386 (1,096)
Available-for-sale debt securities:
Change in net unrealized gains (losses), net of tax of $(29) and $(158) 95 536
Less: reclassification adjustment for losses (gains) included in “Other income
(expense), net,” net of tax of $(10) and $0 33 1
Net change 128 231
Other, net of tax of $0 and $(1) _— 1
Total other comprehensive income (loss) 514 (558)

Comprehensive income $ 3,686 $ 9,873
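
Because OCR can misread digits, a cheap arithmetic cross-check on subtotals is one way to catch such errors. The sketch below assumes, based on the figures in the text-parser output earlier in the article (95 + 33 = 128 and 536 + 1 = 537), that "Net change" is the sum of the two lines above it. The helper names `parse_amount` and `check_sum` are illustrative, not part of any library.

```python
def parse_amount(cell):
    """Convert a statement cell such as '(1,096)', '$ 3,172' or '—' to an int."""
    text = cell.replace("$", "").replace(",", "").strip()
    if text in ("", "—", "-"):
        return 0  # a dash in the statement means zero
    if text.startswith("(") and text.endswith(")"):
        return -int(text[1:-1])  # parentheses mark a negative amount
    return int(text)


def check_sum(components, reported):
    """Return (matches, expected) for a reported subtotal and its components."""
    expected = sum(parse_amount(c) for c in components)
    return expected == parse_amount(reported), expected


# 2023 column of the OCR output: 95 + 33 against "Net change" 128.
check_2023 = check_sum(["95", "33"], "128")

# 2024 column of the OCR output: 536 + 1 against "Net change" 231.
check_2024 = check_sum(["536", "1"], "231")

# Comprehensive income for 2024: net income plus total other comprehensive income.
check_total = check_sum(["10,431", "(558)"], "9,873")

print(check_2023, check_2024, check_total)
```

The 2024 "Net change" check fails with an expected value of 537, which flags the OCR reading of 231 as suspect without needing a second parser.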

Finally, let's look at intelligent document parsing.

3 Intelligent document parsing (IDP): structured extraction

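Intelligent document parsing (IDP) is an emerging technology that aims to extract all relevant information from a document and present it in a structured format. Several IDP tools are available, such as LlamaParse, DocSumo, Unstructured.io, and Azure Doc Intelligence.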

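What these tools have in common is that they combine OCR (optical character recognition), text extraction, multimodal LLMs, and conversion to markdown in order to extract text effectively. Take LlamaParse, released by LlamaIndex, as an example: you first obtain an API key and then parse the document through the API.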

import getpass
import os
from copy import deepcopy

import nest_asyncio
from llama_index.core.schema import TextNode
from llama_parse import LlamaParse

# LlamaParse reads the API key from this environment variable.
os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass()

# Allow LlamaParse's async calls to run inside a notebook event loop.
nest_asyncio.apply()

file_path = "/content/AMZN-Q1-2024-Earnings-Release.pdf"
documents = LlamaParse(result_type="markdown").load_data(file_path)


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes, by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            # Each page keeps a copy of the parent document's metadata.
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes


nodes_lp = get_page_nodes(documents)
print(nodes_lp[10].text)

The result below is structured as markdown and is arguably the best-structured representation of the three.

# AMAZON.COM, INC.

# Consolidated Statements of Comprehensive Income

| |Three Months Ended March 31, 2023|Three Months Ended March 31, 2024|
|---|---|---|
|Net income|$3,172|$10,431|
|Other comprehensive income (loss):| | |
|Foreign currency translation adjustments, net of tax of $(10) and $30|386|(1,096)|
|Available-for-sale debt securities:| | |
|Change in net unrealized gains (losses), net of tax of $(29) and $(158)|95|536|
|Less: reclassification adjustment for losses (gains) included in “Other income (expense), net,” net of tax of $(10) and $0|33|1|
|Net change|128|537|
|Other, net of tax of $0 and $(1)|—|1|
|Total other comprehensive income (loss)|514|(558)|
|Comprehensive income|$3,686|$9,873|

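One caveat: the output above drops some key context. In particular, the parsed document no longer carries the unit label "(in millions)", which could lead the generator LLM to misread the figures.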
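
One possible mitigation, sketched here under stated assumptions, is to recover the unit note from the raw text layer of the same page (for instance the text-parser output) and prepend it to the markdown chunk before indexing. The function names and the regular expression are hypothetical, and the pattern only covers notes phrased like "(in millions)".

```python
import re

# Assumption: unit notes look like "(in millions)" or
# "(in millions, except per share data)".
UNIT_RE = re.compile(r"\(in (thousands|millions|billions)[^)]*\)", re.IGNORECASE)


def find_unit_note(raw_page_text):
    """Return the first unit note found in the raw page text, or None."""
    match = UNIT_RE.search(raw_page_text)
    return match.group(0) if match else None


def add_unit_context(markdown_chunk, raw_page_text):
    """Prepend the page's unit note to a markdown chunk if it is missing."""
    note = find_unit_note(raw_page_text)
    if note is None or note.lower() in markdown_chunk.lower():
        return markdown_chunk
    return f"Units: {note}\n\n{markdown_chunk}"


raw_text = (
    "AMAZON.COM, INC.\n"
    "Consolidated Statements of Comprehensive Income\n"
    "(in millions)\n"
    "(unaudited)\n"
)
markdown = "|Net income|$3,172|$10,431|"

enriched = add_unit_context(markdown, raw_text)
print(enriched)
```

Calling `add_unit_context` a second time on the enriched chunk leaves it unchanged, so the step is safe to repeat in a pipeline.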

4 Conclusion

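To improve your RAG application's performance, the key is choosing the right document parser. Each parsing strategy has its own strengths and limitations: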

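Text parsers: tools such as PyPDF or PyMuPDF extract text efficiently but may lose the document's structure, which can confuse the language model at generation time.
OCR: tools such as Pytesseract capture the text and its structure more accurately and better preserve the original document's layout and context. However, OCR is usually slower, and its quality depends heavily on the use case. You need to weigh whether the accuracy gain justifies the extra processing time.
Intelligent document parsing (IDP): advanced IDP tools such as LlamaParse combine OCR, text extraction, and multimodal language models to convert a document into structured markdown. Be aware, though, that this approach sometimes drops key context such as units of measure. IDP is also still maturing and may run into scalability and latency problems. When deploying IDP, take these limits into account and plan for possible bottlenecks.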

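Ultimately, which parser to choose depends on your specific use case. The best practice is to try different parsers, evaluate how each performs in your application, and pick the one that best meets your needs. Sometimes combining several methods works better. Keep experimenting and adjusting to get the best results from your RAG application.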
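
As a rough illustration of what such an evaluation could look like, the sketch below scores a parser's output by how many known (label, value) facts appear together on a single line, a crude proxy for preserved table structure. The metric and the name `same_line_score` are assumptions made for illustration; the sample strings are short excerpts of the three outputs shown earlier.

```python
def same_line_score(parsed_text, facts):
    """Fraction of (label, value) pairs that occur together on one line."""
    lines = parsed_text.splitlines()
    hits = 0
    for label, value in facts:
        if any(label in line and value in line for line in lines):
            hits += 1
    return hits / len(facts) if facts else 0.0


# Ground-truth facts for the 2024 column, taken from the statement.
facts = [
    ("Net income", "10,431"),
    ("Net change", "537"),
    ("Comprehensive income", "9,873"),
]

outputs = {
    "text_parser": (
        "Net income\n$ \n3,172 $ \n10,431 \n"
        "Net change\n \n128  \n537 \n"
        "Comprehensive income\n$ \n3,686 $ \n9,873"
    ),
    "ocr": (
        "Net income $ 3,172 §$ 10,431\n"
        "Net change 128 231\n"
        "Comprehensive income $ 3,686 $ 9,873"
    ),
    "idp": (
        "|Net income|$3,172|$10,431|\n"
        "|Net change|128|537|\n"
        "|Comprehensive income|$3,686|$9,873|"
    ),
}

scores = {name: same_line_score(text, facts) for name, text in outputs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Substring matching is deliberately simple here and can produce false hits on longer documents; a real evaluation would more likely measure answer quality on questions drawn from your own documents.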

Reposted from AI科技论谈. Author: AI科技论谈
