Crawl4AI: An Automated Web Scraping Tool for AI Agents

Published 2024-11-8 14:59

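Crawl4AI is a free, open-source tool that uses AI to simplify web crawling and data extraction, making information gathering and analysis more efficient. It identifies the content of a web page and converts it into formats that are easy to process, such as Markdown and structured JSON. It is full-featured and simple to use.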

1 Steps for using Crawl4AI

Step 1: Installation and setup

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
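
As a quick sanity check after installation, the sketch below uses only the standard library to report which packages cannot be imported. The import names in the list are assumed to match the packages named in the pip command above, and `missing_modules` is a hypothetical helper, not part of Crawl4AI.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed in Step 1.
    required = ["crawl4ai", "transformers", "torch", "nltk"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If any module is reported as missing, rerunning the pip command inside the same Python environment is usually the first thing to try.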

Step 2: Data extraction

Create a Python script that starts the crawler and extracts data from a URL:

from crawl4ai import WebCrawler

# Create a WebCrawler instance
crawler = WebCrawler()

# Warm up the crawler (loads the required models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")

# Print the extracted content as Markdown
print(result.markdown)
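
Printing is fine for a first look, but the Markdown is often more useful on disk. The following is a minimal sketch of saving it, assuming only that the result object exposes a `markdown` attribute as in the script above. `save_markdown` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in so the example runs without a network call.

```python
from pathlib import Path
from types import SimpleNamespace


def save_markdown(result, path):
    """Write result.markdown to `path`; return False if there is nothing to write."""
    markdown = getattr(result, "markdown", None)
    if not markdown:
        return False
    Path(path).write_text(markdown, encoding="utf-8")
    return True


if __name__ == "__main__":
    # Stand-in for the object returned by crawler.run(); only .markdown is assumed.
    fake_result = SimpleNamespace(markdown="# Pricing\n\nExample content.")
    if save_markdown(fake_result, "pricing.md"):
        print("Saved Markdown to pricing.md")
    else:
        print("Nothing to save")
```

In a real script, the `result` returned by `crawler.run(...)` would be passed in place of `fake_result`.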

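Step 3: Structuring the data

Define an extraction strategy backed by an LLM (large language model) to convert the crawled data into a structured format: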

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output tokens for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

# Crawl the page and have the LLM map its content onto the schema above
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),  # read the API key from the environment
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
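
To work with the extracted data in code rather than as printed text, it can be parsed into typed records. The sketch below assumes that `extracted_content` is a JSON string holding a list of objects (or a single object) with the three schema fields. That shape is an assumption, not something the script above guarantees. `ModelFeeRecord` and `parse_fees` are hypothetical names, and the sample string reuses the example record from the instruction prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelFeeRecord:
    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content):
    """Parse a JSON string into records, skipping items that lack a required field."""
    items = json.loads(extracted_content)
    if isinstance(items, dict):
        # Tolerate a single object instead of a list.
        items = [items]
    fields = ("model_name", "input_fee", "output_fee")
    records = []
    for item in items:
        if isinstance(item, dict) and all(key in item for key in fields):
            records.append(ModelFeeRecord(*(str(item[key]) for key in fields)))
    return records


if __name__ == "__main__":
    # Sample input assumed to mimic the shape of result.extracted_content.
    sample = (
        '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
        '"output_fee": "US$30.00 / 1M tokens"}]'
    )
    for record in parse_fees(sample):
        print(f"{record.model_name}: {record.input_fee} in, {record.output_fee} out")
```

A standard-library dataclass is used here to keep the example dependency-free; validating each item with the `OpenAIModelFee` Pydantic model would be a natural alternative.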

Step 4: Integrating with AI agents

Integrate Crawl4AI with PraisonAI CrewAI agents to process the extracted data efficiently:

pip install praisonai

Create a tools file (tools.py) that wraps Crawl4AI as a tool:

# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool

class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input tokens for the model.")
    output_fee: str = Field(..., description="Fee for output tokens for the model.")

class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fee information from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()

        # Crawl the page and have the LLM map its content onto the ModelFee schema
        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                # The example record uses the same field names as the ModelFee schema
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    # Test ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)