Is Your RAG Crawler Letting You Down? Try Crawl4AI: An Open-Source, Efficient Smart Crawler Built for AI (with Hands-On Results)

Published 2025-3-10 08:12

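Recently, our sales team kept reporting the same problem: during customer demos, the knowledge-base crawler in our AI system performed poorly. After a customer's web page URL was entered, it often fetched nothing at all, so the knowledge base could not be updated. As the technical lead, I found this a headache at first. My knowledge of crawlers was stuck in the era of Scrapy and Selenium, tools I considered both complex and time-consuming, so I simply turned down the sales team's request. For a while they regarded our crawler feature as next to useless, until I discovered a crawler that actually works well: Crawl4AI.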

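Since we switched to Crawl4AI, the sales team reports that content we previously could not crawl is now handled with ease. It copes with dynamic content and anti-crawling mechanisms, and it can also use large language models to convert data into Markdown suited to AI processing. Today I will walk through this powerful tool in detail.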

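Why Choose Crawl4AI?

  1. Built for LLMs: the Markdown that Crawl4AI generates is optimized for RAG (retrieval-augmented generation) and fine-tuning applications, and it is concise and smart.
  2. Lightning fast: the project claims results 6x faster than traditional crawlers, with real-time, cost-efficient performance.
  3. Flexible browser control: session management, proxies, and custom hooks keep data access seamless.
  4. Smart extraction: advanced algorithms reduce reliance on expensive models and improve extraction efficiency.
  5. Open source and deployable: fully open source, no API key required, with Docker and cloud integration.
  6. Active community support: a vibrant developer community and a continuously updated GitHub repository.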

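Core Features

1. Markdown Generation

  • Clean Markdown: produces well-structured, accurately formatted Markdown documents.
  • Fit Markdown: heuristic filtering removes noise and irrelevant sections, which makes the result easier for AI to process.
  • Citations and references: page links are automatically converted into a numbered reference list.
  • Custom strategies: users can create their own Markdown generation strategies as needed.
  • BM25 algorithm: BM25-based filtering extracts the core information and drops irrelevant content.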
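
To make the BM25 filtering idea concrete, here is a minimal pure-Python sketch of the Okapi BM25 formula applied to text chunks. It is only an illustration of the technique and not Crawl4AI's actual implementation; the function names, the sample chunks, and the threshold of 0.5 are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each chunk against the query with the Okapi BM25 formula."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: the number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_chunks(query: str, chunks: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only chunks whose score reaches the (hypothetical) threshold."""
    return [c for c, s in zip(chunks, bm25_scores(query, chunks)) if s >= threshold]


chunks = [
    "Tanjong Beach is a quiet beach at the southern end of Sentosa.",
    "Accept cookies to continue. Subscribe to our newsletter.",
    "Siloso Beach is popular for beach volleyball and other sports.",
]
kept = filter_chunks("beach activities in Sentosa", chunks)
print(kept)  # the cookie banner chunk shares no query term and is dropped
```

A chunk that shares no term with the query scores exactly zero, which is why boilerplate such as cookie banners tends to fall below any positive threshold.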

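2. Structured Data Extraction

  • LLM-driven: supports all mainstream large language models (open source and proprietary) for structured data extraction.
  • Chunking strategies: chunking by topic, regular expression, or sentence keeps extraction precise.
  • Cosine similarity: finds the content chunks relevant to a user query for semantic extraction.
  • CSS-based extraction: fast schema-based data extraction using XPath and CSS selectors.
  • Custom schemas: extract structured JSON from repeating patterns.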
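
The chunking and cosine-similarity bullets above can be sketched with the standard library alone. This is an assumption-laden illustration, not Crawl4AI's code: it uses bag-of-words counts where a real system would more likely use dense embeddings, and the names `sentence_chunks`, `vectorize`, and `top_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_chunks(text: str) -> list[str]:
    """Split text into sentence-level chunks after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for real embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(query: str, text: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(sentence_chunks(text), key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


page = (
    "Fort Siloso is a preserved coastal fort. "
    "Tanjong Beach is a quiet place to read. "
    "The skywalk is eleven storeys high!"
)
best = top_chunks("quiet beach", page)
print(best)
```

Only the second sentence shares any term with the query, so it ranks first; the other two score zero.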

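3. Browser Integration

  • Managed browser: use your own browser with full control over the crawl, which helps avoid bot detection.
  • Remote browser control: connect over the Chrome DevTools Protocol for large-scale remote data extraction.
  • Browser profile management: create and manage persistent profiles that save authentication state, cookies, and settings.
  • Proxy support: connect seamlessly to authenticated proxies for secure access.
  • Multi-browser support: compatible with Chromium, Firefox, and WebKit.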

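4. Dynamic Content Handling

  • JavaScript execution: runs JavaScript and waits for asynchronous or synchronous content to load, so dynamic content is captured in full.
  • Lazy-load handling: waits for images to finish loading so that no content is missed.
  • Full-page scanning: simulates scrolling to load content, which suits infinite-scroll pages.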

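5. Deployment and Scaling

  • Dockerized setup: an optimized Docker image with a built-in FastAPI server for quick deployment.
  • Cloud deployment: ready-to-use deployment configurations for major cloud platforms, supporting large-scale production environments.
  • Secure authentication: built-in JWT token authentication protects the API.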

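How to Use Crawl4AI

Installation

pip3 install crawl4ai  # install the crawl4ai library

crawl4ai-setup  # set up the browser

After installation, run crawl4ai-doctor to verify that everything was installed correctly: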

# crawl4ai-doctor

[INIT].... → Running Crawl4AI health check...
[INIT].... → Crawl4AI 0.5.0.post2
[TEST].... ℹ Testing crawling capabilities...

[EXPORT].. ℹ Exporting PDF and taking screenshot took 1.31s
[FETCH]... ↓ https://crawl4ai.com... | Status: True | Time: 4.04s
[SCRAPE].. ◆ https://crawl4ai.com... | Time: 0.06s
[COMPLETE] ● https://crawl4ai.com... | Status: True | Total: 4.10s
[COMPLETE] ● ✅ Crawling test passed!

If you run into any browser-related errors, you can run:

python -m playwright install --with-deps chromium

Running a Simple Crawler

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        run_config = CrawlerRunConfig(
            word_count_threshold=10,  # minimum number of words per content block
            exclude_external_links=True,  # remove external links
            remove_overlay_elements=True,  # remove popups/modals
            process_iframes=True,  # process iframe content
        )
        result = await crawler.arun(
            url="https://www.sentosa.com.sg/en/get-inspired/sentosa-guides/top-free-things-to-do",
            config=run_config,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())


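As you can see, the content Crawl4AI fetched corresponds almost one-to-one with the original page: text, images, and links are all extracted completely and accurately. By contrast, a traditional Scrapy crawler often struggles with dynamic content and complex page structures.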
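
Since the point of crawling to Markdown is to feed a RAG knowledge base, a natural next step is to split the Markdown into chunks before indexing. The sketch below shows one possible way to do that, one chunk per heading. It is not part of Crawl4AI; the function name and the sample text are assumptions, and in a real pipeline the input would be something like `str(result.markdown)` from the crawl above.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading as the title."""
    chunks = []
    title, lines = "", []

    def flush() -> None:
        # Emit the current chunk unless it has neither a title nor any text.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = """# Top Free Things to Do
Intro paragraph.

## Tanjong Beach
A quiet beach.

## Fort Siloso
A preserved coastal fort.
"""
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", chunk["text"])
```

Splitting on headings keeps each attraction's description together with its title, which tends to give a retriever more self-contained passages than fixed-size windows.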



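Using an LLM for Data Extraction

Crawl4AI provides several data extraction strategies, including traditional CSS/XPath-based methods and LLM-based smart extraction. Here is an example that uses the LLM extraction strategy: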

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    LLMConfig,
    LLMExtractionStrategy,
)
from pydantic import BaseModel, Field

INSTRUCTION_TO_LLM = """Extract every title, its content, and the title's image link from the crawled content."""

# Schema describing one extracted item
class Sentosa(BaseModel):
    name: str = Field(..., description="Title")
    content: str = Field(..., description="Content")
    link: str = Field(..., description="Link")


llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="api_key"),  # replace "api_key" with your own key
    schema=Sentosa.model_json_schema(),
    extraction_type="schema",
    instruction=INSTRUCTION_TO_LLM,
    chunk_token_threshold=1000,
    overlap_rate=0.0,
    apply_chunking=True,
    input_format="markdown",
    extra_args={"temperature": 0.0, "max_tokens": 3600},
)

browser_cfg = BrowserConfig(headless=True, verbose=True)

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        run_config = CrawlerRunConfig(
            word_count_threshold=10,  # minimum number of words per content block
            exclude_external_links=True,  # remove external links
            remove_overlay_elements=True,  # remove popups/modals
            process_iframes=True,  # process iframe content
            extraction_strategy=llm_strategy,
        )

        result = await crawler.arun(
            url="https://www.sentosa.com.sg/en/get-inspired/sentosa-guides/top-free-things-to-do",
            config=run_config,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

The output is as follows:

[
    {
        "name": "Top Free Things to Do in Sentosa",
        "content": "Explore the best free activities and attractions in Sentosa, from beautiful beaches to scenic nature trails.",
        "link": "https://www.sentosa.com.sg/en/get-inspired/sentosa-guides/top-free-things-to-do",
        "error": false
    },
    {
        "name": "Stroll along the Sentosa Boardwalk",
        "content": "Before we get to do all the exciting stuff waiting at Sentosa, we have to get there first. And what better way to do that than to stroll along the Sentosa Boardwalk? With its picturesque view of the city backdrop across the sea, it’ll be a waste not to snap a picture and share it with all your friends to see!",
        "link": "https://www.sentosa.com.sg/-/media/sentosa/article-listing/articles/13-free-things-to-do/13tipsgallery14.jpg?revision=622c8081-1f61-427f-a8f1-e4da1d457ca6",
        "error": false
    },
    {
        "name": "Chill at Tanjong Beach",
        "content": "For those looking to have a relaxing time at the beach indulging in their favourite book and music, Tanjong Beach is definitely the place to get away from the hectic pace of city life. Its tranquil atmosphere is thanks to its remote location at the southern end of Sentosa beachfront coastline.",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/tanjong-beach/",
        "error": false
    },
    {
        "name": "Explore the southernmost tip of Asia",
        "content": "For some adventuring at the beach, cross the suspension bridge at Palawan Beach to the southernmost tip of Asia. While you’re there, be sure to climb up the watchtower to enjoy a panoramic view of the South China sea as ships sail pass.",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/southernmost-point-of-continental-asia/",
        "error": false
    },
    {
        "name": "Play some beach sports at Siloso Beach",
        "content": "The beach isn’t just for sightseeing and chilling, it’s where all sorts of people gather to play beach sports such as Beach Volleyball, Ultimate Frisbee and Football! So call up your friends and family and head down to Siloso Beach to engage in one of the most fun beach activities you can do!",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/siloso-beach/",
        "error": false
    },
    {
        "name": "Unleash and discover magical experiences at Sensoryscape",
        "content": "Calling all my nature lovers, those who love scenic views or mesmerising interactive projections! Sensoryscape is the perfect place for you, and it's completely free too! So come on down for relaxing vibes during the day and stay for an enchanting night experience. As night falls, watch how this calming place transforms into the ImagiNite experience. Immerse yourself in the various sensory gardens like Symphony Streams enchanting underwater world, interactive projections at Palate Playground, dancing light beams at Lookout Loop, glowing giant flower stalks at Glow Garden and more. Don’t forget to download the ImagiNite App and witness the light shows and projections come to life! End your day in the most magical way possible at Sensoryscape where one can experience a blooming merge between vibrant reefs and lush ridges.",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/sensoryscape/",
        "error": false
    },
    {
        "name": "Walk along Fort Siloso Skywalk",
        "content": "Singapore has its own fair share of well-known Skywalks such as the OCBC Skyway and Henderson Waves, but the tallest one yet is Fort Siloso Skywalk, at 11 storeys high! It boasts beautiful views of Western Sentosa, Mount Faber and Keppel Harbour. Be sure to take plenty of photos but please don’t drop your phones while doing so!",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/fort-siloso-skywalk/",
        "error": false
    },
    {
        "name": "Go back in time at Fort Siloso",
        "content": "Singapore’s only preserved coastal fort is a treasure trove of WWII memorabilia. You can learn of its rich history by walking along its two trails: the Heritage Trail and the Gun Trail. Alternatively, you can go back in time with the Surrender Chambers immersive show where you get to relive Singapore’s epoch-making events.",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/fort-siloso/",
        "error": false
    },
    {
        "name": "Go hiking at Sentosa Nature Discovery",
        "content": "Nature lovers are sure to love this particular activity because in this 1.8-kilometre trek through a rainforest, you’ll get to get up and close with the birds, insects, plants of Sentosa. If you’re really observant, there’s over 20 different species of birds and even other animals like geckos and squirrels to see!",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/sentosa-nature-discovery/",
        "error": false
    },
    {
        "name": "Bring your date to Quayside Isle",
        "content": "Situated near Sentosa Cove, taking a leisurely walk along Quayside Isle’s cobbled pavements in the evening is the perfect way to wrap up a date. The neatly arranged yachts also lend some charm to the scene of the setting sun. With its quiet atmosphere and scenic view, it’s a great way to unwind and end the day.",
        "link": "",
        "error": false
    },
    {
        "name": "Experience free live music & events",
        "content": "From outdoor movie nights to beach concerts, Sentosa is always alive with free entertainment. Keep an eye on Sentosa’s event calendar for upcoming music gigs, pop-up markets, and cultural performances.",
        "link": "https://www.sentosa.com.sg/en/things-to-do/events/live-music-performance/",
        "error": false
    },
    {
        "name": "Hike through Sentosa’s nature trails",
        "content": "Escape the crowds and explore Sentosa’s lush greenery with these hidden gems: Imbiah Trail – Spot unique wildlife and discover ancient rock formations. Coastal Trail – A scenic, seaside route with panoramic ocean views.",
        "link": "",
        "error": false
    }
]
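
Note that two entries in the output above have an empty "link". Before loading results like this into a knowledge base, it may be worth parsing and filtering them. The following is a minimal sketch using only the standard library; the `Attraction` dataclass and the `parse_extracted` helper are hypothetical names, and the sample records are shortened stand-ins for the real output, which in practice would come from `result.extracted_content`.

```python
import json
from dataclasses import dataclass


@dataclass
class Attraction:
    name: str
    content: str
    link: str


def parse_extracted(raw: str) -> list[Attraction]:
    """Parse the JSON string and drop entries flagged as errors or lacking a link."""
    items = json.loads(raw)
    return [
        Attraction(item["name"], item["content"], item["link"])
        for item in items
        if not item.get("error") and item.get("link")
    ]


# Shortened stand-ins for two of the records shown above.
raw = json.dumps([
    {"name": "Chill at Tanjong Beach", "content": "A quiet beach.",
     "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/tanjong-beach/",
     "error": False},
    {"name": "Bring your date to Quayside Isle", "content": "A leisurely walk.",
     "link": "", "error": False},
])
attractions = parse_extracted(raw)
print([a.name for a in attractions])
```

Whether to drop or keep link-less entries depends on the use case; for a text-only knowledge base, keeping them and leaving the link blank would be equally reasonable.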

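Handling Dynamic Content

Crawl4AI can handle content that is loaded dynamically through JavaScript. Here is an example that configures the crawler to execute JavaScript: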

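import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # scroll to the bottom to trigger loading
        wait_for="css:.content-loaded",  # wait until this element appears
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

asyncio.run(main())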

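Summary

Crawl4AI not only addresses the pain points of traditional crawler tools; its smart, modular design also greatly improves the efficiency and accuracy of data collection. Whether it is handling dynamic content, dealing with anti-crawling mechanisms, or generating Markdown suited to AI processing, Crawl4AI handles it with ease. If crawlers are giving you a headache too, give Crawl4AI a try. I expect you will be pleasantly surprised.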


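This article is reposted from the WeChat official account AI 博物院. Author: longyunfeigu

Original link: https://mp.weixin.qq.com/s/YOBNX7LqwWwd0ZoVgvWWxA

© Copyright belongs to the author. If you repost this article, please credit the source; otherwise legal action may be taken.
Last modified 2025-3-10 08:12:03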