使用 Crawl4AI 抓取搜狐文章教程

2025-07-19

Crawl4AI 介绍

Crawl4AI 是一个开源的异步网络爬虫库，专为 AI 应用设计。它允许开发者轻松抓取网页内容、提取结构化数据，并支持自定义提取策略。Crawl4AI 内置了对 JavaScript 支持的浏览器自动化，适合处理动态网页。官方文档：https://docs.crawl4ai.com/。

Crawl4AI 的核心优势包括：

异步操作：高效处理并发请求。
提取策略：支持 CSS 选择器、JSON schema 等方式提取数据。
浏览器集成：可与 Playwright 等工具结合，处理需要 JavaScript 渲染的页面。
缓存和配置：灵活的缓存模式和运行配置。

如果你是 Python 开发者，有基本的异步编程和网页抓取经验，这篇教程将带你从入门到实践，使用 Crawl4AI 抓取搜狐（Sohu）网站的文章。

环境准备

安装 Crawl4AI

首先，确保你有 Python 3.12+ 环境。然后安装 Crawl4AI：

1	pip install crawl4ai

然后使用 crawl4ai-setup 执行相关依赖的安装，在 Linux 环境中也许需要使用 sudo 来安装一些相关的软件依赖（使用 apt）

1	crawl4ai-setup

其他依赖

asyncio：异步操作。
bs4 (BeautifulSoup)：HTML 解析。
logging：日志记录。
json：数据处理。

基本概念

Crawl4AI 的核心类是 AsyncWebCrawler，用于运行爬取任务。关键配置包括：

CrawlerRunConfig：定义 CSS 选择器、提取策略、缓存模式等。
JsonCssExtractionStrategy：使用 JSON schema 从页面提取结构化数据。

抓取流程通常分为：

获取链接列表。
抓取并提取文章内容。

我们以搜狐为例，演示如何实现。

实践：抓取搜狐文章

步骤 1：获取文章链接

搜狐有多个来源，如新闻（sohu_news）和自媒体（sohu_mp）。我们定义一个通用函数 base_fetch_links。

import logging
from typing import Callable
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, CrawlResult
from dubhe.domain.vo import Link, new_link

logger = logging.getLogger(__name__)

def _filter_link(link: Link) -> bool:
  return link.href.startswith('https://www.sohu.com/a/')

async def base_fetch_links(
  url: str,
  run_config: CrawlerRunConfig,
  cb_filter: Callable[[Link], bool] = _filter_link
) -> list[Link]:
  browser_config = BROWSER_CONFIG.clone(browser_type=CONFIG.browsers.urls_browser)
  async with AsyncWebCrawler(config=browser_config) as crawler:
    result: CrawlResult = await crawler.arun(url=url, config=run_config)  # type: ignore
    if not result.success:
      raise DataError.error(
        f'获取搜狐文章失败, url: {url}',
        detail={'title': '获取链接失败', 'url': url, 'error_message': result.error_message},
      )
    internal_links = result.links.get('internal', [])
    links = [new_link(link, url_clean) for link in internal_links]
    logger.info('所有内部链接数量: %d', len(links))
    links = [link for link in links if cb_filter(link)]
    logger.info('文章链接数量: %d', len(links))

    return links

这里使用了 crawl4ai 自带的 links 功能。links
是一个 Python 字典类型，通常带有 internal 和 external 两个 list。每个条目可能有 href、text、title
等。如果您没有禁用链接提取，则会自动捕获此信息。然后，我们通过判断链接是否以 https://www.sohu.com/a/
开头来判断它是否是文章链接。这样做的好处就是我们只需要判断链接即可，不用管相关页面的布局是否有改变。

步骤 2：提取文章内容

定义 JSON schema 来提取文章元素，如标题、作者、内容、图片等。使用 JsonCssExtractionStrategy。

SCHEMA = {
  'name': 'Sohu-Article',
  'baseSelector': '#article-container',
  'fields': [
    {'name': 'author', 'selector': '.user-info h4', 'type': 'text'},
    {'name': 'title', 'selector': '.main div.text-title h1', 'type': 'text'},
    {'name': 'published_date', 'selector': '.main .article-info #news-time', 'type': 'text'},
    {'name': 'content', 'selector': '.main article', 'type': 'html'},
    {'name': 'original_link', 'selector': '.main .article-info [data-role="original-link"]', 'type': 'text'},
    {
      'name': 'imgs',
      'selector': '.article img',
      'type': 'list',
      'fields': [
        {'name': 'src', 'type': 'attribute', 'attribute': 'src'},
        {'name': 'alt', 'type': 'attribute', 'attribute': 'alt'},
      ],
    },
  ],
}

下面来详细解读 SCHEMA：

name: 'Sohu-Article': 这是一个标识符，表示这个 schema 用于搜狐文章。
baseSelector: '#article-container': 这是基础选择器，使用 CSS 选择器指定页面中文章容器的根元素。所有字段的提取都相对于这个容器进行。
fields: 一个列表，包含多个字段定义。每个字段描述了要提取的数据类型、选择器和提取方式。

fields 列表详解

author:
- name: 'author': 字段名称。
- selector: '.user-info h4': 使用 CSS 选择器定位作者信息，通常是页面中的 h4 标签。
- type: 'text': 表示提取该元素的文本内容。
title:
- name : 'title': 字段名称。
- selector : '.main div.text-title h1': 定位文章标题的 h1 标签。
- type : 'text': 提取标题文本。
published_date:
- name : 'published_date': 字段名称。
- selector : '.main .article-info #news-time': 定位发布日期的元素，通过 ID ‘news-time’。
- type : 'text': 提取日期文本。
content:
- name : 'content': 字段名称。
- selector : '.main article': 定位文章正文内容。
- type : 'html': 提取该元素的完整 HTML 内容，而不是纯文本。
original_link:
- name : 'original_link': 字段名称。
- selector : '.main .article-info [data-role="original-link"]': 使用属性选择器定位原始链接。
- type : 'text': 提取链接文本（可能为 URL 或描述）。
imgs(嵌套列表字段):
- name: 'imgs': 字段名称。
- selector: '.article img': 定位文章中所有 img 标签。
- type: 'list': 表示这是一个列表字段，会提取多个项。
- fields: 一个子列表，定义每个图片项的子字段：
  - src:
    - name: 'src': 子字段名称。
    - type: 'attribute': 表示提取属性值。
    - attribute: 'src': 具体提取 img 标签的 src 属性（图片 URL）。
  - alt:
    - name: 'alt': 子字段名称。
    - type: 'attribute': 提取属性值。
    - attribute: 'alt': 提取 img 标签的 alt 属性（图片描述）。

抓取函数：

async def fetch_sohu_articles(pages: list[OriginalPage], run_config: CrawlerRunConfig):
  # ... (完整代码见 sohu_article.py)
  async with AsyncWebCrawler(config=browser_config) as crawler:
    async for result in await crawler.arun_many(urls=urls, config=run_config, dispatcher=dispatcher):
      # 提取的 `extracted_content` 是个 JSON 数组，通常我们只取第 1 个元素
      item: dict[str, Any] = json.loads(result.extracted_content)[0]

      # 文章正文 HTML 片段
      content = clean_html(item.get('content', ''))
      # 抓取的文章图片
      images = list(Img.new_imgs(item.get('imgs', []), access_schema))
      # 文章发布时间
      published_time = parse_datetime(item['published_date'])
      # 发布作者
      author = item.get('author', '')
      author = author if author else remove_whitespace(item.get('original_link', ''))
      # 下载图片并返回图片在本地存储的路径
      img_paths = await download_imgs(images, access_schema)

      # 更多其它代码略 ....

图片下载代码如下：

async def download_imgs(images: list[Img], access_schema: str) -> list[str]:
  if not images:
    return []
  img_paths = []
  async with AsyncClient() as client:
    for img in images:
      img_url = img.src
      parsed_url = urlparse(img_url)
      save_path = CONFIG.download_path / (parsed_url.hostname or 'unknown') / remove_suffix(parsed_url.path[1:])
      parent_dir = save_path.parent
      if not parent_dir.is_dir():
        parent_dir.mkdir(parents=True, exist_ok=True)
      if not parsed_url.scheme:
        img_url = f'{access_schema}:{img_url}'
      saved_path = await download_img(client, img_url, save_path)
      if saved_path:
        img_paths.append(str(saved_path))
  return img_paths

步骤 3：应用示例

在 main 函数中组合使用：

import asyncio

async def main():
  pages = find_all_pending_pages(site=SiteEnum.SOHU_NEWS)
  await fetch_sohu_articles(pages, CRAWLER_RUN_CONFIG)

if __name__ == '__main__':
  asyncio.run(main())

注意事项

页面类型检测：可以使用BeautifulSoup来解析页面，检查是否为视频或图片页面。
错误处理：捕获提取失败并回退到备用 schema。
并发：利用 arun_many 处理多个 URL。
实践建议：从简单链接开始测试，逐步扩展到并发抓取。监控日志，避免 IP 封禁。

总结

通过本教程，您将能够快速掌握使用 Crawl4AI 抓取搜狐文章的核心技巧。如果您希望探索更多高级功能，请参阅官方文档以获取深入指导！比如：基于 LLM 的抓取策略

programming