A Qualified Data Analyst Shares a Few Things About Python Web Crawlers (Scrapy Automatic Crawler)

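Continued from the previous installment, "A Qualified Data Analyst Shares a Few Things About Python Web Crawlers (Comprehensive Hands-On Case Studies)".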


V. Comprehensive Hands-On Case Studies

3. Crawling with the Scrapy Framework

(1) Getting to Know Scrapy

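Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows (note: the image comes from the Internet):

For details on how to use Scrapy, please refer to its official documentation.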

(2) Automatic Crawling with Scrapy

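In the earlier case studies we always built URLs in a loop and crawled them one by one. There is another way to do it: set an initial URL, collect the new links found on the current page, and keep crawling from those links until the crawled pages contain no new links.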
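The link-following idea described above can be sketched without any network access. The snippet below is only an illustration under stated assumptions: `FAKE_SITE`, `crawl` and `get_links` are hypothetical names, and the dictionary stands in for real pages. Scrapy's CrawlSpider handles scheduling and duplicate filtering itself, so this is not how Scrapy is implemented.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "/": ["/article/1", "/article/2"],
    "/article/1": ["/article/2", "/article/3"],
    "/article/2": ["/"],
    "/article/3": [],
}


def crawl(start_url, get_links):
    """Visit start_url, then every newly discovered link, each page once."""
    seen = {start_url}
    queue = deque([start_url])
    visited_order = []
    while queue:
        url = queue.popleft()
        visited_order.append(url)
        for link in get_links(url):
            if link not in seen:  # only follow links we have not met before
                seen.add(link)
                queue.append(link)
    return visited_order


pages = crawl("/", lambda url: FAKE_SITE.get(url, []))
print(pages)  # ['/', '/article/1', '/article/2', '/article/3']
```

The crawl stops on its own once the queue is empty, that is, once no page yields a link that has not been seen yet.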

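(a) Requirement

Use an automatic crawler to collect article links and content from Qiushibaike (糗事百科), and store the opening part of each article together with its link in a MySQL database.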

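(b) Analysis

A. How do we extract article links from the home page?

Open the home page, view its source, and search for the text of any article on the page. You will find a link such as "/article/118123230"; following it shows that this is exactly the article content we want. The automatic crawler should therefore be configured to follow only links that contain "article".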
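To make the filtering rule concrete, here is a small standard-library sketch that collects the links in a page and keeps only those containing "article/". The HTML fragment and the `HrefCollector` class are hypothetical; in the actual project this job is done by Scrapy's `LinkExtractor`, whose `allow` argument takes a regular expression.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical fragment of the home page source.
SAMPLE_HTML = """
<a href="/article/118123230">story</a>
<a href="/users/123/">author</a>
<a href="/hot/">hot</a>
"""

collector = HrefCollector()
collector.feed(SAMPLE_HTML)

# Keep only links whose URL contains "article/".
article_links = [h for h in collector.hrefs if re.search(r"article/", h)]
print(article_links)  # ['/article/118123230']
```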

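B. How do we extract the article content and link from the detail page?

Content

Open a detail page and inspect the article content, which looks like this:

The analysis shows that the div tag whose class attribute has the value content identifies the article content. The expression is: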

"//div[@class='content']/text()"

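Link

Open any detail page, copy its URL, view the page source, and search for that URL. The result looks like this:

The following XPath expression extracts the article link.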

"//link[@rel='canonical']/@href"
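The two expressions can be tried on a small document before running the spider. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for Scrapy's selectors. `SAMPLE_PAGE` is a hypothetical, well-formed page; real HTML is rarely valid XML, so inside the spider `response.xpath(...)` is the tool to use.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a detail page.
SAMPLE_PAGE = """
<html>
  <head>
    <link rel="canonical" href="http://www.qiushibaike.com/article/118123230" />
  </head>
  <body>
    <div class="author">someone</div>
    <div class="content">sample article text</div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Roughly corresponds to //div[@class='content']/text()
content = [div.text for div in root.findall(".//div[@class='content']")]

# Roughly corresponds to //link[@rel='canonical']/@href
link = [el.get("href") for el in root.findall(".//link[@rel='canonical']")]

print(content)  # ['sample article text']
print(link)     # ['http://www.qiushibaike.com/article/118123230']
```

Both results are lists, which matches what `.extract()` returns in Scrapy and explains why the pipeline later reads element `[0]`.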

(3) Project Source Code

A. Create the crawler project

Open CMD, change to the directory where you keep crawler projects, and enter:

scrapy startproject qsbkauto

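B. Project structure

spiders/qsbkspd.py: the spider file
items.py: project items, the containers for the data to be extracted, such as the title and comment count of a Dangdang product
pipelines.py: project pipelines, mainly used for post-processing the data, such as writing it to Excel or a database
settings.py: project settings; for example, pipelines are disabled by default and the robots protocol is obeyed by default
scrapy.cfg: project configuration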

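C. Create the spider

Change into the crawler project you just created and enter the following (the last argument is the domain):

scrapy genspider -t crawl qsbkspd qiushibaike.com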

D. Define the items

import scrapy


class QsbkautoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Field names must match the keys the spider assigns ("link", "content")
    link = scrapy.Field()     # article link
    content = scrapy.Field()  # article content

E. Write the spider

qsbkspd.py

# -*- coding: utf-8 -*-
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from qsbkauto.items import QsbkautoItem


class QsbkspdSpider(CrawlSpider):
    name = 'qsbkspd'
    allowed_domains = ['qiushibaike.com']
    #start_urls = ['http://qiushibaike.com/']

    # Follow every link whose URL matches "article/" and parse it with parse_item
    rules = (
        Rule(LinkExtractor(allow=r'article/'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Send a browser-like User-Agent with the first request
        i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"}
        yield Request('http://www.qiushibaike.com/', headers=i_headers)

    def parse_item(self, response):
        i = QsbkautoItem()
        # Text nodes of the article body
        i["content"] = response.xpath("//div[@class='content']/text()").extract()
        # Canonical URL of the article
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i

pipelines.py

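import time

import MySQLdb


class QsbkautoPipeline(object):
    def exeSQL(self, sql, params=None):
        '''
        Connect to the MySQL database and execute one SQL statement.
        @sql: the SQL statement, with %s placeholders for values
        @params: the values bound to the placeholders
        '''
        con = MySQLdb.connect(
            host='localhost',  # database host
            user='root',       # user name
            passwd='xxxx',     # password
            db='spdRet',       # database name
            charset='utf8',
            local_infile=1
            )
        try:
            cur = con.cursor()
            cur.execute(sql, params)
            con.commit()
        finally:
            con.close()

    def process_item(self, item, spider):
        links = item.get('link') or []
        contents = item.get('content') or []
        # Skip items whose link or content is missing or empty
        if links and contents and len(links[0]) and len(contents[0]):
            link_url = links[0]
            # Keep the first 10 characters of the article, prefixed with today's date
            curr_date = time.strftime('%Y-%m-%d')
            content_header = curr_date + '__' + contents[0][0:10]
            try:
                # Let the driver escape the values instead of concatenating strings
                sql = "insert into qiushi(content,link) values(%s,%s)"
                self.exeSQL(sql, (content_header, link_url))
            except Exception as er:
                print("Insert failed with the following error:")
                print(er)
        return item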
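The storage step can be tried without a MySQL server. The sketch below is only a stand-in under stated assumptions: it uses an in-memory SQLite database instead of MySQL, the two-column layout of the `qiushi` table is a guess (the article does not show the real schema), and `build_header` is a hypothetical helper that mirrors how the pipeline builds the date-prefixed header. Article text may contain quotes, so binding values through placeholders is the safer way to build the INSERT.

```python
import sqlite3
import time


def build_header(content, today=None):
    """Prefix the first 10 characters of content with a date, as the pipeline does."""
    today = today or time.strftime("%Y-%m-%d")
    return today + "__" + content[0:10]


# In-memory SQLite database used only as a stand-in for MySQL.
con = sqlite3.connect(":memory:")
# Hypothetical table layout; the article does not show the real one.
con.execute("CREATE TABLE qiushi (content TEXT, link TEXT)")

header = build_header("It's a sample article body", today="2017-01-01")
link_url = "http://www.qiushibaike.com/article/118123230"

# sqlite3 uses ? placeholders; MySQLdb uses %s for the same purpose.
con.execute("INSERT INTO qiushi (content, link) VALUES (?, ?)", (header, link_url))
con.commit()

rows = con.execute("SELECT content, link FROM qiushi").fetchall()
print(rows)
con.close()
```

The apostrophe in the sample text is stored unchanged, whereas the same value pasted into a quoted SQL string would end the string early.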

settings.py

Disable ROBOTSTXT_OBEY
Set USER_AGENT
Enable ITEM_PIPELINES
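As a rough illustration, the three changes might look like the fragment below. The User-Agent string is the one the spider already sends; the pipeline path assumes the project and class names used in this article, and 300 is simply the priority value that appears in Scrapy's generated template.

```python
# settings.py (fragment)

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Browser-like User-Agent; the same string the spider sends in start_requests.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"
)

# Enable the pipeline; lower numbers run earlier when several are listed.
ITEM_PIPELINES = {
    "qsbkauto.pipelines.QsbkautoPipeline": 300,
}
```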

F. Run the spider

scrapy crawl qsbkspd --nolog

G. Results

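[This is an original article by 岂安科技, a 51CTO column partner. To repost it, please contact the original author through the WeChat official account (bigsec).]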
