Scrapy framework: generic spiders, deep crawlers, distributed spiders and distributed deep crawlers — source code analysis and usage
Published: 2019-06-20


Scrapy is arguably the most powerful crawling framework in the Python ecosystem. Its strength comes from its high extensibility and low coupling, which let users modify and extend its behaviour with ease.

Four spider templates are commonly used: scrapy.Spider, CrawlSpider, RedisSpider and RedisCrawlSpider (the distributed deep crawler). They provide the internal logic for generic crawling, deep crawling, distributed crawling and distributed deep crawling respectively; the Redis variants come from the scrapy-redis extension rather than from Scrapy itself. Below we study each of them from the source code and from the usage side.

scrapy.Spider

Source code:

"""Base class for Scrapy spidersSee documentation in docs/topics/spiders.rst"""import loggingimport warningsfrom scrapy import signalsfrom scrapy.http import Requestfrom scrapy.utils.trackref import object_reffrom scrapy.utils.url import url_is_from_spiderfrom scrapy.utils.deprecate import create_deprecated_classfrom scrapy.exceptions import ScrapyDeprecationWarningfrom scrapy.utils.deprecate import method_is_overriddenclass Spider(object_ref):    """Base class for scrapy spiders. All spiders must inherit from this    class.    """    name = None    custom_settings = None    def __init__(self, name=None, **kwargs):        if name is not None:            self.name = name        elif not getattr(self, 'name', None):            raise ValueError("%s must have a name" % type(self).__name__)        self.__dict__.update(kwargs)        if not hasattr(self, 'start_urls'):            self.start_urls = []    @property    def logger(self):        logger = logging.getLogger(self.name)        return logging.LoggerAdapter(logger, {
'spider': self}) def log(self, message, level=logging.DEBUG, **kw): """Log the given message at the given log level This helper wraps a log call to the logger within the spider, but you can use it directly (e.g. Spider.logger.info('msg')) or use any other Python logger too. """ self.logger.log(level, message, **kw) @classmethod def from_crawler(cls, crawler, *args, **kwargs): spider = cls(*args, **kwargs) spider._set_crawler(crawler) return spider def set_crawler(self, crawler): warnings.warn("set_crawler is deprecated, instantiate and bound the " "spider to this crawler with from_crawler method " "instead.", category=ScrapyDeprecationWarning, stacklevel=2) assert not hasattr(self, 'crawler'), "Spider already bounded to a " \ "crawler" self._set_crawler(crawler) def _set_crawler(self, crawler): self.crawler = crawler self.settings = crawler.settings crawler.signals.connect(self.close, signals.spider_closed) def start_requests(self): cls = self.__class__ if method_is_overridden(cls, Spider, 'make_requests_from_url'): warnings.warn( "Spider.make_requests_from_url method is deprecated; it " "won't be called in future Scrapy releases. Please " "override Spider.start_requests method instead (see %s.%s)." % ( cls.__module__, cls.__name__ ), ) for url in self.start_urls: yield self.make_requests_from_url(url) else: for url in self.start_urls: yield Request(url, dont_filter=True) def make_requests_from_url(self, url): """ This method is deprecated. """ return Request(url, dont_filter=True) def parse(self, response): raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)) @classmethod def update_settings(cls, settings): settings.setdict(cls.custom_settings or {}, priority='spider') @classmethod def handles_request(cls, request): return url_is_from_spider(request.url, cls) @staticmethod def close(spider, reason): closed = getattr(spider, 'closed', None) if callable(closed): return closed(reason) def __str__(self): return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self)) __repr__ = __str__BaseSpider = create_deprecated_class('BaseSpider', Spider)class ObsoleteClass(object): def __init__(self, message): self.message = message def __getattr__(self, name): raise AttributeError(self.message)spiders = ObsoleteClass( '"from scrapy.spider import spiders" no longer works - use ' '"from scrapy.spiderloader import SpiderLoader" and instantiate ' 'it with your project settings"')# Top-level importsfrom scrapy.spiders.crawl import CrawlSpider, Rulefrom scrapy.spiders.feed import XMLFeedSpider, CSVFeedSpiderfrom scrapy.spiders.sitemap import SitemapSpider

The members worth paying attention to are name (the spider's name), start_urls (the list of start URLs to crawl), allowed_domains (the domains that crawled URLs are restricted to) and start_requests() (the method that kicks off crawling).

name, start_urls and allowed_domains are attributes; they are generated when the project is created and usually only need minor adjustment. start_requests() produces the initial requests: by default it iterates over start_urls and yields a Request for each URL. When a site requires logging in before crawling, you can override this method in your spider; the idea is simple, and a minimal sketch follows.
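A minimal sketch of overriding start_requests() for a site that needs a login first. The spider name, URLs and form fields here are made-up placeholders for illustration, not part of the original article.

import scrapy


class LoginSpider(scrapy.Spider):
    # hypothetical spider: names, URLs and form fields are placeholders
    name = 'login_demo'
    allowed_domains = ['example.com']

    def start_requests(self):
        # instead of iterating start_urls, begin with a login POST
        yield scrapy.FormRequest(
            url='http://example.com/login',
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # once the session cookie is set, crawl the pages that need auth
        yield scrapy.Request('http://example.com/profile', callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}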

CrawlSpider

The deep crawler: based on its link-extraction rules it automatically extracts the links on a page that match the rules, requests and parses them, extracts more links from those pages, and so on, going deeper and deeper.

Source code:

"""This modules implements the CrawlSpider which is the recommended spider to usefor scraping typical web sites that requires crawling pages.See documentation in docs/topics/spiders.rst"""import copyimport sixfrom scrapy.http import Request, HtmlResponsefrom scrapy.utils.spider import iterate_spider_outputfrom scrapy.spiders import Spiderdef identity(x):    return xclass Rule(object):    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):        self.link_extractor = link_extractor        self.callback = callback        self.cb_kwargs = cb_kwargs or {}        self.process_links = process_links        self.process_request = process_request        if follow is None:            self.follow = False if callback else True        else:            self.follow = followclass CrawlSpider(Spider):    rules = ()    def __init__(self, *a, **kw):        super(CrawlSpider, self).__init__(*a, **kw)        self._compile_rules()    def parse(self, response):        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)    def parse_start_url(self, response):        return []    def process_results(self, response, results):        return results    def _build_request(self, rule, link):        r = Request(url=link.url, callback=self._response_downloaded)        r.meta.update(rule=rule, link_text=link.text)        return r    def _requests_to_follow(self, response):        if not isinstance(response, HtmlResponse):            return        seen = set()        for n, rule in enumerate(self._rules):            links = [lnk for lnk in rule.link_extractor.extract_links(response)                     if lnk not in seen]            if links and rule.process_links:                links = rule.process_links(links)            for link in links:                seen.add(link)                r = self._build_request(n, link)                yield rule.process_request(r)    def _response_downloaded(self, response):        rule = self._rules[response.meta['rule']]        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)    def _parse_response(self, response, callback, cb_kwargs, follow=True):        if callback:            cb_res = callback(response, **cb_kwargs) or ()            cb_res = self.process_results(response, cb_res)            for requests_or_item in iterate_spider_output(cb_res):                yield requests_or_item        if follow and self._follow_links:            for request_or_item in self._requests_to_follow(response):                yield request_or_item    def _compile_rules(self):        def get_method(method):            if callable(method):                return method            elif isinstance(method, six.string_types):                return getattr(self, method, None)        self._rules = [copy.copy(r) for r in self.rules]        for rule in self._rules:            rule.callback = get_method(rule.callback)            rule.process_links = get_method(rule.process_links)            rule.process_request = get_method(rule.process_request)    @classmethod    def from_crawler(cls, crawler, *args, **kwargs):        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)        spider._follow_links = crawler.settings.getbool(            'CRAWLSPIDER_FOLLOW_LINKS', True)        return spider    def set_crawler(self, crawler):        super(CrawlSpider, self).set_crawler(crawler)        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', 
True)

CrawlSpider inherits from Spider, so the common attributes and methods are still available, and it adds a rules attribute (the collection of link-extraction rules). The important difference is that CrawlSpider implements the parse() method internally for its own logic, so you must not use that name for your own callback:

def parse(self, response):
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

It also provides one overridable method:

  • parse_start_url(response)
  • This method is called when the responses for the start_urls come back. It parses the initial responses and must return an Item object, a Request object, or an iterable containing either of the two (a sketch follows this list).
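A minimal sketch of overriding parse_start_url() in a CrawlSpider, assuming a hypothetical category-listing site; the spider name, domain and XPath-free selectors are placeholders. The start pages are handled here, while the Rule handles the followed links.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CategorySpider(CrawlSpider):
    # hypothetical names and URLs, for illustration only
    name = 'category_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/categories']

    rules = (
        Rule(LinkExtractor(allow=r'/category/\d+'), callback='parse_category'),
    )

    def parse_start_url(self, response):
        # runs only for the responses of start_urls; may yield items or requests
        yield {'page_title': response.css('title::text').get()}

    def parse_category(self, response):
        yield {'url': response.url}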

rules

rules contains one or more Rule objects. Each Rule defines a specific action for crawling the site. If several rules match the same link, the first one, in the order they are defined in this collection, is used.

class scrapy.spiders.Rule(
    link_extractor,
    callback=None,
    cb_kwargs=None,
    follow=None,
    process_links=None,
    process_request=None,
)
  • link_extractor: a Link Extractor object that defines which links to extract (Link Extractors are covered below).
  • callback: for every link extracted by link_extractor, the function named here is used as the callback; it receives a response as its first argument.
    Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic; if you override parse, the crawl spider will break.
  • follow: a boolean that specifies whether links extracted from the response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
  • process_links: the name of a function in the spider that is called with the list of links obtained from link_extractor; it is mainly used to filter links.
  • process_request: the name of a function in the spider that is called for every request this rule produces; it can be used to filter or modify requests (a sketch follows this list).
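A minimal sketch of using process_links and process_request inside a Rule. The filter logic and the names drop_logout_links and add_header are made up for illustration; note that in the CrawlSpider version shown above, process_request receives only the request.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FilteredSpider(CrawlSpider):
    # hypothetical spider: names and URLs are placeholders
    name = 'filtered_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'/articles/'),
             callback='parse_article',
             process_links='drop_logout_links',
             process_request='add_header',
             follow=True),
    )

    def drop_logout_links(self, links):
        # filter the extracted Link objects before requests are built
        return [link for link in links if 'logout' not in link.url]

    def add_header(self, request):
        # tweak every request generated by this rule; returning None drops it
        request.headers['X-Demo'] = '1'
        return request

    def parse_article(self, response):
        yield {'url': response.url}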

LinkExtractors

class scrapy.linkextractors.LinkExtractor

The purpose of Link Extractors is simple: to extract links.

Each LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.

A Link Extractor is instantiated once, and its extract_links method is then called repeatedly, with different responses, to extract links (a sketch follows).
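A minimal sketch of calling a LinkExtractor by hand inside an ordinary spider callback; the allow pattern, domain and spider name are placeholders for illustration.

import scrapy
from scrapy.linkextractors import LinkExtractor


class ManualLinkSpider(scrapy.Spider):
    # hypothetical spider; pattern and URLs are for illustration only
    name = 'manual_links'
    start_urls = ['http://example.com/']

    # instantiate once, reuse for every response
    link_extractor = LinkExtractor(allow=r'/articles/\d+')

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # link is a scrapy.link.Link with .url and .text
            yield scrapy.Request(link.url, callback=self.parse_article)

    def parse_article(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}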

class scrapy.linkextractors.LinkExtractor(
    allow=(),
    deny=(),
    allow_domains=(),
    deny_domains=(),
    deny_extensions=None,
    restrict_xpaths=(),
    tags=('a', 'area'),
    attrs=('href',),
    canonicalize=True,
    unique=True,
    process_value=None,
)

Main parameters:

  • allow: URLs that match the given regular expression(s) are extracted; if empty, everything matches.
  • deny: URLs that match this regular expression (or list of regular expressions) are never extracted.
  • allow_domains: the domains whose links will be extracted.
  • deny_domains: the domains whose links will never be extracted.
  • restrict_xpaths: XPath expressions that, together with allow, restrict which regions of the page links are extracted from.

Example

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'Test'
    allowed_domains = ['Test.com']
    start_urls = ['http://Test.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_test', follow=True),
    )

    def parse_test(self, response):
        items = {}
        # ············
        return items

RedisSpider and RedisCrawlSpider

(Figure: scrapy-redis architecture)

Scrapy-redis provides the following four components:

  • Scheduler
  • Duplication Filter
  • Item Pipeline
  • Base Spider

Scheduler

The component in Scrapy directly responsible for the queue of pending requests is the Scheduler. It enqueues new requests (into the Scrapy queue), pops the next request to crawl (out of the Scrapy queue), and so on. It organizes the pending queue as a dictionary keyed by priority, for example:

{
    priority 0 : queue 0
    priority 1 : queue 1
    priority 2 : queue 2
}

The priority of each request then decides which queue it is pushed into, and when dequeuing, queues with smaller priority values are popped first. To manage this priority-keyed dictionary of queues, the Scheduler has to provide a set of methods. The original Scheduler, however, cannot be shared across distributed crawlers, so the scrapy-redis scheduler component is used instead (enabled through settings, as sketched below).
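A minimal settings.py sketch for switching to the scrapy-redis scheduler, assuming a locally running Redis; the Redis address is a placeholder.

# settings.py (sketch) -- swap Scrapy's scheduler for the scrapy-redis one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# keep the Redis queues between runs so crawling can be paused and resumed
SCHEDULER_PERSIST = True

# where the shared queue lives; placeholder address
REDIS_URL = 'redis://127.0.0.1:6379'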

Duplication Filter

Scrapy implements request deduplication with a set: the fingerprint of every request already sent is stored in a set, and the fingerprint of each new request is checked against it. If the fingerprint is already in the set, the request has been sent before; if not, crawling continues. The core of this dedup check looks like this:

def request_seen(self, request):
    # self.fingerprints is the set of fingerprints already seen
    fp = self.request_fingerprint(request)
    # this membership test is the core of the dedup check
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)

In scrapy-redis, deduplication is handled by the Duplication Filter component, which relies on the no-duplicates property of a Redis set. The scrapy-redis scheduler receives requests from the engine, stores each request's fingerprint in a Redis set to check for duplicates, and pushes the non-duplicate requests into the Redis request queue.

When the engine asks for a request (on behalf of the spider), the scheduler pops a request from the Redis request queue according to priority and returns it to the engine, which hands it to the spider for processing (a sketch of such a Redis-backed check follows).
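In a real project you would simply set DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" in settings.py. The class below is only an illustrative sketch of what a Redis-set-based request_seen boils down to, not the library's exact code; the key name and Redis address are placeholders, and it assumes the scrapy.utils.request.request_fingerprint helper available in the Scrapy versions this post is based on.

import redis
from scrapy.utils.request import request_fingerprint


class RedisDupeFilterSketch(object):
    """Illustrative sketch of a Redis-set-based request_seen (not the library code)."""

    def __init__(self, key='demo:dupefilter', url='redis://127.0.0.1:6379'):
        self.server = redis.StrictRedis.from_url(url)
        self.key = key

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # SADD returns 0 when the fingerprint is already in the set -> duplicate
        added = self.server.sadd(self.key, fp)
        return added == 0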

Item Pipeline

The engine passes the scraped Items (returned by the spider) to the Item Pipeline; the scrapy-redis Item Pipeline stores the scraped Items in the Redis items queue.

With this modified Item Pipeline it is easy to pull items out of the items queue by key, which makes it possible to run a cluster of separate item-processing workers (a sketch follows).
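A minimal sketch: enable the scrapy-redis pipeline in settings, then run a separate worker script that pops serialized items from the Redis list. The key 'test:items' follows scrapy-redis's usual '<spider>:items' pattern and, like the Redis address, is a placeholder.

# settings.py (sketch) -- push every scraped item into Redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# items_worker.py (sketch) -- a separate process that consumes the items queue
import json
import redis

server = redis.StrictRedis.from_url('redis://127.0.0.1:6379')

while True:
    # BLPOP blocks until an item is available on the list
    _key, data = server.blpop('test:items')
    item = json.loads(data)
    print(item)  # replace with real storage, e.g. a database insert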

Base Spider

Scrapy's original Spider class is no longer used directly: the rewritten RedisSpider inherits from both Spider and RedisMixin, where RedisMixin is the class responsible for reading URLs from Redis.

When a spider that inherits from RedisSpider is created, setup_redis() is called; it connects to the Redis database and then registers the signals:

  • One is the signal fired when the spider goes idle: it calls spider_idle, which calls schedule_next_requests to keep the spider alive and then raises a DontCloseSpider exception.
  • The other is the signal fired when an item is scraped: it calls item_scraped, which also calls schedule_next_requests to fetch the next request. (Note that in the version of the source shown below, only the spider_idle signal is actually connected.)
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.spiders import Spider, CrawlSpider

from . import connection, defaults
from .utils import bytes_to_str


class RedisMixin(object):
    """Mixin class to implement reading urls from a redis queue."""
    redis_key = None
    redis_batch_size = None
    redis_encoding = None

    # Redis client placeholder.
    server = None

    def start_requests(self):
        """Returns a batch of start requests from redis."""
        return self.next_requests()

    def setup_redis(self, crawler=None):
        """Setup redis connection and idle signal.

        This should be called after the spider has set its crawler object.
        """
        if self.server is not None:
            return

        if crawler is None:
            # We allow optional crawler argument to keep backwards
            # compatibility.
            # XXX: Raise a deprecation warning.
            crawler = getattr(self, 'crawler', None)

        if crawler is None:
            raise ValueError("crawler is required")

        settings = crawler.settings

        if self.redis_key is None:
            self.redis_key = settings.get(
                'REDIS_START_URLS_KEY', defaults.START_URLS_KEY,
            )

        self.redis_key = self.redis_key % {'name': self.name}

        if not self.redis_key.strip():
            raise ValueError("redis_key must not be empty")

        if self.redis_batch_size is None:
            # TODO: Deprecate this setting (REDIS_START_URLS_BATCH_SIZE).
            self.redis_batch_size = settings.getint(
                'REDIS_START_URLS_BATCH_SIZE',
                settings.getint('CONCURRENT_REQUESTS'),
            )

        try:
            self.redis_batch_size = int(self.redis_batch_size)
        except (TypeError, ValueError):
            raise ValueError("redis_batch_size must be an integer")

        if self.redis_encoding is None:
            self.redis_encoding = settings.get('REDIS_ENCODING', defaults.REDIS_ENCODING)

        self.logger.info("Reading start URLs from redis key '%(redis_key)s' "
                         "(batch size: %(redis_batch_size)s, encoding: %(redis_encoding)s",
                         self.__dict__)

        self.server = connection.from_settings(crawler.settings)
        # The idle signal is called when the spider has no requests left,
        # that's when we will schedule new requests from redis queue
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def next_requests(self):
        """Returns a request to be scheduled or none."""
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET', defaults.START_URLS_AS_SET)
        fetch_one = self.server.spop if use_set else self.server.lpop
        # XXX: Do we need to use a timeout here?
        found = 0
        # TODO: Use redis pipeline execution.
        while found < self.redis_batch_size:
            data = fetch_one(self.redis_key)
            if not data:
                # Queue empty.
                break
            req = self.make_request_from_data(data)
            if req:
                yield req
                found += 1
            else:
                self.logger.debug("Request not made from data: %r", data)

        if found:
            self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.

        """
        url = bytes_to_str(data, self.redis_encoding)
        return self.make_requests_from_url(url)

    def schedule_next_requests(self):
        """Schedules a request if available"""
        # TODO: While there is capacity, schedule a batch of redis requests.
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)

    def spider_idle(self):
        """Schedules a request if available, otherwise waits."""
        # XXX: Handle a sentinel to close the spider.
        self.schedule_next_requests()
        raise DontCloseSpider


class RedisSpider(RedisMixin, Spider):
    """Spider that reads urls from redis queue when idle.

    Attributes
    ----------
    redis_key : str (default: REDIS_START_URLS_KEY)
        Redis key where to fetch start URLs from..
    redis_batch_size : int (default: CONCURRENT_REQUESTS)
        Number of messages to fetch from redis on each attempt.
    redis_encoding : str (default: REDIS_ENCODING)
        Encoding to use when decoding messages from redis queue.

    Settings
    --------
    REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
        Default Redis key where to fetch start URLs from..
    REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
        Default number of messages to fetch from redis on each attempt.
    REDIS_START_URLS_AS_SET : bool (default: False)
        Use SET operations to retrieve messages from the redis queue. If False,
        the messages are retrieve using the LPOP command.
    REDIS_ENCODING : str (default: "utf-8")
        Default encoding to use when decoding messages from redis queue.

    """

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj


class RedisCrawlSpider(RedisMixin, CrawlSpider):
    """Spider that reads urls from redis queue when idle.

    Attributes
    ----------
    redis_key : str (default: REDIS_START_URLS_KEY)
        Redis key where to fetch start URLs from..
    redis_batch_size : int (default: CONCURRENT_REQUESTS)
        Number of messages to fetch from redis on each attempt.
    redis_encoding : str (default: REDIS_ENCODING)
        Encoding to use when decoding messages from redis queue.

    Settings
    --------
    REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
        Default Redis key where to fetch start URLs from..
    REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
        Default number of messages to fetch from redis on each attempt.
    REDIS_START_URLS_AS_SET : bool (default: True)
        Use SET operations to retrieve messages from the redis queue.
    REDIS_ENCODING : str (default: "utf-8")
        Default encoding to use when decoding messages from redis queue.

    """

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisCrawlSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj

The scrapy_redis package provides not only RedisSpider but also RedisCrawlSpider, which combines distributed crawling with deep crawling. The remaining Redis distributed components will be covered one by one in later posts.

The Redis distributed spiders add a redis_key attribute, which names the Redis key the spider reads its start URLs from (a sketch of seeding it follows).
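A minimal sketch of seeding the queue: push one or more start URLs onto the list named by redis_key (here the placeholder 'testspider:start_urls', matching the example below); idle spiders will pick them up. The Redis address is a placeholder.

# seed_start_urls.py (sketch) -- push start URLs for the waiting spiders to consume
import redis

server = redis.StrictRedis.from_url('redis://127.0.0.1:6379')
server.lpush('testspider:start_urls', 'http://www.test.com/')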

Example

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider


class TestSpider(RedisCrawlSpider):
    name = 'test'
    allowed_domains = ['www.test.com']
    redis_key = 'testspider:start_urls'

    rules = [
        # follow the pagination links
        Rule(link_extractor=LinkExtractor(allow=(r'/?page=\d+'))),
        # follow each company's detail page
        Rule(link_extractor=LinkExtractor(allow=(r'/\d+')), callback='parse_item')
    ]

    def parse_item(self, response):
        # ······
        return item
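For the example above to actually run distributed, the project settings have to point the scheduler, the dedup filter and (optionally) the item pipeline at Redis. A minimal sketch, with a placeholder Redis address:

# settings.py (sketch) -- minimal scrapy-redis wiring for the spider above
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

REDIS_URL = 'redis://127.0.0.1:6379'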

Further configuration options are not covered here; later posts will dig deeper into the individual components.

