这篇文章给大家分享的是有关python分布式爬虫中scrapy.spider的使用示例的内容。小编觉得挺实用的,因此分享给大家做个参考。一起跟随小编过来看看吧。
源码:
"""
Base class for Scrapy spiders
See documentation in docs/topics/spiders.rst
"""
import logging
import warnings
from scrapy import signals
from scrapy.http import Request
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import url_is_from_spider
from scrapy.utils.deprecate import create_deprecated_class
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.utils.deprecate import method_is_overridden
class Spider(object_ref):
"""Base class for scrapy spiders. All spiders must inherit from this
class.
"""
name = None
custom_settings = None
def __init__(self, name=None, **kwargs):
if name is not None:
self.name = name
elif not getattr(self, 'name', None):
raise ValueError("%s must have a name" % type(self).__name__)
self.__dict__.update(kwargs)
if not hasattr(self, 'start_urls'):
self.start_urls = []
@property
def logger(self):
logger = logging.getLogger(self.name)
return logging.LoggerAdapter(logger, {'spider': self})
def log(self, message, level=logging.DEBUG, **kw):
"""Log the given message at the given log level
This helper wraps a log call to the logger within the spider, but you
can use it directly (e.g. Spider.logger.info('msg')) or use any other
Python logger too.
"""
self.logger.log(level, message, **kw)
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = cls(*args, **kwargs)
spider._set_crawler(crawler)
return spider
def set_crawler(self, crawler):
warnings.warn("set_crawler is deprecated, instantiate and bound the "
"spider to this crawler with from_crawler method "
"instead.",
category=ScrapyDeprecationWarning, stacklevel=2)
assert not hasattr(self, 'crawler'), "Spider already bounded to a " \
"crawler"
self._set_crawler(crawler)
def _set_crawler(self, crawler):
self.crawler = crawler
self.settings = crawler.settings
crawler.signals.connect(self.close, signals.spider_closed)
def start_requests(self):
cls = self.__class__
if method_is_overridden(cls, Spider, 'make_requests_from_url'):
warnings.warn(
"Spider.make_requests_from_url method is deprecated; it "
"won't be called in future Scrapy releases. Please "
"override Spider.start_requests method instead (see %s.%s)." % (
cls.__module__, cls.__name__
),
)
for url in self.start_urls:
yield self.make_requests_from_url(url)
else:
for url in self.start_urls:
yield Request(url, dont_filter=True)
def make_requests_from_url(self, url):
""" This method is deprecated. """
return Request(url, dont_filter=True)
def parse(self, response):
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
@classmethod
def update_settings(cls, settings):
settings.setdict(cls.custom_settings or {}, priority='spider')
@classmethod
def handles_request(cls, request):
return url_is_from_spider(request.url, cls)
@staticmethod
def close(spider, reason):
closed = getattr(spider, 'closed', None)
if callable(closed):
return closed(reason)
def __str__(self):
return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
__repr__ = __str__
BaseSpider = create_deprecated_class('BaseSpider', Spider)
class ObsoleteClass(object):
def __init__(self, message):
self.message = message
def __getattr__(self, name):
raise AttributeError(self.message)
spiders = ObsoleteClass(
'"from scrapy.spider import spiders" no longer works - use '
'"from scrapy.spiderloader import SpiderLoader" and instantiate '
'it with your project settings"'
)
# Top-level imports
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.spiders.feed import XMLFeedSpider, CSVFeedSpider
from scrapy.spiders.sitemap import SitemapSpider
其中需要关注的是name(爬虫名字)、start_urls(抓取的起始url列表)、allowed_domains(限定抓取的url所在域名)、start_requests(开始抓取的方法)
name、start_urls、allowed_domains是属性,在创建创建项目的时候已经建好了,稍作修改即可。start_requests是起始的抓取方法,一般是默认的遍历start_urls列表生成Request对象,在scrapy中需要登录的时候可以复写该方法,这个比较简单不在赘述。
感谢各位的阅读!关于python分布式爬虫中scrapy.spider的使用示例就分享到这里了,希望以上内容可以对大家有一定的帮助,让大家可以学到更多知识。如果觉得文章不错,可以把它分享出去让更多的人看到吧!
亿速云「云服务器」,即开即用、新一代英特尔至强铂金CPU、三副本存储NVMe SSD云盘,价格低至29元/月。点击查看>>
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。
原文链接:https://www.py.cn/jishu/jichu/21352.html