In Scrapy, crawling is single-threaded by default.
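Single-threaded does not mean one request at a time, though: Scrapy runs on the Twisted asynchronous event loop, so a single thread already keeps many requests in flight, governed by settings such as the following (the values shown are Scrapy's defaults):

```python
# settings.py — stock concurrency settings (defaults shown)
CONCURRENT_REQUESTS = 16             # total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
```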
First, install Scrapy and create a new project:

```bash
pip install scrapy
scrapy startproject myproject
cd myproject
```
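For reference, `scrapy startproject` generates a layout like this, which is where the files referenced below live:

```
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```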
In the `myproject/settings.py` file, find the `DOWNLOAD_DELAY` setting and set it to the delay you want between requests (in seconds), so that you do not put excessive load on the target site. For example, to use a one-second delay:

```python
DOWNLOAD_DELAY = 1
```
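One related detail: by default Scrapy randomizes the actual wait to between 0.5× and 1.5× of `DOWNLOAD_DELAY`. If you want the delay to be exact, disable the randomization:

```python
RANDOMIZE_DOWNLOAD_DELAY = False  # use DOWNLOAD_DELAY verbatim
```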
Next, generate a spider:

```bash
scrapy genspider myspider example.com
```
Open the generated `myproject/spiders/myspider.py` file and define your spider in it. For example:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Write your parsing logic here
        pass
```
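As a rough sketch of what the parsing logic could look like (the CSS selectors and field names here are hypothetical; adapt them to the actual page structure):

```python
def parse(self, response):
    # Yield one item per listing block on the page (selector is hypothetical).
    for row in response.css('div.listing'):
        yield {
            'title': row.css('h2::text').get(),
            'link': row.css('a::attr(href)').get(),
        }
    # Follow pagination if the page exposes a "next" link.
    next_page = response.css('a.next::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
```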
To bolt multithreading onto this, open the project's `middlewares.py` file and define a custom middleware in it. For example:

```python
import threading

from scrapy import signals


class MultithreadingMiddleware(object):
    def __init__(self):
        # Lock for protecting any state shared across threads.
        self.lock = threading.Lock()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
In the `myproject/settings.py` file, register the new middleware in the `DOWNLOADER_MIDDLEWARES` setting:

```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MultithreadingMiddleware': 560,
}
```

The integer sets the middleware's position in the chain relative to Scrapy's built-in downloader middlewares.
Then modify the `spider_opened` method in `middlewares.py` so that it starts a new thread to run the crawl task when the spider opens. For example:

```python
import threading

from scrapy import signals


class MultithreadingMiddleware(object):
    def __init__(self):
        self.lock = threading.Lock()
        # Keep references to started threads so they can be tracked or joined.
        self.threads = []

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
        # Spawn a background thread that runs alongside Scrapy's event loop.
        thread = threading.Thread(target=self.start_crawling, args=(spider,))
        thread.start()
        self.threads.append(thread)

    def start_crawling(self, spider):
        # Write your crawl logic here
        pass
```
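What goes into `start_crawling` depends entirely on your use case; Scrapy itself does not drive this thread, so any fetching here happens outside its scheduler and download-delay handling. A minimal sketch, assuming a hypothetical list of extra URLs fetched with the standard library:

```python
import urllib.request

# Hypothetical body for the start_crawling stub above; the URL list and
# what is done with the responses are placeholders, not Scrapy API.
def start_crawling(self, spider):
    extra_urls = ['http://example.com/robots.txt']  # placeholder URLs
    for url in extra_urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
            with self.lock:
                # The lock guards any state shared with other threads.
                spider.logger.info('Fetched %s (%d bytes)', url, len(body))
        except OSError as exc:
            spider.logger.warning('Background fetch of %s failed: %s', url, exc)
```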
Now, when you run the Scrapy spider, it will crawl using multiple threads. Note that Scrapy's `DOWNLOAD_DELAY` setting can affect the performance of a multithreaded crawl: the delay throttles only requests that go through Scrapy's own downloader, not work done in your background threads. In practice, you may need to adjust this setting to match the target site's rate limits and your performance requirements.
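If hand-tuning `DOWNLOAD_DELAY` proves fiddly, Scrapy's built-in AutoThrottle extension can adapt the delay to observed server latency instead. A minimal sketch (values are illustrative, not recommendations):

```python
# settings.py — let AutoThrottle adjust the delay dynamically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1    # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10     # ceiling for the adaptive delay
```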