This article introduces the basic commands of the Scrapy framework, followed by an annotated settings.py from a sample project.
1. Create a crawler project
scrapy startproject [project name]
2. Create a spider file
scrapy genspider [spider name] [domain]
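3. Run a spider (crawl)
scrapy crawl [spider name]
# -o (output) writes the scraped data to a file; the format follows the file extension
scrapy crawl [spider name] -o zufang.json
scrapy crawl [spider name] -o zufang.csv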
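Once a crawl has written zufang.json or zufang.csv, the files can be read back with Python's standard library. The sketch below is only an illustration and is not part of Scrapy: the field names title and price are hypothetical (the article does not show the item definition), and the sample rows are written to a temporary directory so that the snippet runs without a real crawl. It assumes the JSON feed is a single array of objects and the CSV feed starts with a header row, which is how Scrapy's default exporters write them.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_json_feed(path):
    """Load a feed written by `scrapy crawl <spider> -o file.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_csv_feed(path):
    """Load a feed written by `scrapy crawl <spider> -o file.csv`."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical sample items standing in for real crawl output.
sample = [
    {"title": "Room A", "price": "3000"},
    {"title": "Room B", "price": "4500"},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "zufang.json"
    csv_path = Path(tmp) / "zufang.csv"

    # Write the stand-in feeds in the same shape the exporters would use.
    json_path.write_text(json.dumps(sample), encoding="utf-8")
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(sample)

    json_items = load_json_feed(json_path)
    csv_items = load_csv_feed(csv_path)

print(len(json_items), len(csv_items))
```

Note that CSV values always come back as strings, so numeric fields such as the hypothetical price would need converting after loading.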
4. check: run the spiders' contract checks to catch errors
scrapy check
5. list: list all spiders in the project
scrapy list
6. view: download a page and open it in a browser, as Scrapy sees it
scrapy view http://www.baidu.com
7. shell: open the interactive Scrapy shell for a URL
scrapy shell https://www.baidu.com
8. runspider: run a spider from a single Python file, without creating a project
scrapy runspider zufang_spider.py
The settings.py file of the maitian project, with annotations:

# -*- coding: utf-8 -*-

# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maitian'

SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Only a single user-agent string can be set here, not a batch of them
USER_AGENT = 'maitian (+http://www.yourdomain.com)'

# Obey robots.txt rules
# The generated project template sets this to True; it is turned off here
ROBOTSTXT_OBEY = False

# Write the log to a file
LOG_FILE = "maitian.log"

# There are five log levels: 1. DEBUG 2. INFO 3. WARNING 4. ERROR 5. CRITICAL
# The higher the level, the less log output is produced
# LOG_LEVEL = "INFO"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# The value is the wait in seconds between consecutive requests to the same site
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default, i.e. COOKIES_ENABLED defaults to True)
#COOKIES_ENABLED = False

# Disable Telnet Console, used for remote access (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines (pipelines must be enabled here to take effect)
# Order values range from 0 to 1000; the lower the value, the higher the priority
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'maitian.pipelines.MaitianPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
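Two comments in the settings file above state rules that are easy to check by hand: a lower ITEM_PIPELINES value means the pipeline runs earlier, and a higher LOG_LEVEL means less output. The sketch below illustrates both with the standard library only. It does not call Scrapy; the pipeline names other than MaitianPipeline and all of the order values are made up for illustration, and the helper functions are hypothetical. The log-level part relies on Python's logging module, whose five standard level names are the ones listed in the settings comments.

```python
import logging

# Hypothetical pipeline paths and order values, for illustration only.
ITEM_PIPELINES = {
    "maitian.pipelines.SaveToDbPipeline": 800,
    "maitian.pipelines.MaitianPipeline": 300,
    "maitian.pipelines.DedupPipeline": 100,
}


def pipeline_run_order(pipelines):
    """Return pipeline paths sorted by order value, lowest (earliest) first."""
    return sorted(pipelines, key=pipelines.get)


def levels_shown(threshold):
    """Return the standard level names at or above the given threshold."""
    names = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
    cutoff = getattr(logging, threshold)
    return [name for name in names if getattr(logging, name) >= cutoff]


print(pipeline_run_order(ITEM_PIPELINES))
print(levels_shown("INFO"))
```

With the threshold at INFO, DEBUG messages are dropped and the other four levels remain, which is why raising LOG_LEVEL shrinks the log file.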