When making network requests with the requests library in Python, the following methods can improve performance:
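Use connection pooling. Mounting an `HTTPAdapter` on a `Session` lets connections be reused across requests. Its `pool_connections` parameter sets how many per-host connection pools are cached, and `pool_maxsize` sets the maximum number of connections kept in each pool.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry failed requests up to 3 times; cache up to 100 host pools
# with up to 100 connections each.
adapter = HTTPAdapter(max_retries=Retry(total=3), pool_connections=100, pool_maxsize=100)
session.mount('http://', adapter)
session.mount('https://', adapter)
```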
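Use multithreading. `ThreadPoolExecutor` from the `concurrent.futures` module (or `ThreadPool` from `multiprocessing.pool`) can be used to build a multithreaded crawler. Several requests are then in flight at the same time, which improves throughput.

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url):
    response = requests.get(url)
    return response.text


urls = ['http://example.com'] * 10
# Up to 5 requests run concurrently; results come back in the order of urls.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
```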
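`executor.map` re-raises a worker's exception when its result is consumed, so a single failed request aborts the whole batch. The following is a minimal sketch of a variant that records failures per URL instead. The helper name `fetch_all` and the shape of its return value are assumptions made for illustration; `fetch` can be any callable that takes a URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(fetch, urls, max_workers=5):
    """Run fetch(url) for each URL in a thread pool (hypothetical helper).

    Returns two dicts: results keyed by URL and exceptions keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was submitted for.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one request fails
                errors[url] = exc
    return results, errors
```

Because URLs are used as dictionary keys, duplicate URLs collapse into a single entry; a list of `(url, result)` pairs would be a reasonable alternative if duplicates matter.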
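Use asynchronous I/O. The `asyncio` library together with `aiohttp` can be used to build an asynchronous crawler. While one request is waiting for the server to respond, the event loop runs other tasks, which improves performance.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def main():
    urls = ['http://example.com'] * 10
    # Share one ClientSession so connections are pooled across requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)


results = asyncio.run(main())
```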
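`asyncio.gather` schedules every coroutine at once, which for a long URL list can mean many more simultaneous requests than the target server tolerates. The following is a minimal sketch of capping concurrency with `asyncio.Semaphore`. The helper name `gather_limited` is an assumption made for illustration, and `fetch` stands for any coroutine function that takes a single URL argument.

```python
import asyncio


async def gather_limited(fetch, urls, limit=5):
    """Await fetch(url) for every URL, at most `limit` at a time (hypothetical helper)."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as urls.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

The semaphore only limits how many calls to `fetch` run at the same moment; it does not space them out in time.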
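Cache responses. Storing a response locally and reusing it while it is still fresh (one hour in this example) avoids repeated requests for the same URL.

```python
import json
import time

import requests

url = 'http://example.com'
cache_file = 'cache.json'
CACHE_TTL = 3600  # seconds a cached response stays valid


def load_cache():
    # Return the cached entries, or an empty dict if there is no usable cache file.
    try:
        with open(cache_file, 'r', encoding='utf-8') as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_cache(url, text):
    # Store the response body together with the time it was fetched.
    cache = load_cache()
    cache[url] = {'time': time.time(), 'text': text}
    with open(cache_file, 'w', encoding='utf-8') as f:
        json.dump(cache, f)


def get_response(url):
    entry = load_cache().get(url)
    if entry and time.time() - entry['time'] < CACHE_TTL:
        return entry['text']
    response = requests.get(url)
    save_cache(url, response.text)
    return response.text
```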
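The same expiry idea can be kept in memory when the cache does not need to survive a restart. The following is a minimal sketch under that assumption. The class name `TTLCache` is hypothetical, `fetch` is any callable that takes a URL, and the `clock` parameter exists only so that expiry can be exercised without waiting.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds (hypothetical helper)."""

    def __init__(self, fetch, ttl=3600, clock=time.time):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # url -> (time stored, value)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]  # still fresh: no request is made
        value = self._fetch(url)
        self._entries[url] = (self._clock(), value)
        return value
```

A possible use is `cache = TTLCache(lambda u: requests.get(u).text)` followed by `cache.get(url)` wherever the page body is needed.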
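Limit the request rate. Call `time.sleep()` to add a delay between requests, or use a third-party library such as `ratelimit` for more advanced rate limiting.

```python
import time

import requests

url = 'http://example.com'


def rate_limited_request(url, delay=1):
    response = requests.get(url)
    time.sleep(delay)  # pause before the next request
    return response


for _ in range(10):
    response = rate_limited_request(url)
```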
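Sleeping for a fixed delay after every request waits longer than necessary when the request itself was slow. The following is a minimal sketch of an interval-based limiter that sleeps only for the time still remaining. The class name `RateLimiter` is hypothetical and is not the API of the `ratelimit` package; `clock` and `sleep` are injectable only to make the behavior testable.

```python
import time


class RateLimiter:
    """Allow at most one call per `interval` seconds (hypothetical helper)."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            # Sleep only for the part of the interval that has not elapsed yet.
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self._interval
```

Calling `limiter.wait()` immediately before each `requests.get(url)` spaces the requests at least `interval` seconds apart.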
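These methods can substantially improve the performance of a Python crawler. In practice, choose the optimization strategies that fit your requirements.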