When writing scrapers in Python, it is important to be able to cope with the anti-scraping defenses that servers deploy. Here are some common anti-scraping strategies and how to counter them:
Strategy: the server inspects the User-Agent field in the HTTP request headers to identify and block crawlers.
Countermeasure: send a browser-like User-Agent with the request.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get('http://example.com', headers=headers)
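Many sites check more than just the User-Agent. As a minimal sketch (the header values are illustrative, not required by any particular site), you can send a fuller browser-like header set and reuse a requests.Session so that cookies persist across requests:
import requests
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'http://example.com/',
})
response = session.get('http://example.com')  # cookies set by the server carry over to later requests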
Strategy: the server limits how many requests a single IP address may send in a given period.
Countermeasure: route requests through a proxy so they appear to come from a different address:
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080'
}
response = requests.get('http://example.com', proxies=proxies)
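A single proxy just moves the rate limit to the proxy's IP, so rotating through a pool works better. A minimal sketch, assuming a hypothetical list of proxy addresses you would replace with your own:
import itertools
import requests
# Hypothetical proxy pool; substitute proxies you actually control.
proxy_pool = itertools.cycle([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])
for url in ['http://example.com'] * 6:
    proxy = next(proxy_pool)  # next proxy in round-robin order
    response = requests.get(url, proxies={'http': proxy, 'https': proxy})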
You can also randomize the User-Agent on every request with the fake_useragent library:
from fake_useragent import UserAgent
ua = UserAgent()
for _ in range(10):
    headers = {'User-Agent': ua.random}  # pick a fresh random User-Agent each iteration
    response = requests.get('http://example.com', headers=headers)
Strategy: the server throttles clients whose requests arrive too quickly.
Countermeasure: pause between requests:
import time
urls = ['http://example.com'] * 5  # example URL list
for url in urls:
    response = requests.get(url)
    time.sleep(1)  # wait 1 second between requests
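A fixed one-second interval is itself easy to fingerprint. A small variation, sketched here with an arbitrary 1-3 second range, is to randomize the delay:
import random
import time
import requests
urls = ['http://example.com'] * 5
for url in urls:
    response = requests.get(url)
    time.sleep(random.uniform(1, 3))  # random pause between 1 and 3 seconds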
When some parallelism is allowed, a thread pool keeps the number of in-flight requests capped at a fixed worker count:
from concurrent.futures import ThreadPoolExecutor
def fetch(url):
    response = requests.get(url)
    return response.text
urls = ['http://example.com'] * 10
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))  # at most 5 requests run at once
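If the server does throttle you, it typically answers with HTTP 429. A sketch of retrying with exponential backoff (the retry count and base delay are arbitrary choices, not values any server mandates):
import time
import requests
def get_with_backoff(url, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:  # not rate-limited, done
            return response
        time.sleep(delay)  # back off before the next attempt
        delay *= 2         # double the wait each time
    return response  # last response, still throttled after all retries
response = get_with_backoff('http://example.com')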
Strategy: the server generates content dynamically (typically with JavaScript), so a plain HTTP request does not return the full page.
Countermeasure: drive a real browser that executes the JavaScript:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://example.com')
content = driver.page_source  # HTML after the JavaScript has run
driver.quit()
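Dynamically loaded elements may not exist the moment the page returns. A sketch using Selenium's explicit waits; the CSS selector '#content' is a placeholder for whatever element you actually need:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('http://example.com')
# Block for up to 10 seconds until the target element appears in the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#content'))
)
content = driver.page_source
driver.quit()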
For many pages at once, pyppeteer can drive headless Chromium asynchronously:
import asyncio
from pyppeteer import launch
async def fetch(url):
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()  # rendered HTML
    await browser.close()
    return content
async def main():
    urls = ['http://example.com'] * 10
    return await asyncio.gather(*[fetch(url) for url in urls])
results = asyncio.run(main())
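Ten simultaneous Chromium instances are heavy. An asyncio.Semaphore (the limit of 3 is an arbitrary choice) caps how many browsers run at once; this sketch reuses the fetch coroutine defined above:
async def main():
    semaphore = asyncio.Semaphore(3)  # at most 3 browsers in flight
    async def fetch_limited(url):
        async with semaphore:  # extra tasks wait here for a free slot
            return await fetch(url)
    urls = ['http://example.com'] * 10
    return await asyncio.gather(*[fetch_limited(url) for url in urls])
results = asyncio.run(main())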
Strategy: the server requires the user to solve a CAPTCHA to block automated clients.
Countermeasure: simple image CAPTCHAs can sometimes be read with OCR:
import pytesseract
from PIL import Image
image = Image.open('captcha.png')
text = pytesseract.image_to_string(image)  # OCR the captcha text
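OCR accuracy on noisy captchas usually improves with simple preprocessing. A sketch using Pillow; the threshold of 128 is a guess you would tune for the captcha style at hand:
import pytesseract
from PIL import Image
image = Image.open('captcha.png').convert('L')        # convert to grayscale
image = image.point(lambda p: 255 if p > 128 else 0)  # binarize: white above threshold, black below
text = pytesseract.image_to_string(image)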
Harder CAPTCHAs are usually delegated to a solving service; the endpoint below is a placeholder, not a real API:
import requests
def solve_captcha(image_path):
    with open(image_path, 'rb') as f:  # upload the captcha image to the solving service
        response = requests.post('https://api.example.com/solve_captcha', files={'file': f})
    return response.text
captcha_text = solve_captcha('captcha.png')
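Tying the pieces together, a hedged sketch of a full round trip: download the captcha image, solve it, and submit it with a form. Every URL and form field name here is hypothetical:
import requests
session = requests.Session()
img = session.get('http://example.com/captcha.png')  # hypothetical captcha image URL
with open('captcha.png', 'wb') as f:
    f.write(img.content)
captcha_text = solve_captcha('captcha.png')  # or OCR it, as above
response = session.post('http://example.com/login',  # hypothetical form endpoint
                        data={'username': 'user', 'password': 'pass', 'captcha': captcha_text})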
With these techniques you can cope with the most common anti-scraping measures and make your crawler noticeably more stable and efficient.