How to Use requests for Web Scraping in Python

Published: 2022-08-24 11:45:17  Source: 亿速云  Views: 224  Author: iii  Category: Development

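Introduction

HTTP requests are the foundation of any web crawler. Python's requests library is a powerful, easy-to-use HTTP library that simplifies sending requests and handling responses. This article walks through using requests for crawler development, from basic to advanced usage, and closes with practical examples.

About the requests library

requests is one of the most popular HTTP libraries for Python. It is built on urllib3 and offers a simpler, more human-friendly API. With it you can send HTTP requests, process responses, set request headers, use proxies, handle cookies, and more.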

Installing requests

Before using requests, install it with pip:

pip install requests

Once it is installed, import it in your Python code:

import requests

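Basic usage

Sending a GET request

Sending a GET request is the most basic use of requests. Call requests.get() to send the request and receive the server's response.

import requests

response = requests.get('https://www.example.com')
print(response.text)  # print the page content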
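Query strings are usually easier to pass through the params argument than to assemble by hand. The sketch below builds a request without sending it, so the final URL can be inspected offline. The /search path and the parameter names q and page are placeholders, not part of any real site.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
request = requests.Request(
    'GET',
    'https://www.example.com/search',
    params={'q': 'python', 'page': 2},
)

# prepare() encodes the parameters into the URL without touching the network.
prepared = request.prepare()
print(prepared.url)  # https://www.example.com/search?q=python&page=2
```

Passing the same dict as requests.get(url, params=...) sends a request to that URL.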

Sending a POST request

Sending a POST request works much like sending a GET request, except that you call requests.post(). Form data is passed through the data parameter.

import requests

data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://www.example.com/post', data=data)
print(response.text)
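Many APIs expect a JSON body rather than form data. As a rough comparison, the sketch below prepares the same payload with data= and with json= and prints what would be sent; nothing goes over the network. The URL and keys are the placeholders from the example above.

```python
import json
import requests

url = 'https://www.example.com/post'
payload = {'key1': 'value1', 'key2': 'value2'}

# data= produces a form-encoded body.
form_request = requests.Request('POST', url, data=payload).prepare()
print(form_request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_request.body)                     # key1=value1&key2=value2

# json= serializes the dict to JSON and sets the matching Content-Type.
json_request = requests.Request('POST', url, json=payload).prepare()
print(json_request.headers['Content-Type'])  # application/json
print(json.loads(json_request.body))         # the original dict, round-tripped
```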

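Handling the response

The Response object returned by requests provides several attributes and methods for working with the server's reply.

response.status_code: the HTTP status code.
response.text: the response body decoded to a string.
response.json(): the response body parsed from JSON into Python objects.
response.content: the response body as raw bytes.
response.headers: the response headers.

import requests

response = requests.get('https://www.example.com')
print(response.status_code)  # print the status code
print(response.headers)  # print the response headers
print(response.text)  # print the page content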
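One detail worth knowing when reading response.headers: it is a case-insensitive mapping, so header names can be looked up in any casing. The sketch below builds such a mapping by hand to show the behaviour without a network call; the header value is made up for illustration.

```python
from requests.structures import CaseInsensitiveDict

# response.headers is a CaseInsensitiveDict; one is built by hand here
# so the lookup behaviour can be shown without sending a request.
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=utf-8'})

print(headers['content-type'])               # text/html; charset=utf-8
print(headers.get('CONTENT-TYPE'))           # text/html; charset=utf-8
print(headers.get('X-Missing', 'not sent'))  # not sent
```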

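Advanced usage

Setting request headers

Some sites inspect request headers such as User-Agent to block crawlers. You can set request headers through the headers parameter.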

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get('https://www.example.com', headers=headers)
print(response.text)
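When every request should carry the same headers, setting them once on a Session avoids repeating the headers argument. The following is a minimal sketch; the User-Agent string my-crawler/0.1 and the Accept-Language value are hypothetical. The request is only prepared, not sent, so the merged headers can be inspected.

```python
import requests

session = requests.Session()
# Hypothetical crawler name; replace with a string that identifies your client.
session.headers.update({'User-Agent': 'my-crawler/0.1'})

request = requests.Request(
    'GET',
    'https://www.example.com',
    headers={'Accept-Language': 'en-US'},
)
# prepare_request() merges the session defaults with the per-request headers.
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])       # my-crawler/0.1
print(prepared.headers['Accept-Language'])  # en-US
```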

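Using proxies

In some cases you may want to route traffic through a proxy so that your real IP address is hidden. You can configure proxies through the proxies parameter.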

import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get('https://www.example.com', proxies=proxies)
print(response.text)

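Handling cookies

requests parses the cookies a server returns and exposes them as response.cookies. To send your own cookies, pass them through the cookies parameter. To carry cookies across several requests, use a requests.Session.

import requests

cookies = {'key': 'value'}
response = requests.get('https://www.example.com', cookies=cookies)
print(response.cookies)  # print the cookies returned by the server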
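A Session keeps a cookie jar that persists between requests. The sketch below sets a cookie on the jar by hand and reads it back; in real use the server would normally set it through a Set-Cookie header. The cookie name sessionid and its value are hypothetical.

```python
import requests
from requests.utils import dict_from_cookiejar

session = requests.Session()

# Hypothetical cookie; a server would normally set this via Set-Cookie.
session.cookies.set('sessionid', 'abc123', domain='www.example.com', path='/')

print(session.cookies.get('sessionid'))      # abc123
print(dict_from_cookiejar(session.cookies))  # {'sessionid': 'abc123'}
```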

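Handling redirects

By default, requests follows redirects automatically (for every method except HEAD). Use the allow_redirects parameter to control whether redirects are followed.

import requests

response = requests.get('https://www.example.com', allow_redirects=False)
print(response.status_code)  # print the status code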
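When redirects are followed, the intermediate responses are kept in response.history. The helper below, whose name redirect_chain is made up for this sketch, lists each hop so you can see how a URL was resolved.

```python
import requests

def redirect_chain(response):
    """Return (status_code, url) for each hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops

# Usage (needs network access):
# response = requests.get('http://www.example.com', timeout=5)
# for status, url in redirect_chain(response):
#     print(status, url)
```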

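Setting a timeout

requests does not apply a timeout by default, so a request can hang indefinitely if the server never answers. Set the timeout parameter (in seconds) to bound the wait.

import requests

try:
    response = requests.get('https://www.example.com', timeout=5)
    print(response.text)
except requests.Timeout:
    print('Request timed out')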
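The timeout can also be given as a (connect, read) tuple to bound the two phases separately. The helper below is a small sketch of that form; the function name fetch_text and the default values are choices made here, not anything the library prescribes.

```python
import requests

def fetch_text(url, connect_timeout=3.05, read_timeout=10):
    """Return the response body, or None if either timeout is exceeded."""
    try:
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.Timeout:
        # Timeout is the base class of ConnectTimeout and ReadTimeout.
        return None
    return response.text

# Usage (needs network access):
# html = fetch_text('https://www.example.com')
```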

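Uploading files

requests also supports file uploads through the files parameter.

import requests

# Open the file in a with block so the handle is closed after the upload
with open('example.txt', 'rb') as f:
    response = requests.post('https://www.example.com/upload', files={'file': f})
print(response.text)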
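The files parameter also accepts a (filename, content, content_type) tuple, which is handy when the data is already in memory. The sketch below prepares such an upload without sending it, so the multipart body can be inspected; the file name, content, and upload URL are illustrative.

```python
import requests

# (filename, content, content_type): all three values are illustrative.
files = {'file': ('report.txt', b'hello world', 'text/plain')}

prepared = requests.Request(
    'POST', 'https://www.example.com/upload', files=files
).prepare()

# requests builds a multipart/form-data body and adds the boundary itself.
print(prepared.headers['Content-Type'].split(';')[0])  # multipart/form-data
print(b'filename="report.txt"' in prepared.body)       # True
```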

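Error handling

In practice, network requests can fail in many ways, such as connection timeouts or server errors. requests raises exceptions that you can catch to handle these cases.

import requests

try:
    response = requests.get('https://www.example.com', timeout=5)
    response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes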
except requests.exceptions.HTTPError as errh:
    print(f"HTTP Error: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Error Connecting: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout Error: {errt}")
except requests.exceptions.RequestException as err:
    print(f"Something went wrong: {err}")
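To see which status codes raise_for_status() treats as errors, the sketch below calls it on hand-built Response objects, so no request is sent. The helper name raises_http_error is made up for this illustration.

```python
import requests

def raises_http_error(status_code):
    """Report whether raise_for_status() raises for the given status code."""
    response = requests.Response()  # built by hand, no request is sent
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return True
    return False

for code in (200, 204, 302, 404, 503):
    print(code, raises_http_error(code))
# 200 False
# 204 False
# 302 False
# 404 True
# 503 True
```

Only 4xx and 5xx codes raise; other 2xx and 3xx codes pass through silently.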

Practical examples

Fetching a web page

The following simple crawler fetches a web page and saves it to a local file.

import requests

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    with open('example.html', 'w', encoding='utf-8') as f:
        f.write(response.text)
    print('Page content saved to example.html')
else:
    print(f'Request failed, status code: {response.status_code}')
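response.text decodes the body using an encoding that requests infers, and a wrong guess can garble non-ASCII pages when they are re-encoded as UTF-8. One alternative, sketched below, is to write response.content so the file holds the bytes exactly as served. The helper name save_page is an assumption of this sketch.

```python
from pathlib import Path

import requests

def save_page(response, path):
    """Write the raw response bytes to path and return the byte count."""
    response.raise_for_status()  # refuse to save 4xx/5xx error pages
    return Path(path).write_bytes(response.content)

# Usage (needs network access):
# response = requests.get('https://www.example.com', timeout=5)
# save_page(response, 'example.html')
```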

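Simulating a login

Some sites require a login before certain content can be accessed. You can simulate the login flow with requests; a Session keeps the cookies set at login and sends them on later requests.

import requests

login_url = 'https://www.example.com/login'
data = {'username': 'your_username', 'password': 'your_password'}

session = requests.Session()
response = session.post(login_url, data=data)

# A 200 status only means the request succeeded; many sites also return 200
# for a failed login, so check the response body or cookies as well.
if response.status_code == 200:
    print('Login succeeded')
    # Visit a page that requires login
    profile_url = 'https://www.example.com/profile'
    profile_response = session.get(profile_url)
    print(profile_response.text)
else:
    print('Login failed')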

Fetching API data

Many sites expose an API. You can fetch API data with requests and parse the result.

import requests

api_url = 'https://api.example.com/data'
params = {'key': 'your_api_key', 'q': 'search_query'}

response = requests.get(api_url, params=params)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f'Request failed, status code: {response.status_code}')
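An API can return a 200 status with a body that is not JSON, for example an HTML error page, in which case response.json() raises. The helper below is one cautious way to handle that; the name json_or_none is made up for this sketch.

```python
import requests

def json_or_none(response):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # The JSON decode errors raised by response.json() subclass ValueError.
        return None

# Usage (needs network access):
# response = requests.get('https://api.example.com/data', timeout=5)
# data = json_or_none(response)
```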

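Summary

requests is a powerful, easy-to-use HTTP library for Python that simplifies sending requests and handling responses. This article covered its basic and advanced usage, along with examples of applying it to web crawlers, and should help you use requests more effectively in real projects.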
