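To write a web crawler in Python, you need a few basic concepts: HTTP requests, HTML parsing, and data extraction. Below is a simple crawler example that uses the requests and BeautifulSoup libraries. First, install both libraries: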
pip install requests
pip install beautifulsoup4
Next, create a Python file named web_crawler.py and put the following code in it:

import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch a page and return its HTML, or None if the request fails."""
    try:
        # A timeout keeps the script from hanging on an unresponsive server.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error while fetching URL: {e}")
        return None


def parse_html(html):
    """Parse raw HTML into a BeautifulSoup tree."""
    return BeautifulSoup(html, 'html.parser')


def extract_data(soup, tag, attribute=None, value=None):
    """Return attribute values if a filter is given, otherwise element text."""
    data = []
    # Filter only when both an attribute name and a value were supplied.
    # Unpacking {None: None} as keyword arguments would raise a TypeError.
    attrs = {attribute: value} if attribute and value else {}
    for element in soup.find_all(tag, attrs=attrs):
        if attrs:
            data.append(element[attribute])
        else:
            data.append(element.text)
    return data


def main():
    url = input("Enter the URL of the website you want to crawl: ")
    html = get_html(url)
    if html:
        soup = parse_html(html)
        tag = input("Enter the HTML tag you want to extract data from (e.g., 'div', 'a'): ")
        attribute = input("Enter the attribute you want to filter by (leave blank if not needed): ")
        value = input("Enter the attribute value you want to filter by (leave blank if not needed): ")
        data = extract_data(soup, tag, attribute, value)
        print(f"\nExtracted {len(data)} items from {tag} tags:")
        for item in data:
            print(item)


if __name__ == "__main__":
    main()
Then run the web_crawler.py file:
python web_crawler.py
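To see what the tag, attribute, and value inputs do without touching the network, the extraction step can be tried on an inline HTML string. The snippet below is only a sketch: SAMPLE_HTML is made-up markup, and extract_data is a compact variant that builds the attribute filter only when one is given.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = """
<html><body>
  <h1>Demo page</h1>
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
  <a href="https://example.com/">External</a>
</body></html>
"""


def extract_data(soup, tag, attribute=None, value=None):
    """Return attribute values when a filter is given, else element text."""
    if attribute and value:
        elements = soup.find_all(tag, attrs={attribute: value})
        return [element[attribute] for element in elements]
    return [element.text for element in soup.find_all(tag)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No filter: the text of every <a> element.
link_texts = extract_data(soup, "a")
print(link_texts)  # ['Home', 'About', 'External']

# With a filter: only <a> elements whose href is exactly "/home".
home_links = extract_data(soup, "a", "href", "/home")
print(home_links)  # ['/home']
```

Leaving the attribute and value blank in the interactive script corresponds to the first call; filling in both corresponds to the second.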
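This simple crawler fetches the HTML from the given URL, parses it, extracts the requested data, and prints it to the console. You can adapt the example to more complex crawling projects.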
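One possible direction for a more complex project is to follow links from page to page. The sketch below is an assumption about how that might look, not part of the script above: crawl, LinkCollector, FAKE_SITE, and max_pages are hypothetical names, the crawl is limited to the starting host, and links are collected with the standard library's html.parser so the block runs without extra dependencies (soup.find_all('a', href=True) would serve the same purpose with BeautifulSoup).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-host pages.

    fetch(url) must return the page HTML as a string, or None on failure.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, to avoid loops
    visited = []         # URLs fetched successfully, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": '<a href="https://other.example/">out</a>',
}

pages = crawl("https://example.com/", FAKE_SITE.get)
print(pages)
```

For real pages, a function such as get_html from the script could be passed as fetch. A production crawler would also need to respect robots.txt and pause between requests, which this sketch leaves out.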