Python has several good libraries for scraping web pages with XPath or CSS selectors. Here are some recommendations:

# lxml: parse HTML and query it directly with XPath
from lxml import etree
html = '''<html>
<body>
<div class="container">
<h1 class="title">Hello, world!</h1>
<p class="content">Some amazing content here.</p>
</div>
</body>
</html>'''
tree = etree.HTML(html)
# xpath() returns a list of matches; [0] takes the first one
title = tree.xpath('//h1[@class="title"]/text()')[0]
print(title)  # Output: Hello, world!
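
The lxml example above takes the first match with `[0]`, which raises `IndexError` when the expression matches nothing. A minimal sketch of a more defensive pattern follows. The helper name `first_text` and the `default` parameter are illustrative assumptions, not part of lxml.

```python
from lxml import etree

html = '''<html><body>
<div class="container">
<h1 class="title">Hello, world!</h1>
<p class="content">Some amazing content here.</p>
</div>
</body></html>'''

tree = etree.HTML(html)


def first_text(node, expr, default=None):
    """Hypothetical helper: first text or attribute result of an XPath query, or a default."""
    results = node.xpath(expr)
    return results[0].strip() if results else default


title = first_text(tree, '//h1[@class="title"]/text()')
missing = first_text(tree, '//h2/text()', default='(no h2)')
css_class = first_text(tree, '//p/@class')  # attribute values come back as strings

print(title)
print(missing)
print(css_class)
```

Returning a default instead of indexing blindly keeps a scraper running when a page is missing an expected element.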

# BeautifulSoup (bs4): has no XPath support; elements are selected with CSS selectors
from bs4 import BeautifulSoup
html = '''<html>
<body>
<div class="container">
<h1 class="title">Hello, world!</h1>
<p class="content">Some amazing content here.</p>
</div>
</body>
</html>'''
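# The 'lxml' parser backend requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
# select_one() returns the first element matching the CSS selector
title = soup.select_one('.title').get_text()
print(title)  # Output: Hello, world!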
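
# Scrapy: a crawling framework whose responses can be queried with XPath
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # .get() returns the first match, or None if nothing matches
        title = response.xpath('//h1/text()').get()
        print(title)  # prints the text of the first <h1> on the fetched page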

# PyQuery: jQuery-style API, used here with a CSS selector
from pyquery import PyQuery as pq
html = '''<html>
<body>
<div class="container">
<h1 class="title">Hello, world!</h1>
<p class="content">Some amazing content here.</p>
</div>
</body>
</html>'''
doc = pq(html)
# Calling the document with a CSS selector returns the matching elements
title = doc('.title').text()
print(title)  # Output: Hello, world!
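
Each library has its own strengths: lxml and Scrapy evaluate XPath expressions directly, while the BeautifulSoup and PyQuery examples select elements with CSS selectors. Choose the one that best fits your scraping task.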
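
For quick experiments without installing anything, the standard library's `xml.etree.ElementTree` accepts a limited subset of XPath. The sketch below assumes well-formed markup, since ElementTree is an XML parser rather than a lenient HTML parser. It is a rough stand-in for the lxml example, not a replacement for it.

```python
import xml.etree.ElementTree as ET

html = '''<html>
<body>
<div class="container">
<h1 class="title">Hello, world!</h1>
<p class="content">Some amazing content here.</p>
</div>
</body>
</html>'''

root = ET.fromstring(html)

# ElementTree's XPath subset has no text() step, so select the element
# and read its .text attribute instead.
h1 = root.find(".//h1[@class='title']")
title = h1.text if h1 is not None else None

# findall() returns every matching element in document order.
classes = [el.get('class') for el in root.findall(".//div/*")]

print(title)
print(classes)
```

Real-world HTML is often not well-formed, which is one reason to reach for lxml or Scrapy for actual scraping work.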