Introduction:
I have recently been learning to write web crawlers in Python; these are my study notes on lxml, a data-parsing module.
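Introduction to lxml, XPath, and parsers:
lxml is a Python parsing library that supports both HTML and XML, supports XPath queries, and parses very efficiently. XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents.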
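As a minimal sketch of the point above, assuming lxml is installed, the same XPath interface can be used on both an HTML fragment and an XML document. The sample markup and variable names here are made up for illustration and do not come from the notes.

```python
from lxml import etree

# Hypothetical sample inputs, not taken from the article.
html_snippet = '<div><a href="/a">first</a><a href="/b">second</a></div>'
xml_snippet = '<books><book id="1"><title>XPath</title></book></books>'

# etree.HTML uses the lenient HTML parser; the fragment ends up inside
# an <html><body> wrapper, and the <html> element is returned.
html_root = etree.HTML(html_snippet)
hrefs = html_root.xpath('//a/@href')

# etree.XML uses the strict XML parser and returns the document's root element.
xml_root = etree.XML(xml_snippet)
titles = xml_root.xpath('/books/book[@id="1"]/title/text()')

print(hrefs)
print(titles)
```

Both calls return an element, so `.xpath()` is available in the same way regardless of which parser produced the tree.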
Node relationships in XML/HTML documents:
    Parent
    Children
    Sibling
    Ancestor
    Descendant

XPath syntax:
    Expression  Description
    nodename    Selects child nodes of the current node that have this name
    //          Selects matching nodes anywhere in the document, regardless of position
    /           Selects from the root node
    .           Selects the current node
    ..          Selects the parent of the current node
    @           Selects attributes

Parser comparison:
    Parser         Speed    Difficulty
    re             Fastest  Hard
    BeautifulSoup  Slow     Very easy
    lxml           Fast     Easy

Study notes:

# -*- coding: utf-8 -*-
from lxml import etree

# class="story" is restored on the paragraph that holds the links,
# so that the XPath query below has something to match.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

selector = etree.HTML(html_doc)  # parse the HTML into an element tree
links = selector.xpath('//p[@class="story"]/a/@href')  # extract every link URL in the story paragraph
for link in links:
    print(link)

xml_test = """<?xml version='1.0'?>
<?xml-stylesheet type="text/css" href="first.css"?>
<notebook>
    <user id="1" category='cb' class="dba python linux">
        <name>lizibin</name>
        <sex>m</sex>
        <address>sjz</address>
        <age>28</age>
        <concat>
            <email>konigerwin@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="2" category='za'>
        <name>wsq</name>
        <sex>f</sex>
        <address>shanghai</address>
        <age>25</age>
        <concat>
            <email>konigerwiner@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="3" category='za'>
        <name>liqian</name>
        <sex>f</sex>
        <address>SH</address>
        <age>28</age>
        <concat>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="4" category='cb'>
        <name>qiangli</name>
        <sex>f</sex>
        <address>SH</address>
        <age>29</age>
        <concat>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="5" class="dba linux c java python test teacher">
        <name>buzhidao</name>
        <sex>f</sex>
        <address>SH</address>
        <age>999</age>
        <concat>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
</notebook>
"""

# An XML file on a remote server can also be requested and parsed:
# r = requests.get('http://xxx.com/abc.xml')
# etree.HTML(r.text.encode('utf-8'))
xml_code = etree.HTML(xml_test)  # build an etree object

# Select every name node (prints the element objects, i.e. their addresses)
print(xml_code.xpath('//name'))
# Select the text of every name node (the data)
print(xml_code.xpath('//name/text()'))
print('')

# Select everything under the notebook node
notebook = xml_code.xpath('//notebook')
# Text of the first name node
print(notebook[0].xpath('.//name/text()')[0])
addres = notebook[0].xpath('.//name')[0]
# Text of the address node that is a sibling of that first name node
print(addres.xpath('../address/text()'))
# Select an attribute value (the sample data has no lang attribute, so this prints an empty list)
print(addres.xpath('../address/@lang'))

# Name of the first user under notebook
print(xml_code.xpath('//notebook/user[1]/name/text()'))
# Name of the last user under notebook
print(xml_code.xpath('//notebook/user[last()]/name/text()'))
# Name of the second-to-last user under notebook
print(xml_code.xpath('//notebook/user[last()-1]/name/text()'))
# Addresses of the first two users under notebook
print(xml_code.xpath('//notebook/user[position()<3]/address/text()'))
# Names of all users whose category is "cb"
print(xml_code.xpath('//notebook/user[@category="cb"]/name/text()'))
# Names of all users younger than 30
print(xml_code.xpath('//notebook/user[age<30]/name/text()'))
# class attribute of every user whose class attribute contains "dba"
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/@class'))
# Names of those same users
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/name/text()'))
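The node relationships listed in these notes (parent, children, sibling, ancestor, descendant) correspond to XPath axes, which the examples above do not use directly. The following is a small sketch of how they might be queried with lxml; the two-user document and all variable names are hypothetical and only loosely modelled on the notebook sample.

```python
from lxml import etree

# Hypothetical sample document, much smaller than the notebook example.
doc = etree.XML(
    '<notebook>'
    '<user id="1"><name>a</name><age>28</age><address>sjz</address></user>'
    '<user id="2"><name>b</name><age>25</age><address>SH</address></user>'
    '</notebook>'
)

# Start from the <age> element of the first user.
age = doc.xpath('//user[@id="1"]/age')[0]

# Parent: the element that directly contains <age>.
parent = age.xpath('parent::*')[0].tag

# Ancestors: every element above <age>, up to the root.
ancestors = [e.tag for e in age.xpath('ancestor::*')]

# Siblings: elements sharing the same parent, after and before <age>.
siblings_after = [e.tag for e in age.xpath('following-sibling::*')]
siblings_before = [e.tag for e in age.xpath('preceding-sibling::*')]

# Children: direct child elements of <notebook>.
children = [e.tag for e in doc.xpath('/notebook/child::*')]

# Descendants: everything below the second <user>, at any depth.
descendants = [e.tag for e in doc.xpath('/notebook/user[2]/descendant::*')]

print(parent, ancestors, siblings_before, siblings_after, children, descendants)
```

The shorthand forms used in the notes are abbreviations of some of these axes: `..` is `parent::node()`, and a plain name such as `name` is `child::name`.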