This article walks through installing the Python Beautiful Soup library and the basics of using it: the BeautifulSoup class, the Tag element and its parts, traversing the tag tree, pretty-printing with prettify(), and encoding.
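Installing Beautiful Soup
Understanding Beautiful Soup
Importing Beautiful Soup
The BeautifulSoup class
Revisiting demo.html
Tag
Tag attrs (attributes)
Tag NavigableString
Basic HTML structure
Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling traversal of the tag tree
The bs4 prettify() method
Encoding in bs4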
pip install beautifulsoup4
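A quick way to confirm that the install worked is to import the package and print its version. This is a minimal sketch; it assumes only that `pip install beautifulsoup4` completed in the same Python environment you are running.

```python
import bs4

# bs4 exposes its version as a plain string; an ImportError here would
# mean the package is not installed in this environment.
version = bs4.__version__
print("Beautiful Soup version:", version)
```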
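Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document.
    from bs4 import BeautifulSoup
    import bs4
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.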
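To make the "whole document" idea concrete, the sketch below parses a small inline HTML string rather than a downloaded page. The HTML snippet is a made-up illustration, and `html.parser` is the parser bundled with Python's standard library.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical document used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# The BeautifulSoup object wraps the entire parsed document.
soup = BeautifulSoup(html, "html.parser")

root_name = soup.name              # name of the document root object
title_text = soup.title.string     # text inside <title>
print(root_name)
print(title_text)
```

The root object reports the special name `[document]`, which is why that name appears last when walking up the tree from any tag.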
    import requests
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    print(demo)
<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p> </body></html>
Basic element | Description |
---|---|
Tag | A tag, the most basic unit of information organization; its start and end are marked with <> and </> |
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.title)
    tag = soup.a
    print(tag)
Output:
    <title>This is a python demo page</title>
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
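Any tag that exists in HTML syntax can be accessed as soup.<tag>. When the document contains several tags with the same name, soup.<tag> returns only the first one.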
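The "first match only" behaviour is easy to see on a small example. The sketch below uses a made-up HTML string with two links and contrasts `soup.a` with `find_all("a")`, which returns every match. It also shows that a tag absent from the document comes back as `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <a> tags and no <table>.
html = '<p><a id="link1" href="/a">First</a> and <a id="link2" href="/b">Second</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # only the first <a> in document order
all_links = soup.find_all("a")   # every <a> in the document
missing = soup.table             # no such tag, so this is None

first_id = first["id"]
all_ids = [a["id"] for a in all_links]
print(first_id)
print(all_ids)
print(missing)
```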
Basic element | Description |
---|---|
Name | The tag's name; for example, the name of <p>…</p> is 'p'. Format: <tag>.name |
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.a.name)
    print(soup.a.parent.name)
    print(soup.a.parent.parent.name)
Output:
    a
    p
    body
Basic element | Description |
---|---|
Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    tag = soup.a
    print(tag.attrs)
    print(tag.attrs['class'])
    print(tag.attrs['href'])
    print(type(tag.attrs))
    print(type(tag))
Output:
    {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
    ['py1']
    http://www.icourse163.org/course/BIT-268001
    <class 'dict'>
    <class 'bs4.element.Tag'>
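Because `.attrs` is a dictionary, indexing a missing key raises `KeyError`. A hedged sketch of safer access is shown below, using `Tag.get()` and `Tag.has_attr()` on a made-up link. The attribute values are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute and no title.
html = '<a href="/course" class="py1 featured" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]                 # direct lookup, like tag.attrs["href"]
classes = tag["class"]             # class is multi-valued, so this is a list
title = tag.get("title")           # missing attribute: None, not KeyError
title_or_default = tag.get("title", "n/a")
has_id = tag.has_attr("id")

print(href, classes, title, title_or_default, has_id)
```

This also explains why `tag.attrs['class']` prints a list in the output above: `class` may hold several space-separated values.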
Tag NavigableString
Basic element | Description |
---|---|
NavigableString | The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string |
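One detail worth seeing in code: `.string` only returns a value when the tag has a single child to take the string from. For a tag with mixed content it returns `None`, and `get_text()` is the usual way to collect all the text. The snippet is a minimal sketch on a made-up paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical paragraph mixing plain text and a nested <b> tag.
soup = BeautifulSoup("<p>Learn <b>Python</b> today</p>", "html.parser")

bold_string = soup.b.string        # single string child
para_string = soup.p.string        # several children, so None
para_text = soup.p.get_text()      # concatenates all nested text

print(repr(bold_string), isinstance(bold_string, NavigableString))
print(para_string)
print(para_text)
```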
Tag Comment
Basic element | Description |
---|---|
Comment | The comment portion of a string inside a tag; a special Comment type |
    from bs4 import BeautifulSoup
    newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
    print(newsoup.b.string)
    print(type(newsoup.b.string))
    print(newsoup.p.string)
    print(type(newsoup.p.string))
Output:
    This is a comment
    <class 'bs4.element.Comment'>
    This is not a comment
    <class 'bs4.element.NavigableString'>
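Since `.string` prints a comment without its `<!-- -->` markers, the printed text alone does not tell you whether it came from a comment. A minimal sketch of telling the two apart by type is shown below. Note that `Comment` is a subclass of `NavigableString`, so the `Comment` check has to come first.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

# Classify each tag's string as a comment or ordinary text.
results = []
for tag in (soup.b, soup.p):
    text = tag.string
    kind = "comment" if isinstance(text, Comment) else "text"
    results.append((tag.name, kind))

print(results)
```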
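Attribute | Description |
---|---|
.contents | A list of child nodes; all direct children of the tag are stored in the list |
.children | An iterator over child nodes; similar to .contents, used to loop over direct children |
.descendants | An iterator over descendant nodes; covers all descendants, used for looping |
The BeautifulSoup object is the root node of the tag tree.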
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.head)
    print(soup.head.contents)
    print(soup.body.contents)
    print(len(soup.body.contents))
    print(soup.body.contents[1])
Output:
    <head><title>This is a python demo page</title></head>
    [<title>This is a python demo page</title>]
    ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>, '\n']
    5
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    for child in soup.body.children:
        print(child)  # iterate over direct children
    for child in soup.body.descendants:
        print(child)  # iterate over all descendants
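Child and descendant iterators yield text nodes (including bare `'\n'` strings) as well as tags, which is why `len(soup.body.contents)` is 5 rather than 2 above. A hedged sketch of keeping only the tags, using an `isinstance` check against `bs4.element.Tag` on a made-up body, looks like this:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body: two paragraphs separated by newline text nodes.
html = '<body>\n<p class="title">One</p>\n<p class="course">Two <a>link</a></p>\n</body>'
soup = BeautifulSoup(html, "html.parser")

total_children = len(soup.body.contents)   # counts the '\n' strings too

# Keep only Tag nodes, skipping NavigableString nodes.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(total_children)
print(child_tags)
print(descendant_tags)
```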
Attribute | Description |
---|---|
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags, used to loop over ancestors |
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.title.parent)
    print(soup.html.parent)
Output:
    <head><title>This is a python demo page</title></head>
    <html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)
Output:
    p
    body
    html
    [document]
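The upward walk can also be collected into a list in one step. The sketch below uses a made-up nested document and shows that the chain ends at the BeautifulSoup object itself, whose own `.parent` is `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with one link nested three levels deep.
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk from <a> up to the document root, recording each ancestor's name.
ancestors = [parent.name for parent in soup.a.parents]
root_parent = soup.parent          # the root has no parent

print(ancestors)
print(root_parent)
```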
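Attribute | Description |
---|---|
.next_sibling | Returns the next sibling node in HTML document order |
.previous_sibling | Returns the previous sibling node in HTML document order |
.next_siblings | An iterator over all following sibling nodes in HTML document order |
.previous_siblings | An iterator over all preceding sibling nodes in HTML document order |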
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.a.next_sibling)
    print(soup.a.next_sibling.next_sibling)
    print(soup.a.previous_sibling)
    print(soup.a.previous_sibling.previous_sibling)
    print(soup.a.parent)
Output:
     and 
    <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    None
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
    for sibling in soup.a.next_siblings:
        print(sibling)  # iterate over following siblings
    for sibling in soup.a.previous_siblings:
        print(sibling)  # iterate over preceding siblings
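As the output above suggests, siblings are not always tags: the text between two tags is a sibling node too. A minimal sketch on a made-up paragraph is shown below. It also uses `find_next_sibling("a")`, which skips text nodes and returns the next sibling that is an `<a>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with text before, between, and after two links.
html = '<p>Intro: <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

# Text nodes and tags both appear when iterating siblings.
after = [str(s) for s in soup.a.next_siblings]
before = [str(s) for s in soup.a.previous_siblings]

# Jump straight to the next sibling that is an <a> tag.
next_link_id = soup.a.find_next_sibling("a")["id"]

print(after)
print(before)
print(next_link_id)
```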
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("http://python123.io/ws/demo.html")
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    print(soup.prettify())
Output:
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>
.prettify() adds a '\n' line break after each HTML tag and its content.
.prettify() can also be called on a single tag: <tag>.prettify()
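Calling `prettify()` on a single tag can be sketched as follows, using a made-up link. The comparison is done line by line with surrounding whitespace stripped, so it does not depend on the exact indentation width.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document.
soup = BeautifulSoup('<p><a href="/x">Basic Python</a></p>', "html.parser")

# prettify() on a Tag formats only that tag and its contents.
pretty = soup.a.prettify()
lines = [line.strip() for line in pretty.splitlines()]

print(pretty)
```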
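bs4 converts any HTML input to Unicode and writes output as UTF-8 by default.
Python 3.x uses UTF-8 as its default encoding, so parsing non-ASCII text works without trouble.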
    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<p>中文</p>", "html.parser")
    print(soup.p.string)
    print(soup.p.prettify())
Output:
    中文
    <p>
     中文
    </p>
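The example above passes a Python `str`, which is already Unicode. The sketch below covers the case where the input is raw bytes in another encoding: it feeds GBK-encoded bytes to the parser and then encodes a tag back to UTF-8. The choice of GBK is an assumption made for illustration, and `from_encoding` is passed explicitly so the result does not depend on automatic detection.

```python
from bs4 import BeautifulSoup

# Simulate a page delivered as GBK-encoded bytes (illustrative assumption).
raw = "<p>中文</p>".encode("gbk")

# Tell the parser the source encoding; the tree itself holds Unicode strings.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
text = soup.p.string

# Encoding a tag produces bytes; UTF-8 is requested explicitly here.
utf8_bytes = soup.p.encode("utf-8")

print(text)
print(utf8_bytes)
```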