python beautiful soup库的详细安装教程

发布时间：2021-08-30 16:45:04 来源：亿速云阅读：207 作者：chen 栏目：开发技术

这篇文章主要介绍“python beautiful soup库的详细安装教程”，在日常操作中，相信很多人在python beautiful soup库的详细安装教程问题上存在疑惑，小编查阅了各式资料，整理出简单好用的操作方法，希望对大家解答”python beautiful soup库的详细安装教程”的疑惑有所帮助！接下来，请跟着小编一起来学习吧！

beautiful soup库的安装

pip install beautifulsoup4

beautiful soup库的理解

beautiful soup库是解析、遍历、维护“标签树”的功能库

beautiful soup库的引用

from bs4 import BeautifulSoup
import bs4

BeautifulSoup类

BeautifulSoup对应一个HTML/XML文档的全部内容

回顾demo.html

import requests

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

Tag标签

基本元素	说明
Tag	标签，最基本的信息组织单元，分别用<>和</>标明开头和结尾

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a
print(tag)

<title>This is a python demo page</title>
<a  href="http://www.icourse163.org/course/BIT-268001" >Basic Python</a>

任何存在于HTML语法中的标签都可以用soup.访问获得。当HTML文档中存在多个相同对应内容时，soup.返回第一个

Tag的name

基本元素	说明
Name	标签的名字， … 的名字是'p',格式：.name

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)

a
p   
body

Tag的attrs（属性）

基本元素	说明
Attributes	标签的属性，字典形式组织，格式：.attrs

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>

Tag的NavigableString

基本元素	说明
NavigableString	标签内非属性字符串，<>…</>中字符串，格式：.string

Tag的Comment

基本元素	说明
Comment	标签内字符串的注释部分，一种特殊的Comment类型

import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))

This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>

HTML基本格式

标签树的下行遍历

属性	说明
.contents	子节点的列表，将所有儿子结点存入列表
.children	子节点的迭代类型，与.contents类似，用于循环遍历儿子结点
.descendents	子孙节点的迭代类型，包含所有子孙节点，用于循环遍历

BeautifulSoup类型是标签树的根节点

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])

<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p ><b>The demo python introduces several python courses.</b></p>, '\n', <p >Python 
is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the 
following courses:
<a  href="http://www.icourse163.org/course/BIT-268001" >Basic Python</a> and <a  href="http://www.icourse163.org/course/BIT-1001870001" >Advanced Python</a>.</p>, '\n']
5
<p ><b>The demo python introduces several python courses.</b></p>

for child in soup.body.children:
	print(child)  #遍历儿子结点
for child in soup.body.descendants:
	print(child) #遍历子孙节点

标签树的上行遍历

属性	说明
.parent	节点的父亲标签
.parents	节点先辈标签的迭代类型，用于循环遍历先辈节点

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.title.parent)
print(soup.html.parent)

<head><title>This is a python demo page</title></head>
<html><head><title>This is a python demo page</title></head>
<body>
<p ><b>The demo python introduces several python courses.</b></p>
<p >Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a  href="http://www.icourse163.org/course/BIT-268001" >Basic Python</a> and <a  href="http://www.icourse163.org/course/BIT-1001870001" >Advanced Python</a>.</p>
</body></html>

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

p
body      
html      
[document]

标签的平行遍历

属性	说明
.next_sibling	返回按照HTML文本顺序的下一个平行节点标签
.previous.sibling	返回按照HTML文本顺序的上一个平行节点标签
.next_siblings	迭代类型，返回按照HTML文本顺序的后续所有平行节点标签
.previous.siblings	迭代类型，返回按照HTML文本顺序的前续所有平行节点标签

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)

print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)

print(soup.a.parent)

and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

None
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

for sibling in soup.a.next_sibling:
	print(sibling)  #遍历后续节点
for sibling in soup.a.previous_sibling:
	print(sibling)  #遍历前续节点

python beautiful soup库的详细安装教程

bs库的prettify()方法

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

.prettify()为HTML文本<>及其内容增加更加'\n'
.prettify()可用于标签，方法：.prettify()

bs4库的编码

bs4库将任何HTML输入都变成utf-8编码
python 3.x默认支持编码是utf-8,解析无障碍

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>","html.parser")
print(soup.p.string)

print(soup.p.prettify())

中文

<p>  
 中文
</p>

到此，关于“python beautiful soup库的详细安装教程”的学习就结束了，希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习，快去试试吧！若想继续学习更多相关知识，请继续关注亿速云网站，小编会继续努力为大家带来更多实用的文章！

向AI问一下细节

python beautiful soup库的详细安装教程

目录

beautiful soup库的安装

beautiful soup库的理解

beautiful soup库的引用

BeautifulSoup类

回顾demo.html

Tag标签

Tag的name

Tag的attrs（属性）

Tag的NavigableString

HTML基本格式

标签树的下行遍历

标签树的上行遍历

标签的平行遍历

bs库的prettify()方法

bs4库的编码

猜你喜欢

python beautiful soup库的详细安装教程

目录

beautiful soup库的安装

beautiful soup库的理解

beautiful soup库的引用

BeautifulSoup类

回顾demo.html

Tag标签

Tag的name

Tag的attrs（属性）

Tag的NavigableString

HTML基本格式

标签树的下行遍历

标签树的上行遍历

标签的平行遍历

bs库的prettify()方法

bs4库的编码

猜你喜欢

最新资讯

相关推荐

相关标签