Parsing Web Page Data: Basic Usage of BeautifulSoup

2023-04-17 00:00:00 Web page parsing, usage

BeautifulSoup is a widely used Python library for parsing HTML and XML. It makes it easy to pull specific pieces of data out of a web page.

Installation: run pip install beautifulsoup4.

Usage:

Import the BeautifulSoup library

from bs4 import BeautifulSoup

Provide the HTML text

html_doc = """
<html>
<head>
    <title>pidancode</title>
</head>
<body>
    <h1>皮蛋编程</h1>
    <p>Python和AI技术分享</p>
    <a href="http://www.pidancode.com">pidancode</a>
</body>
</html>
"""

Create a BeautifulSoup object

soup = BeautifulSoup(html_doc, 'html.parser')

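Here the second argument specifies the parser. html.parser ships with Python's standard library; lxml and html5lib are other common choices, but each has to be installed separately.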
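
Because lxml and html5lib are optional, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea. The helper name make_soup and the preference order are assumptions made for illustration, not part of the BeautifulSoup API. It relies on BeautifulSoup raising bs4.FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Hypothetical helper: try each parser in order, return the first that works."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from malformed HTML, so it is worth fixing the parser explicitly once a script goes beyond experimentation.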

Extracting data by tag name

# Get the text of the title tag
title = soup.title.string
print(title)

# Get the value of the a tag's href attribute
link = soup.a['href']
print(link)

Output:

pidancode
http://www.pidancode.com
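
Attribute-style access such as soup.title or soup.a returns only the first matching tag. When a page contains several tags with the same name, find_all returns all of them. A small sketch, using a made-up HTML snippet with two links (the example.com URL is a placeholder, not from the original page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links, for illustration only.
html = """
<p>Links:</p>
<a href="http://www.pidancode.com">pidancode</a>
<a href="http://example.com/about">about</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every <a> tag, not just the first one.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)
```

Note that a.get("href") returns None when the attribute is missing, whereas a["href"] raises KeyError, so get is the safer choice when looping over tags of unknown shape.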

Extracting data by tag attribute

# Get the first a tag that has an href attribute
a_tag = soup.find('a', href=True)
print(a_tag)

Output:

<a href="http://www.pidancode.com">pidancode</a>
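
Besides True, an attribute filter can also be an exact value or a compiled regular expression. The sketch below shows the three variants side by side. The HTML snippet and its class name are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, for illustration only.
html = """
<a href="http://www.pidancode.com" class="external">pidancode</a>
<a href="/local/page">local</a>
<a name="anchor-only">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on an attribute value.
exact = soup.find("a", href="/local/page")

# Regular-expression match: hrefs that start with http:// or https://.
absolute = soup.find_all("a", href=re.compile(r"^https?://"))

# class is a Python keyword, so the keyword argument is spelled class_.
external = soup.find("a", class_="external")

print(exact.get_text(), len(absolute), external["href"])
```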

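Extracting data by tag text

# Get the p tag whose text is exactly 'Python和AI技术分享'
# (string= is the current name of the older text= argument)
p_tag = soup.find('p', string='Python和AI技术分享')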
print(p_tag)

Output:

<p>Python和AI技术分享</p>
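
A plain string filter matches only when it equals the tag's entire text. To find a tag whose text merely contains a word, a compiled regular expression can be passed instead. A minimal sketch, assuming a bs4 version recent enough to accept the string argument (4.4.0 or later); the second p tag is added only for contrast:

```python
import re
from bs4 import BeautifulSoup

html = "<p>Python和AI技术分享</p><p>Other text</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's whole text, so this finds nothing.
print(soup.find("p", string="Python"))

# A compiled pattern is matched with search(), so a substring is enough.
p_tag = soup.find("p", string=re.compile("Python"))
print(p_tag)
```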

Complete code example:

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>pidancode</title>
</head>
<body>
    <h1>皮蛋编程</h1>
    <p>Python和AI技术分享</p>
    <a href="http://www.pidancode.com">pidancode</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Get the text of the title tag
title = soup.title.string
print(title)

# Get the value of the a tag's href attribute
link = soup.a['href']
print(link)

# Get the first a tag that has an href attribute
a_tag = soup.find('a', href=True)
print(a_tag)

# Get the p tag whose text is exactly 'Python和AI技术分享'
# (string= is the current name of the older text= argument)
p_tag = soup.find('p', string='Python和AI技术分享')
print(p_tag)

Output:

pidancode
http://www.pidancode.com
<a href="http://www.pidancode.com">pidancode</a>
<p>Python和AI技术分享</p>
