Handling HTML Tags with Python BeautifulSoup

2023-04-17 00:00:00 python beautifulsoup tags

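BeautifulSoup is a Python library for parsing HTML and XML documents. It gives access to the tags, attributes, and text of a document. The examples below cover the most common tag operations: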

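1. Getting tags

Use find(), find_all(), or select() to look up tags. find() returns the first matching tag (or None if nothing matches), while find_all() and select() return a list-like collection of every match.

Code example: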

from bs4 import BeautifulSoup

html = '''
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <h1>皮蛋编程</h1>
    <p>Python学习指南</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')      # first <h1> tag, or None if there is none
p = soup.find_all('p')    # every <p> tag, as a list-like ResultSet
print(h1.text)
for i in p:
    print(i.text)

Output:

皮蛋编程
Python学习指南
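
The example above uses find() and find_all() but not select(). As a minimal sketch, the snippet below shows select() and select_one() with CSS selectors, plus the None check that find() needs when a tag may be absent. The markup and the class name "intro" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class name "intro" is an assumption.
html = '''
<body>
    <h1>皮蛋编程</h1>
    <p class="intro">Python学习指南</p>
    <p>pidancode.com</p>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags.
intro_paragraphs = soup.select('p.intro')
# select_one() returns only the first match, or None if nothing matches.
first_heading = soup.select_one('body > h1')
# find() also returns None when no tag matches, so check before using .text.
missing = soup.find('h2')

print([p.text for p in intro_paragraphs])
print(first_heading.text)
print(missing is None)
```

CSS selectors tend to be handy when the lookup depends on classes or nesting, while find() and find_all() are often enough for simple lookups by tag name.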

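2. Getting attributes

Use get() to read an attribute of a tag. It returns None if the tag does not have that attribute.

Code example: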

from bs4 import BeautifulSoup

html = '''
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <a href="https://www.pidancode.com">皮蛋编程</a>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')
print(a.get('href'))  # value of the href attribute, or None if it is missing

Output:

https://www.pidancode.com
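
Besides get(), a tag's attributes can also be read like dictionary entries. The sketch below compares the access styles; the class value "site link" and the lookup of a missing title attribute are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class attribute is added for illustration.
html = '<a href="https://www.pidancode.com" class="site link">皮蛋编程</a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# Dictionary-style access raises KeyError if the attribute is missing.
href = a['href']
# get() accepts a default that is returned when the attribute is missing.
title = a.get('title', '')
# Multi-valued attributes such as class are returned as a list.
classes = a.get('class')
# .attrs holds all attributes of the tag as a dictionary.
all_attrs = a.attrs

print(href)
print(repr(title))
print(classes)
print(all_attrs)
```

get() is usually the safer choice when an attribute may be missing, since dictionary-style access fails with an exception in that case.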

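3. Modifying tags

replace_with() replaces the whole tag, not just its text, with the new content. In the example below, the <h1> element is swapped for a plain text string, so the <h1> tags no longer appear in the output.

Code example: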

from bs4 import BeautifulSoup

html = '''
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <h1>pidancode.com</h1>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')
h1.replace_with('皮蛋编程')  # the <h1> element itself is replaced by this text
print(soup)

Output:

<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    皮蛋编程
</body>
</html>
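
Since replace_with() removes the tag itself, a different approach is needed when only the text should change. As a minimal sketch, assigning to the tag's .string attribute keeps the tag and replaces its contents, and assigning to .name renames the tag. The shortened markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup for illustration.
html = '<body><h1>pidancode.com</h1></body>'
soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

# Assigning to .string replaces the tag's contents but keeps the tag.
h1.string = '皮蛋编程'
# Assigning to .name renames the tag in place.
h1.name = 'h2'

print(soup)
# → <body><h2>皮蛋编程</h2></body>
```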

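4. Removing tags

Use extract() to remove a tag from the document tree. The method returns the removed tag, so it can still be used afterwards.

Code example: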

from bs4 import BeautifulSoup

html = '''
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <h1>pidancode.com</h1>
    <p>Python学习指南</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')
p.extract()  # removes the <p> tag from the tree and returns it
print(soup)

Output:

<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <h1>pidancode.com</h1>
</body>
</html>
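
BeautifulSoup also offers decompose() for removing tags. The sketch below contrasts the two methods: extract() hands back the removed tag, while decompose() destroys the tag and returns nothing. The shortened markup and the <span> element are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the <span> element is added for illustration.
html = '<body><h1>pidancode.com</h1><p>Python学习指南</p><span>ad</span></body>'
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches the tag from the tree and returns it for later use.
removed = soup.find('p').extract()
# decompose() removes the tag and destroys it; it returns None.
soup.find('span').decompose()

print(removed.text)
print(removed.parent is None)
print(soup)
# → <body><h1>pidancode.com</h1></body>
```

extract() fits cases where the removed tag is moved elsewhere or inspected later; decompose() is a reasonable choice when the content is simply discarded.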

In short, Python's BeautifulSoup library makes it straightforward to find, read, modify, and remove the tags, attributes, and text in an HTML document.
