Data Cleaning and Preprocessing with BeautifulSoup

2023-07-30 16:01:18 Data preprocessing, cleaning

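BeautifulSoup is a powerful Python library for parsing HTML and XML. It makes it easy to pull text, links, and other content out of markup, which is often the first step in data cleaning and preprocessing.

The examples below show how to use BeautifulSoup for some typical cleaning and preprocessing tasks: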

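Extracting the text content of an HTML document

from bs4 import BeautifulSoup

html = """<html><head><title>pidancode.com</title></head><body><p>皮蛋编程</p></body></html>"""

# Parse the markup with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')
# get_text() joins text nodes with no separator by default, so pass one explicitly
text = soup.get_text(separator=' ')
print(text)

Output:

pidancode.com 皮蛋编程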
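Real pages usually contain indentation and line breaks between tags, which end up in the extracted text. The following is a minimal sketch, assuming an indented variant of the sample page above (the extra whitespace is made up for illustration), that uses `stripped_strings` to collect the text fragments with surrounding whitespace removed:

```python
from bs4 import BeautifulSoup

# Indented variant of the sample page; the whitespace is an assumption
# added here to imitate markup as it typically arrives from a real site.
html = """
<html>
  <head><title> pidancode.com </title></head>
  <body>
    <p>  皮蛋编程  </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with leading/trailing whitespace
# removed and skips nodes that contain only whitespace.
fragments = list(soup.stripped_strings)
print(fragments)

# Join the fragments with a single space to get one clean line of text.
text = " ".join(fragments)
print(text)
```

Keeping the fragments as a list can be handy when each piece of text should become its own record or column later in the pipeline.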

Extracting links from an HTML document

from bs4 import BeautifulSoup

html = """<html><head><title>pidancode.com</title></head><body><p>皮蛋编程</p><a href="https://www.pidancode.com">pidancode</a></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# find_all('a') returns every <a> tag in the document
links = soup.find_all('a')
for link in links:
    # get('href') returns None when the attribute is missing
    print(link.get('href'))

Output:

https://www.pidancode.com
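When cleaning scraped links, it often helps to skip anchors that have no `href`, resolve relative paths, and drop duplicates. A possible sketch follows; the base URL, the `/python/` path, and the extra anchors are hypothetical values chosen only for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL of the page the markup came from (hypothetical).
BASE_URL = "https://www.pidancode.com/"

# Sample markup; every anchor except the first is made up for this example.
html = """<body>
<a href="https://www.pidancode.com">pidancode</a>
<a href="/python/">Python</a>
<a name="top">anchor without href</a>
<a href="https://www.pidancode.com">duplicate</a>
</body>"""

soup = BeautifulSoup(html, "html.parser")

seen = set()
links = []
# href=True keeps only <a> tags that actually carry an href attribute.
for a in soup.find_all("a", href=True):
    # Resolve relative paths against the base URL; absolute URLs pass through.
    url = urljoin(BASE_URL, a["href"])
    if url not in seen:
        seen.add(url)
        links.append(url)

for url in links:
    print(url)
```

The list preserves the order in which links first appear, which is usually what you want when the position on the page carries meaning.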

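Replacing a piece of text in the HTML

from bs4 import BeautifulSoup

html = """<html><head><title>pidancode.com</title></head><body><p>皮蛋编程</p></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# Replace inside each text node. Calling html.replace() with the result of
# get_text() would change nothing, because the extracted text is not a
# substring of the original markup.
for node in soup.find_all(string=True):
    if '皮蛋' in node:
        node.replace_with(node.replace('皮蛋', '番茄'))
new_html = str(soup)
print(new_html)

Output:

<html><head><title>pidancode.com</title></head><body><p>番茄编程</p></body></html>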
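Sometimes the replacement should apply only to part of the document. The sketch below assumes a variant of the sample page in which the same word also appears in the title (that variant is made up for illustration) and rewrites text only inside `<p>` tags, leaving the title untouched:

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the sample page: the word appears in both
# the <title> and a <p>, but only the paragraphs should change.
html = (
    "<html><head><title>皮蛋编程</title></head>"
    "<body><p>皮蛋编程</p><p>其他内容</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all("p"):
    # find_all(string=True) returns the text nodes nested inside this tag.
    for node in p.find_all(string=True):
        if "皮蛋" in node:
            # replace_with swaps the text node in the tree for the new string.
            node.replace_with(node.replace("皮蛋", "番茄"))

print(soup)
```

Editing the parse tree rather than the raw string means tag names and attributes that happen to contain the same characters are never touched.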

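Stripping the tags from an HTML document

from bs4 import BeautifulSoup

html = """<html><head><title>pidancode.com</title></head><body><p>皮蛋编程</p></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# get_text() drops all tags and keeps only the text; the separator
# puts a space between the text of adjacent elements
text = soup.get_text(separator=' ')
print(text)

Output:

pidancode.com 皮蛋编程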
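Pages in the wild usually carry `<script>` and `<style>` blocks whose contents are not meaningful text. Depending on the BeautifulSoup version, `get_text()` may or may not include them, so removing them explicitly is the safer route. A minimal sketch, assuming a variant of the sample page with such blocks added (their contents are made up for illustration):

```python
from bs4 import BeautifulSoup

# Sample page extended with hypothetical <style> and <script> blocks.
html = (
    "<html><head><title>pidancode.com</title>"
    "<style>p { color: red; }</style></head>"
    "<body><p>皮蛋编程</p>"
    '<script>console.log("tracking");</script></body></html>'
)

soup = BeautifulSoup(html, "html.parser")

# decompose() removes a tag and everything inside it from the tree.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# With the non-content tags gone, extract the remaining text.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The same loop can be extended with other tag names, such as `noscript`, if a particular site needs it.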

The examples above cover some common ways to use BeautifulSoup for data cleaning and preprocessing. In practice, you can combine these methods as your own requirements dictate.
