Extracting Data from Web Pages with BeautifulSoup

2023-04-17 00:00:00 Web content extraction

BeautifulSoup can search a web page for specific tags and extract the data they contain. Below is a simple demonstration that extracts the content of every h2 tag on pidancode.com:

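import requests
from bs4 import BeautifulSoup
url = 'https://pidancode.com'
# Fetch the page; give up after 10 seconds and raise on HTTP error statuses
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, 'html.parser')
# Collect every <h2> element and print its text
h2_tags = soup.find_all('h2')
for tag in h2_tags:
    print(tag.text)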

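This code first uses the requests module to send a GET request and fetch the HTML text of pidancode.com. The text is then passed to a BeautifulSoup object, which parses the HTML and returns all h2 tags. The final for loop prints the text content of each h2 tag.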
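The same find_all step can be tried without a network request. The following is a minimal sketch that parses a small inline HTML string; the markup and the names sample_html and headings are made up for illustration and are not taken from pidancode.com.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
sample_html = """
<html><body>
  <h2>First heading</h2>
  <p>Some paragraph text</p>
  <h2>  Second heading </h2>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# get_text(strip=True) removes leading and trailing whitespace from the text
headings = [tag.get_text(strip=True) for tag in soup.find_all('h2')]
print(headings)
```

Collecting the text into a list instead of printing it directly makes the result easier to reuse, for example to write it to a file later.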
To extract content that sits under a specific tag, for example all links in the main content of pidancode.com, you can do the following:

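import requests
from bs4 import BeautifulSoup
url = 'https://pidancode.com'
# Fetch the page; give up after 10 seconds and raise on HTTP error statuses
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Locate the div that holds the main content
content_div = soup.find('div', {'class': 'post-content'})
# soup.find returns None when nothing matches, so check before searching inside it
if content_div is not None:
    for link in content_div.find_all('a'):
        print(link.get('href'))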

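This code is similar to the previous example, with one extra step: it first uses soup.find to locate the div tag that holds the main content (the one whose class attribute is 'post-content'), then searches for all a tags inside that div and prints their href attributes. Scoping the search this way returns only the links in the part of the page you care about. Note that soup.find returns None when no matching element exists, so the result should be checked before calling find_all on it.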
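The scoped search can also be sketched offline. The example below assumes a made-up HTML fragment (sample_html) with a sidebar and a 'post-content' div; base_url is used only to turn relative href values into absolute URLs with urljoin from the standard library, a step the original example does not perform.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://pidancode.com'  # used only to resolve relative links

# Hypothetical markup, used only for illustration
sample_html = """
<div class="sidebar"><a href="/about">About</a></div>
<div class="post-content">
  <a href="/post/1">Post one</a>
  <a href="https://example.com/x">External</a>
  <a>No href here</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
content_div = soup.find('div', class_='post-content')

links = []
if content_div is not None:  # find returns None when nothing matches
    # href=True skips <a> tags that have no href attribute
    for a in content_div.find_all('a', href=True):
        links.append(urljoin(base_url, a['href']))
print(links)
```

The sidebar link is left out because the search starts from content_div rather than from the whole document.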
