Python BeautifulSoup: Tips for Batch-Scraping Websites

2023-04-17 00:00:00 tips, scraping, batch

Installing BeautifulSoup

Install the package with pip:

pip install beautifulsoup4

Requesting a Web Page

Use the requests library to send a request to the target site and get back its HTML content:

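import requests

url = "http://www.pidancode.com"
# A timeout keeps the request from hanging indefinitely
res = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
res.raise_for_status()
html_content = res.text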
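
Requests can also fail outright, for example on a timeout or a dropped connection. The following is a minimal sketch of a more defensive fetch. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are all illustrative assumptions and are not part of the original example.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    `fetch_html` is a hypothetical helper; the timeout and User-Agent
    values below are illustrative assumptions.
    """
    # Use the caller's session if one is given, else the requests module itself
    http = session or requests
    headers = {"User-Agent": "example-scraper/0.1"}
    try:
        res = http.get(url, headers=headers, timeout=timeout)
        res.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException:
        # Covers timeouts, connection errors, and HTTP error statuses
        return None
    return res.text
```

Returning None on failure lets the caller decide whether to skip the page or retry it.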

Parsing the HTML

Parse the HTML with BeautifulSoup so that the data you need can be extracted:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")
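
To see what the parsed object offers without making a network request, here is a small sketch that parses a hand-written HTML string with the same `html.parser` backend. The sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup, used here instead of a downloaded page
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body><h1>Hello</h1><p>First paragraph.</p></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the soup; each returns the first match
page_title = soup.title.string       # text inside <title>
first_paragraph = soup.p.get_text()  # text of the first <p>

print(page_title)
print(first_paragraph)
```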

Finding Target Tags

Use the find or find_all methods to locate the tags you want:

# Find the first h1 tag
h1 = soup.find("h1")
# Find all a tags
a_list = soup.find_all("a")
# Find all div tags whose class is "my-class"
div_list = soup.find_all("div", {"class": "my-class"})
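
The same lookups can be written in a few other ways. The sketch below, run against invented sample markup, shows the `class_` keyword argument, CSS selectors via `select`, and the `limit` argument of `find_all`.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration
sample_html = """
<div class="my-class">one</div>
<div class="other">two</div>
<div class="my-class extra">three</div>
<a href="/a">A</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ keyword: equivalent to passing {"class": "my-class"}
by_keyword = soup.find_all("div", class_="my-class")

# CSS selector syntax via select()
by_selector = soup.select("div.my-class")

# limit caps the number of results returned
first_only = soup.find_all("div", limit=1)

print([d.get_text() for d in by_keyword])
```

A class filter matches any tag that carries that class among its classes, so the third div (with `my-class extra`) is included in both results.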

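Extracting Data

Read a tag's text through its text attribute, and read its HTML attributes with dictionary-style access:

# Get the text inside the h1 tag
h1_text = h1.text
# Get the href attribute of the first a tag
a_href = a_list[0]["href"]
# Get the text inside the first div tag whose class is "my-class"
div_text = div_list[0].text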
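
These lookups assume that every tag and attribute exists. On real pages, `find` returns None when nothing matches, and indexing a missing attribute raises KeyError. A minimal defensive sketch, using invented sample markup:

```python
from bs4 import BeautifulSoup

# Invented sample markup: no h1, and one link without an href
sample_html = '<div><a href="/first">First</a><a>No link</a></div>'
soup = BeautifulSoup(sample_html, "html.parser")

h1 = soup.find("h1")
# find() returns None when nothing matches, so guard before reading text
h1_text = h1.get_text(strip=True) if h1 is not None else ""

# tag.get("href") returns None for a missing attribute instead of raising KeyError
hrefs = [a.get("href") for a in soup.find_all("a")]
valid_hrefs = [href for href in hrefs if href]

print(repr(h1_text), valid_hrefs)
```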

Batch-Scraping Data

Loop over several pages to scrape data in bulk:

for i in range(1, 11):
    url = f"http://www.pidancode.com/page/{i}"
    res = requests.get(url, timeout=10)
    html_content = res.text
    soup = BeautifulSoup(html_content, "html.parser")
    # Process the data further
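
A fuller version of the loop might pause between requests, skip pages that fail, and collect results per page. The following is a sketch under those assumptions. The helper names `extract_links`, `scrape_pages`, and `fetch_with_session`, the one-second delay, and the choice to collect link targets are all hypothetical and are not taken from the original.

```python
import time

import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def scrape_pages(urls, fetch, delay=1.0):
    """Fetch each URL with `fetch` and map it to the links found on that page.

    `fetch` is any callable that takes a URL and returns HTML text or None.
    Pages that fail to download are skipped.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause between requests to go easy on the server
        html = fetch(url)
        if html is None:
            continue
        results[url] = extract_links(html)
    return results


def fetch_with_session(session):
    """Build a fetch callable that reuses one requests.Session."""
    def fetch(url):
        try:
            res = session.get(url, timeout=10)
            res.raise_for_status()
        except requests.RequestException:
            return None
        return res.text
    return fetch


# Example usage (performs real network requests, so it is left commented out):
# urls = [f"http://www.pidancode.com/page/{i}" for i in range(1, 11)]
# with requests.Session() as session:
#     links_by_page = scrape_pages(urls, fetch_with_session(session))
```

Passing the fetch function in as an argument keeps the loop independent of the network, which makes it easy to try out with canned HTML.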

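That covers the basic workflow for scraping website data with Python and BeautifulSoup: install the library, request a page, parse the HTML, locate tags, extract their data, and loop over multiple pages. With these steps in hand, the same approach can be applied to collecting and analyzing data from many kinds of sites.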
