Processing and analyzing web page data with BeautifulSoup and Pandas

2023-04-17

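BeautifulSoup is a Python library for extracting data from HTML or XML documents. Pandas is a Python library for processing and analyzing tabular data. Used together, they make it straightforward to pull data out of a web page and then analyze it.

The example below extracts article titles and links from a web page and analyzes them with Pandas.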

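Installing the required libraries

Install the required libraries with the following commands (requests is included because the fetching step below depends on it):

pip install requests
pip install beautifulsoup4
pip install pandas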

Fetching the data

Suppose we want to extract the titles and links of all articles on pidancode.com. The page content can be fetched with Python's requests library:

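import requests

url = 'https://www.pidancode.com/'
# A timeout keeps the script from hanging if the server does not respond
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
html = response.text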

The html variable now holds the HTML source of the pidancode.com home page.

Parsing the data

Parse the HTML with BeautifulSoup and collect the title and link of every article:

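from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
articles = soup.find_all('article')

data = []
for article in articles:
    # Take the first link that carries an href; skip articles without one
    link = article.find('a', href=True)
    if link is None:
        continue
    title = link.text.strip()
    href = link['href']  # separate name, so the page `url` above is not overwritten
    data.append({'title': title, 'url': href})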

The data variable now holds the title and link of each article found on the page.
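To try the parsing step without touching the network, the same logic can be run against an inline HTML string. The following is a minimal sketch: the markup in SAMPLE_HTML and the helper name extract_articles are hypothetical, chosen only for illustration, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page; the actual site may differ.
SAMPLE_HTML = """
<article><h2><a href="/2021/pandas-query/"> Pandas query </a></h2></article>
<article><p>No link here</p></article>
<article><a>Anchor without href</a></article>
<article><h2><a href="/2020/git-notes/">Git notes</a></h2></article>
"""


def extract_articles(html):
    """Collect title/url pairs, skipping <article> blocks without a usable link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        # href=True matches only anchors that actually have an href attribute
        link = article.find("a", href=True)
        if link is None:
            continue
        rows.append({"title": link.get_text(strip=True), "url": link["href"]})
    return rows


rows = extract_articles(SAMPLE_HTML)
print(rows)
```

Filtering on href=True means an article with no link, or with an anchor that has no href, is skipped rather than raising an error.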

Analyzing the data

Convert the data into a Pandas data frame and analyze it:

import pandas as pd

df = pd.DataFrame(data)
# Raw string avoids an invalid-escape warning; expand=False returns a Series
df['year'] = df['url'].str.extract(r'/(\d{4})/', expand=False)

print(df.head())
print(df['year'].value_counts())

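The code above converts the data into a data frame named df and uses a regular expression to extract the year from each article's URL. This assumes the URL contains a four-digit path segment such as /2021/; URLs without one get a missing value. The value_counts method then shows how many articles fall in each year.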
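The behavior of the year pattern is easier to see on a few hand-written URLs. This is a small sketch; the example.com URLs below are made up for illustration and do not come from the site.

```python
import pandas as pd

# Hypothetical URLs, invented for this illustration
urls = pd.Series([
    "https://example.com/2021/05/pandas-query/",
    "https://example.com/2020/git-notes/",
    "https://example.com/about/",
])

# The pattern captures the first four-digit segment enclosed by slashes.
# expand=False returns a Series; rows with no match become missing values.
years = urls.str.extract(r"/(\d{4})/", expand=False)
print(years)

# value_counts ignores missing values by default
print(years.value_counts())
```

Note that the extracted years are strings, not integers, and that the third URL yields a missing value because it has no four-digit segment.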

The complete code is as follows:

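import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the page; fail early on timeouts and HTTP error statuses
url = 'https://www.pidancode.com/'
response = requests.get(url, timeout=10)
response.raise_for_status()
html = response.text

# Parse the HTML and locate every <article> element
soup = BeautifulSoup(html, 'html.parser')
articles = soup.find_all('article')

data = []
for article in articles:
    # Take the first link that carries an href; skip articles without one
    link = article.find('a', href=True)
    if link is None:
        continue
    title = link.text.strip()
    href = link['href']  # separate name, so the page `url` above is not overwritten
    data.append({'title': title, 'url': href})

# Build the data frame and extract the year from each URL
df = pd.DataFrame(data)
df['year'] = df['url'].str.extract(r'/(\d{4})/', expand=False)

print(df.head())
print(df['year'].value_counts())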

The program prints the following:

                                               title  ...  year
0     Pandas数据框df.query('columnName == "value"')  ...  2021
1   Pandas索引iloc函数的使用方法（下标访问）与loc（标签访问）  ...  2021
2  Pandas concat方法的使用：合并多个数据框（行方向和列方向）  ...  2021
3                            Git使用总结（Git Bash）  ...  2021
4                Python doctest模块实现文档和测试的合一  ...  2021

[5 rows x 3 columns]

2021    82
2020    47
2019     7
2018     2
2017     2
2016     1
Name: year, dtype: int64

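The first five rows of the data frame show each article's title, link, and year (the middle column is elided in the printed output), and the final block lists the number of articles per year.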
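value_counts orders its result by frequency, which happens to coincide with descending year in the output above. When a chronological view is wanted regardless of the counts, sort_index can be applied to the result. A minimal sketch, using made-up year values rather than the site's data:

```python
import pandas as pd

# Made-up year strings for illustration; None stands for a URL with no year
years = pd.Series(["2021", "2020", "2021", None, "2019", "2021"])

# Count per year, then order by the year label instead of by frequency
counts = years.value_counts().sort_index()
print(counts)

# Number of rows where no year could be extracted
missing = int(years.isna().sum())
print(missing)
```

Checking the number of missing years alongside the counts helps spot articles whose URLs do not follow the expected pattern.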
