Efficient Storage and Retrieval of Web Page Data with BeautifulSoup
The steps are as follows:
- Import BeautifulSoup and the other required modules:

```python
from bs4 import BeautifulSoup
import requests
import json
```
- Fetch the page content with the requests module:

```python
url = "https://www.pidancode.com/"
response = requests.get(url)
html = response.content
```
- Convert the fetched HTML into a BeautifulSoup object so it can be parsed:

```python
soup = BeautifulSoup(html, 'html.parser')
```
- Retrieve the data of interest and store it:

```python
# Get the page title and store it as a string
page_title = soup.title.string

# Collect the title and link of every article as a list of dicts
articles = []
for article in soup.find_all('article'):
    title = article.h2.a.string
    link = article.h2.a['href']
    articles.append({'title': title, 'link': link})

# Store the page title and article list as a JSON file
data = {'title': page_title, 'articles': articles}
with open('pidancode.json', 'w') as f:
    json.dump(data, f)
```
- Read the stored data back and query it:

```python
# Read the data back from the JSON file
with open('pidancode.json', 'r') as f:
    data = json.load(f)
print(data)

# Look up article links by a keyword in the title
for article in data['articles']:
    if '皮蛋编程' in article['title']:
        print(article['link'])
```
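The extraction loop above assumes every `<article>` contains an `<h2>` wrapping a link; on real pages that structure can vary, and `article.h2.a` raises `AttributeError` when a tag is missing. A minimal sketch of the same loop with a guard, run on hypothetical inline HTML instead of a live request:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the fetched page
html = """
<html><head><title>pidancode</title></head><body>
<article><h2><a href="/a">皮蛋编程入门</a></h2></article>
<article><h2>no link here</h2></article>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
articles = []
for article in soup.find_all('article'):
    # article.h2 is None if there is no <h2>; .a is None if the <h2> has no link
    link_tag = article.h2.a if article.h2 else None
    if link_tag is None:
        continue  # skip articles without the expected h2 > a structure
    articles.append({'title': link_tag.string, 'link': link_tag['href']})

print(articles)  # only the article that actually has a link survives
```

The guard costs one extra line per lookup but keeps a single malformed article from aborting the whole scrape.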
The complete code:

```python
from bs4 import BeautifulSoup
import requests
import json

# Fetch the page content
url = "https://www.pidancode.com/"
response = requests.get(url)
html = response.content

# Convert the HTML into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Get the page title and store it as a string
page_title = soup.title.string

# Collect the title and link of every article as a list of dicts
articles = []
for article in soup.find_all('article'):
    title = article.h2.a.string
    link = article.h2.a['href']
    articles.append({'title': title, 'link': link})

# Store the page title and article list as a JSON file
data = {'title': page_title, 'articles': articles}
with open('pidancode.json', 'w') as f:
    json.dump(data, f)

# Read the data back from the JSON file
with open('pidancode.json', 'r') as f:
    data = json.load(f)
print(data)

# Look up article links by a keyword in the title
for article in data['articles']:
    if '皮蛋编程' in article['title']:
        print(article['link'])
```
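The keyword search above scans the whole list on every query. When many exact-title lookups are needed, one option is to build a dict index once after loading the JSON; this sketch assumes titles are unique and uses hypothetical data in the same shape as `pidancode.json`:

```python
# Hypothetical data in the same shape as the stored pidancode.json
data = {
    'title': 'pidancode',
    'articles': [
        {'title': '皮蛋编程入门', 'link': '/a'},
        {'title': 'BeautifulSoup 教程', 'link': '/b'},
    ],
}

# Build a title -> link index once (assumes titles are unique)
index = {a['title']: a['link'] for a in data['articles']}

# Constant-time exact-title lookup instead of scanning the list
print(index.get('BeautifulSoup 教程'))
print(index.get('不存在的标题'))  # missing titles return None
```

For substring queries like `'皮蛋编程' in title`, the linear scan is still the straightforward choice; the index only pays off for repeated exact-match lookups.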