Efficiently Storing and Retrieving Web Page Data with BeautifulSoup

2023-04-17

The steps are as follows:

  1. Import BeautifulSoup and the other modules needed:
from bs4 import BeautifulSoup
import requests
import json
  2. Fetch the page with the requests module:
url = "https://www.pidancode.com/"
response = requests.get(url)
html = response.content
  3. Parse the downloaded HTML into a BeautifulSoup object:
soup = BeautifulSoup(html, 'html.parser')
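Before pointing the parser at a live page, the same parsing step can be sketched against a small in-memory snippet (the markup below is hypothetical; it only mirrors the tag structure the later steps rely on):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = ("<html><head><title>Demo</title></head>"
        "<body><article><h2><a href='/post/1'>First post</a></h2>"
        "</article></body></html>")

soup = BeautifulSoup(html, 'html.parser')
title_text = soup.title.string          # text of the <title> tag
first_link = soup.article.h2.a['href']  # attribute-style access on nested tags
```

Tag names can be chained as attributes (`soup.article.h2.a`), which is exactly the access pattern used in the extraction loop below.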
  4. Extract the data of interest and store it:
# Get the page title as a string
page_title = soup.title.string

# Collect every article's title and link into a list of dicts
articles = []
for article in soup.find_all('article'):
    title = article.h2.a.string
    link = article.h2.a['href']
    articles.append({'title': title, 'link': link})

# Save the title and the article list to a JSON file
data = {'title': page_title, 'articles': articles}
with open('pidancode.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)
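When the stored titles contain Chinese characters, `json.dump`'s `ensure_ascii` parameter matters: by default every non-ASCII character is written as a `\uXXXX` escape, while `ensure_ascii=False` keeps the file human-readable. A minimal illustration with a sample record (not real scraped data):

```python
import json

# Sample record; the title is deliberately non-ASCII.
record = {'title': '皮蛋编程'}

escaped = json.dumps(record)                      # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters
```

Either form loads back identically with `json.load`/`json.loads`; the difference is only in how the file looks on disk.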
  5. Read the stored data back and query it:
# Read the data back from the JSON file
with open('pidancode.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
print(data)

# Look up article links by a keyword in the title
for article in data['articles']:
    if '皮蛋编程' in article['title']:
        print(article['link'])
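The loop above rescans the whole list on every query. For repeated lookups, a dict keyed by title gives constant-time retrieval, which fits this article's theme of efficient retrieval. A sketch with hypothetical sample entries mirroring the JSON structure built above:

```python
# Sample data in the same shape as the stored JSON (entries are made up).
data = {'articles': [
    {'title': '皮蛋编程入门', 'link': '/post/1'},
    {'title': 'BeautifulSoup 教程', 'link': '/post/2'},
]}

# Build a title -> link index once, then look up in O(1).
index = {a['title']: a['link'] for a in data['articles']}
link = index.get('皮蛋编程入门')  # None if the title is absent
```

This assumes titles are unique; duplicate titles would need a list of links per key instead.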

The complete code:

from bs4 import BeautifulSoup
import requests
import json

# Fetch the page
url = "https://www.pidancode.com/"
response = requests.get(url)
html = response.content

# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Get the page title as a string
page_title = soup.title.string

# Collect every article's title and link into a list of dicts
articles = []
for article in soup.find_all('article'):
    title = article.h2.a.string
    link = article.h2.a['href']
    articles.append({'title': title, 'link': link})

# Save the title and the article list to a JSON file
data = {'title': page_title, 'articles': articles}
with open('pidancode.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# Read the data back from the JSON file
with open('pidancode.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
print(data)

# Look up article links by a keyword in the title
for article in data['articles']:
    if '皮蛋编程' in article['title']:
        print(article['link'])
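The extraction part of the pipeline can also be exercised offline, without a network request or a file on disk, by feeding it a hardcoded HTML string that mimics the structure the scraper expects (the markup and entries below are hypothetical):

```python
from bs4 import BeautifulSoup
import json

# Hypothetical page with the <article>/<h2>/<a> structure assumed above.
html = ("<html><head><title>pidancode</title></head><body>"
        "<article><h2><a href='/a'>Post A</a></h2></article>"
        "<article><h2><a href='/b'>Post B</a></h2></article>"
        "</body></html>")

soup = BeautifulSoup(html, 'html.parser')
articles = [{'title': a.h2.a.string, 'link': a.h2.a['href']}
            for a in soup.find_all('article')]

# Serialize in memory instead of writing a file.
payload = json.dumps({'title': soup.title.string, 'articles': articles},
                     ensure_ascii=False)
```

This kind of fixture is also a convenient way to test the scraper when the live site changes or is unreachable.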
