Efficient Storage and Retrieval of Web Page Data with BeautifulSoup
The steps are as follows:
- Import BeautifulSoup and the other required modules:

```python
from bs4 import BeautifulSoup
import requests
import json
```
- Fetch the page content with the requests module:

```python
url = "https://www.pidancode.com/"
response = requests.get(url)
html = response.content
```
- Convert the fetched HTML into a BeautifulSoup object so it can be parsed:

```python
soup = BeautifulSoup(html, 'html.parser')
```
- Retrieve the data of interest and store it:

```python
# Get the page title and store it as a string
page_title = soup.title.string

# Collect the title and link of every article as a list of dicts
articles = []
for article in soup.find_all('article'):
    title = article.h2.a.string
    link = article.h2.a['href']
    articles.append({'title': title, 'link': link})

# Store the page title and article list as a JSON file
data = {'title': page_title, 'articles': articles}
with open('pidancode.json', 'w') as f:
    json.dump(data, f)
```
- Read the stored data back and query it:

```python
# Read the data back from the JSON file
with open('pidancode.json', 'r') as f:
    data = json.load(f)
print(data)

# Look up article links by a keyword in the title
for article in data['articles']:
    if '皮蛋编程' in article['title']:
        print(article['link'])
```
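The extraction loop above assumes every `<article>` contains an `<h2>` wrapping a link; on real pages that structure can vary, and `article.h2.a` raises `AttributeError` when a tag is missing. A minimal sketch of the same loop with a guard, run on hypothetical inline HTML instead of a live request:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the fetched page
html = """
<html><head><title>pidancode</title></head><body>
<article><h2><a href="/a">皮蛋编程入门</a></h2></article>
<article><h2>no link here</h2></article>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
articles = []
for article in soup.find_all('article'):
    # article.h2 is None if there is no <h2>; .a is None if the <h2> has no link
    link_tag = article.h2.a if article.h2 else None
    if link_tag is None:
        continue  # skip articles without the expected h2 > a structure
    articles.append({'title': link_tag.string, 'link': link_tag['href']})

print(articles)  # only the article that actually has a link survives
```

The guard costs one extra line per lookup but keeps a single malformed article from aborting the whole scrape.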
The complete code:

```python
from bs4 import BeautifulSoup
import requests
import json

# Fetch the page content
url = "https://www.pidancode.com/"
response = requests.get(url)
html = response.content

# Convert the HTML into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Get the page title and store it as a string
page_title = soup.title.string

# Collect the title and link of every article as a list of dicts
articles = []
for article in soup.find_all('article'):
    title = article.h2.a.string
    link = article.h2.a['href']
    articles.append({'title': title, 'link': link})

# Store the page title and article list as a JSON file
data = {'title': page_title, 'articles': articles}
with open('pidancode.json', 'w') as f:
    json.dump(data, f)

# Read the data back from the JSON file
with open('pidancode.json', 'r') as f:
    data = json.load(f)
print(data)

# Look up article links by a keyword in the title
for article in data['articles']:
    if '皮蛋编程' in article['title']:
        print(article['link'])
```
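The keyword search above scans the whole list on every query. When many exact-title lookups are needed, one option is to build a dict index once after loading the JSON; this sketch assumes titles are unique and uses hypothetical data in the same shape as `pidancode.json`:

```python
# Hypothetical data in the same shape as the stored pidancode.json
data = {
    'title': 'pidancode',
    'articles': [
        {'title': '皮蛋编程入门', 'link': '/a'},
        {'title': 'BeautifulSoup 教程', 'link': '/b'},
    ],
}

# Build a title -> link index once (assumes titles are unique)
index = {a['title']: a['link'] for a in data['articles']}

# Constant-time exact-title lookup instead of scanning the list
print(index.get('BeautifulSoup 教程'))
print(index.get('不存在的标题'))  # missing titles return None
```

For substring queries like `'皮蛋编程' in title`, the linear scan is still the straightforward choice; the index only pays off for repeated exact-match lookups.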