Python BeautifulSoup Web Scraping Tips

2023-04-17 00:00:00 web, tips, scraping

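Installing BeautifulSoup

Install BeautifulSoup in your Python environment with the following command (the examples below also use the requests library, which is installed the same way with pip install requests):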

pip install beautifulsoup4

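About Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML. It does not download pages itself: it parses markup you have already fetched and lets you extract data from it, which makes it a common tool for web scraping.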
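
As a quick illustration of what parsing means here, the following minimal sketch builds a tree from an inline HTML string and reads a few values from it. The HTML snippet and the variable names are made up for this example; nothing is fetched from the network.

```python
from bs4 import BeautifulSoup

# A small inline document (invented for illustration) so no network access is needed.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags are reachable as attributes of the tree; attributes use dict-style access.
page_title = soup.title.get_text()
heading_id = soup.h1["id"]
intro_text = soup.find("p", class_="intro").get_text()

print(page_title, heading_id, intro_text)
```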

Fetching Web Page Data

The page content can be fetched with Python's requests library:

import requests

url = 'http://pidancode.com'

response = requests.get(url)

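The response variable now holds the server's response for pidancode.com. The HTML itself is available as response.content (raw bytes) or response.text (a decoded string).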
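
In practice a request can time out or come back with an error status, so it may be worth wrapping the call. The sketch below is one possible approach, not the only one: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper; the default timeout is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for timeouts, connection errors, bad URLs, etc.
        print(f"Request failed: {exc}")
        return None
    return response.content
```

Calling fetch_html('http://pidancode.com') would then return the bytes to hand to the parser, or None if anything went wrong along the way.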

Parsing the HTML Content

Parsing the fetched HTML lets us extract the information we need.

First, convert the HTML content into a BeautifulSoup object:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')
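
Passing bytes (as response.content does) lets Beautiful Soup work out the encoding itself, for instance from a meta charset declaration. A minimal sketch, using made-up bytes in place of a real response:

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive in response.content; this document is invented.
raw = (
    '<html><head><meta charset="utf-8"><title>Café menu</title></head>'
    "<body></body></html>"
).encode("utf-8")

# Beautiful Soup decodes the bytes to text while building the tree.
soup = BeautifulSoup(raw, "html.parser")
title = soup.title.string

print(title)
```

'html.parser' is the parser bundled with Python's standard library, so it needs no extra installation.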

We can now use the BeautifulSoup object to locate the information we want to extract. Here are a few examples.

Find all links:

links = soup.find_all('a')

for link in links:
    print(link.get('href'))
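
Anchors without an href attribute yield None, and many href values are relative paths. The sketch below shows one way to handle both, assuming http://pidancode.com as the base URL and an invented HTML snippet:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://pidancode.com"

# Invented markup: one relative link, one absolute link, one anchor with no href.
html = """
<a href="/about/">About</a>
<a href="http://pidancode.com/python/">Python</a>
<a name="top">No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute;
# urljoin resolves relative paths against the base URL.
absolute_links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

print(absolute_links)
```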

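Find a specific tag:

articles = soup.find_all('article')

for article in articles:
    # article.h3 is None when an <article> has no <h3>, so check before reading its text
    if article.h3 is not None:
        print(article.h3.text)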

Find elements by attribute:

divs = soup.find_all('div', {'class': 'post'})

for div in divs:
    # Skip any matching <div> that has no <h2> child
    if div.h2 is not None:
        print(div.h2.text)
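
The same class lookup can also be written with the class_ keyword or with a CSS selector. The following sketch compares the two on an invented snippet. Note that a class filter matches any element that has the class among its classes, not only elements whose class attribute is exactly 'post'.

```python
from bs4 import BeautifulSoup

# Invented markup: two posts (one with an extra class) and one unrelated div.
html = """
<div class="post featured"><h2>First post</h2></div>
<div class="post"><h2>Second post</h2></div>
<div class="sidebar"><h2>Not a post</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python.
by_keyword = soup.find_all("div", class_="post")

# select() takes a CSS selector and returns a list of matching tags.
by_selector = soup.select("div.post")

titles = [div.h2.get_text(strip=True) for div in by_keyword]
print(titles)
```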

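Find specific text:

# string= replaces the deprecated text= argument and matches the tag's full text exactly
p_elements = soup.find_all('p', string='皮蛋编程')

for p in p_elements:
    print(p.text)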
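
Filtering on a plain string matches only tags whose entire text equals that string. To match a substring, a compiled regular expression can be passed instead. A minimal sketch on invented paragraphs, using the string argument (named text in older Beautiful Soup releases):

```python
import re

from bs4 import BeautifulSoup

# Invented markup: one exact match, one paragraph that only contains the word.
html = """
<p>pidancode</p>
<p>Welcome to pidancode tutorials</p>
<p>Something else</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's whole text.
exact = soup.find_all("p", string="pidancode")

# A compiled pattern is applied as a search, so substrings match too.
partial = soup.find_all("p", string=re.compile("pidancode"))

print(len(exact), len(partial))
```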

Combining BeautifulSoup and requests

The following example combines BeautifulSoup and requests to scrape the title and link of every article on pidancode.com:

import requests
from bs4 import BeautifulSoup

url = 'http://pidancode.com'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

articles = soup.find_all('article')

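for article in articles:
    heading_link = article.h2.a if article.h2 else None
    # Skip articles without an <h2><a> link instead of raising AttributeError
    if heading_link is None:
        continue
    link = heading_link.get('href')
    title = heading_link.text
    print(f'{title} - {link}')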

Sample output (titles are printed as they appear on the site):

Python中的面向对象编程 - http://pidancode.com/python-object-oriented-programming/
Python函数参数详解 - http://pidancode.com/python-function-arguments/
Python的字符串操作 - http://pidancode.com/python-string-manipulation/
Python的列表与元组的比较 - http://pidancode.com/python-lists-vs-tuples/
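
To make the extraction step easy to test without hitting the network, it can be pulled into a function that takes HTML as input. This is a sketch under assumptions: extract_articles is a hypothetical name, the sample markup is invented, and the article/h2/a structure mirrors the one assumed in the example above rather than the site's confirmed layout.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_articles(html, base_url):
    """Return (title, absolute_link) pairs for every <article> with an <h2><a> link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        # select_one returns None when the article has no linked <h2>.
        anchor = article.select_one("h2 a[href]")
        if anchor is None:
            continue
        title = anchor.get_text(strip=True)
        link = urljoin(base_url, anchor["href"])
        results.append((title, link))
    return results


# Invented sample: one well-formed article and one without a heading link.
sample = """
<article><h2><a href="/python-function-arguments/">Python function arguments</a></h2></article>
<article><p>Teaser without a heading</p></article>
"""
articles = extract_articles(sample, "http://pidancode.com")
print(articles)
```

With live data, the html argument would be the response.content obtained from requests.get.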

This article covered the essentials of web scraping with BeautifulSoup in Python: installing the library, fetching page data, parsing HTML content, and combining BeautifulSoup with requests.
