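Python BeautifulSoup Web Crawler Example

2023-04-17 00:00:00 crawler, web, example

This post walks through a Python BeautifulSoup web crawler example that scrapes an article from the "pidancode.com" website.

First, import the required libraries: requests and BeautifulSoup.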

import requests
from bs4 import BeautifulSoup

Next, define a function that fetches the HTML source of a page.

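def get_html(url):
    """Fetch a page and return its HTML, or "" if the request fails."""
    try:
        # A timeout keeps the crawler from hanging on an unresponsive server.
        response = requests.get(url, timeout=10)
        # Raise requests.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
        # Use the encoding detected from the body instead of the header default.
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException:
        # Covers connection errors, timeouts, and HTTP errors.
        return ""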

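Here requests.get(url) sends a GET request, response.raise_for_status() raises an exception if the server returned a 4xx or 5xx status code, response.encoding = response.apparent_encoding sets the encoding to the one detected from the response body, and response.text returns the decoded HTML source.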
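To make the role of raise_for_status() more concrete, here is a minimal sketch that runs without network access. It builds requests.Response objects by hand purely for illustration; the helper name status_ok is an assumption and is not part of the original crawler.

```python
import requests


def status_ok(status_code):
    """Return True if raise_for_status() accepts this status code."""
    # Hypothetical helper: build a Response by hand so no request is sent.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
        return True
    except requests.HTTPError:
        # Raised for 4xx (client error) and 5xx (server error) codes.
        return False


for code in (200, 404, 500):
    print(code, status_ok(code))
```

In get_html the same exception is what sends a failed request to the except branch, so the function returns an empty string instead of the body of an error page.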

Then, define a function that parses the HTML source and extracts the article title and content.

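def get_content(html):
    """Parse the HTML and return (title, content), or ("", "") if either is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    try:
        # select() returns a list of matches; [0] takes the first one.
        title = soup.select('.entry-title')[0].get_text()
        content = soup.select('.entry-content')[0].get_text()
        return title, content
    except IndexError:
        # No element matched one of the selectors.
        return "", ""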

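Here BeautifulSoup(html, 'html.parser') turns the HTML source into a BeautifulSoup object, soup.select('.entry-title')[0].get_text() returns the text of the first element with the class entry-title (the article title), and soup.select('.entry-content')[0].get_text() returns the text of the first element with the class entry-content (the article body).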
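The selectors can be tried out on an inline HTML snippet before pointing the crawler at a real page. The sketch below is one possible variant, not the code used above: it uses select_one(), which returns None when nothing matches, in place of select()[0]. The sample markup and the function name extract are made up for illustration; only the class names come from the example above.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class names used by the crawler.
SAMPLE_HTML = """
<html><body>
  <h1 class="entry-title"> Sample title </h1>
  <div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract(html):
    """Return (title, content), or ("", "") if either element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one(".entry-title")
    content_tag = soup.select_one(".entry-content")
    if title_tag is None or content_tag is None:
        return "", ""
    # strip=True trims surrounding whitespace from each text fragment.
    title = title_tag.get_text(strip=True)
    # separator="\n" keeps the paragraphs on separate lines.
    content = content_tag.get_text(separator="\n", strip=True)
    return title, content


print(extract(SAMPLE_HTML))
print(extract("<p>no matching classes here</p>"))
```

Checking for None explicitly avoids relying on an exception for the "element not found" case, and stripping the title helps if it is later used as a file name.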

Finally, define a main function that calls the two functions above and saves the article.

def main():
    url = "https://pidancode.com/python-beautifulsoup-web-crawler"
    html = get_html(url)
    title, content = get_content(html)
    if title != "" and content != "":
        # Note: the title is used as the file name as-is, so it must not
        # contain characters that are invalid in file names.
        with open(title + '.txt', 'w', encoding='utf-8') as f:
            f.write(content)
        print("Article saved:", title)

if __name__ == "__main__":
    main()

Here url is the link of the page to crawl, title + '.txt' is the name of the file the article is saved to, and f.write(content) writes the article content to that file.
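Because the file name is built directly from the scraped title, a title containing characters such as "/" or ":" could make open() fail or write to an unexpected path. The following is a minimal sketch of one way to guard against that; the function safe_filename and its default value are assumptions for illustration and are not part of the original script.

```python
import re


def safe_filename(title, default="article"):
    """Turn a scraped title into a file name ending in .txt."""
    # Replace characters that common file systems reject, plus line breaks and tabs.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    # Drop leading and trailing spaces and dots.
    cleaned = cleaned.strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or default) + ".txt"


print(safe_filename("Python: A/B test?"))
print(safe_filename("   "))
```

With such a helper, the call in main could become open(safe_filename(title), 'w', encoding='utf-8').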

Now run the script to crawl the article and save it.

Output:

Article saved: Python BeautifulSoup Web Crawler Example
