Python BeautifulSoup Tips for Big Data Analysis

2023-04-17 00:00:00 Tips, Big Data Analysis

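BeautifulSoup is one of the most popular HTML/XML parsing libraries for Python, and it is a practical tool for extracting data from markup in big data analysis work. The following tips cover common ways to use it.
1. Filter data with CSS selectors
BeautifulSoup supports CSS selectors through its select() method, which makes it easy to filter data in HTML/XML documents. Sample code: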

from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <div class="article">
        <h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
        <div class="author">Author: PidanCoder</div>
        <div class="content">
            <p>BeautifulSoup is a python library...</p>
            <p>It is often used for web scraping...</p>
        </div>
    </div>
    <div class="comment">
        <h2 class="title">Comments</h2>
        <div class="author">Author: John</div>
        <div class="content">
            <p>Great article, thanks!</p>
        </div>
    </div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
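# Select <div> tags with class "article"
article_div = soup.select('div.article')
# Select <h2> tags with class "title"
title_h2 = soup.select('h2.title')
# Select the author <div> whose text contains "PidanCoder"
# (:-soup-contains replaces the deprecated :contains pseudo-class)
author_div = soup.select('div.author:-soup-contains("PidanCoder")')
# Select all <p> tags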
p_tags = soup.select('p')
print(article_div)
print(title_h2)
print(author_div)
print(p_tags)

Output:

[<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>]
[<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>, <h2 class="title">Comments</h2>]
[<div class="author">Author: PidanCoder</div>]
[<p>BeautifulSoup is a python library...</p>, <p>It is often used for web scraping...</p>, <p>Great article, thanks!</p>]
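
As a complement to the example above, the following is a minimal sketch of a few other selector forms: select_one() for a single match, the child combinator, and an attribute selector. It uses a trimmed copy of the sample document, and the variable names (link_url, comment_author, https_links) are illustrative assumptions rather than part of the original example.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document used in this article.
html_doc = """
<div class="article">
    <h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
    <div class="author">Author: PidanCoder</div>
</div>
<div class="comment">
    <h2 class="title">Comments</h2>
    <div class="author">Author: John</div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# select_one() returns the first matching tag, or None when nothing matches.
link = soup.select_one('div.article h2.title a')
link_url = link['href'] if link is not None else None
link_text = link.get_text(strip=True) if link is not None else None

# The child combinator (>) restricts the match to direct children.
comment_author = soup.select_one('div.comment > div.author').get_text(strip=True)

# An attribute selector: every <a> whose href starts with "https://".
https_links = [a['href'] for a in soup.select('a[href^="https://"]')]

print(link_url, link_text, comment_author, https_links)
```

Checking the result of select_one() for None before indexing it avoids a TypeError on pages where the expected element is missing.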

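2. Filter data with regular expressions
Some data in an HTML/XML document can only be matched by pattern. BeautifulSoup's search methods accept compiled patterns from Python's re module as filters for attribute values and text, so regular expressions fit in naturally. Sample code: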

import re
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <div class="article">
        <h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
        <div class="author">Author: PidanCoder</div>
        <div class="content">
            <p>BeautifulSoup is a python library...</p>
            <p>It is often used for web scraping...</p>
        </div>
    </div>
    <div class="comment">
        <h2 class="title">Comments</h2>
        <div class="author">Author: John</div>
        <div class="content">
            <p>Great article, thanks!</p>
        </div>
    </div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
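# Select all <a> tags whose href contains pidancode.com
a_tags = soup.find_all('a', href=re.compile(r'pidancode\.com'))
# Select all <div> tags whose text contains "Author:"
# (string= is the current name of the older text= argument)
author_divs = soup.find_all('div', string=re.compile(r'Author:'))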
print(a_tags)
print(author_divs)

Output:

[<a href="https://pidancode.com">pidancode.com</a>]
[<div class="author">Author: PidanCoder</div>, <div class="author">Author: John</div>]
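
Matching a tag is often only the first step; the value inside it usually still has to be pulled out. The sketch below is one possible way to do that with a capture group, plus a pattern applied to the class attribute. The pattern and the names author_pattern, authors, and a_classes are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# A trimmed copy of the sample document used in this article.
html_doc = """
<div class="article">
    <div class="author">Author: PidanCoder</div>
</div>
<div class="comment">
    <div class="author">Author: John</div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# The capture group keeps only the name that follows "Author:".
author_pattern = re.compile(r'Author:\s*(\w+)')

authors = []
for div in soup.find_all('div', class_='author'):
    match = author_pattern.search(div.get_text())
    if match:
        authors.append(match.group(1))

# A compiled pattern can also filter on attribute values,
# here every <div> that has a class name starting with "a".
a_divs = soup.find_all('div', class_=re.compile(r'^a'))
a_classes = [div['class'][0] for div in a_divs]

print(authors)
print(a_classes)
```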

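3. Loop over the data with iteration
When processing a large amount of data, you usually handle the matched elements one at a time in a loop. find_all() and select() return list-like results that can be traversed directly with a for loop. Note that these results are built in full before the loop starts, so they do not save memory by themselves; generators such as .descendants and .stripped_strings yield items lazily instead. Sample code: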

from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <div class="article">
        <h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
        <div class="author">Author: PidanCoder</div>
        <div class="content">
            <p>BeautifulSoup is a python library...</p>
            <p>It is often used for web scraping...</p>
        </div>
    </div>
    <div class="comment">
        <h2 class="title">Comments</h2>
        <div class="author">Author: John</div>
        <div class="content">
            <p>Great article, thanks!</p>
        </div>
    </div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
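# Loop over the top-level <div> tags with find_all();
# recursive=False skips the nested author and content divs
for div in soup.body.find_all('div', recursive=False):
    print(div)
# Loop over all <p> tags with select()
for p in soup.select('p'):
    print(p)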

Output:

<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="comment">
<h2 class="title">Comments</h2>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
<p>Great article, thanks!</p>
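
For truly lazy traversal, a generator can be layered on top of .descendants, which BeautifulSoup exposes as a generator over every node in the tree. The following is a minimal sketch under that approach; iter_paragraph_texts is a hypothetical helper name, not a BeautifulSoup API. The parsed tree itself still lives in memory, so this only avoids building an extra result list.

```python
from bs4 import BeautifulSoup, Tag

# A trimmed copy of the sample document used in this article.
html_doc = """
<div class="content">
    <p>BeautifulSoup is a python library...</p>
    <p>It is often used for web scraping...</p>
</div>
<div class="content">
    <p>Great article, thanks!</p>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')


def iter_paragraph_texts(tree):
    """Yield the text of each <p> tag one at a time (hypothetical helper)."""
    # .descendants walks the tree lazily instead of returning a list.
    for node in tree.descendants:
        if isinstance(node, Tag) and node.name == 'p':
            yield node.get_text(strip=True)


# Consume the generator without materializing a list of results.
paragraph_count = sum(1 for _ in iter_paragraph_texts(soup))

# next() pulls only the first item and stops walking the tree there.
first_text = next(iter_paragraph_texts(soup))

print(paragraph_count, first_text)
```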

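4. Store data in dictionaries and lists
During analysis, the filtered data is usually stored in dictionaries or lists so that it can be processed further. Sample code: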

from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
    <title>pidancode.com</title>
</head>
<body>
    <div class="article">
        <h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
        <div class="author">Author: PidanCoder</div>
        <div class="content">
            <p>BeautifulSoup is a python library...</p>
            <p>It is often used for web scraping...</p>
        </div>
    </div>
    <div class="comment">
        <h2 class="title">Comments</h2>
        <div class="author">Author: John</div>
        <div class="content">
            <p>Great article, thanks!</p>
        </div>
    </div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Store one dictionary per top-level <div> in a list
data_list = []
for div in soup.body.find_all('div', recursive=False):
    data = {}
    data['class'] = div['class'][0]
    # strip=True trims each text fragment; ' ' joins the fragments
    data['text'] = div.get_text(' ', strip=True)
    data_list.append(data)
print(data_list)
# Store the text of every <p> tag in a list
p_list = []
for p in soup.select('p'):
    p_list.append(p.get_text())
print(p_list)

Output:

[{'class': 'article', 'text': 'pidancode.com Author: PidanCoder BeautifulSoup is a python library... It is often used for web scraping...'}, {'class': 'comment', 'text': 'Comments Author: John Great article, thanks!'}]
['BeautifulSoup is a python library...', 'It is often used for web scraping...', 'Great article, thanks!']
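
Once the data sits in a list of dictionaries, it can be handed to other tools. As a hedged illustration, the sketch below builds slightly richer records and writes them as CSV with the standard library. The field names (kind, author, paragraph_count) and the in-memory buffer are assumptions chosen for this example; a real pipeline would more likely write to a file.

```python
import csv
import io
from bs4 import BeautifulSoup

# A trimmed copy of the sample document used in this article.
html_doc = """
<body>
<div class="article">
    <div class="author">Author: PidanCoder</div>
    <div class="content">
        <p>BeautifulSoup is a python library...</p>
        <p>It is often used for web scraping...</p>
    </div>
</div>
<div class="comment">
    <div class="author">Author: John</div>
    <div class="content">
        <p>Great article, thanks!</p>
    </div>
</div>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Build one record (dictionary) per top-level block.
records = []
for block in soup.select('body > div'):
    author = block.select_one('div.author').get_text(strip=True)
    paragraphs = block.select('div.content p')
    records.append({
        'kind': block['class'][0],
        # Keep only the part after "Author:".
        'author': author.split(':', 1)[1].strip(),
        'paragraph_count': len(paragraphs),
    })

# Write the records as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['kind', 'author', 'paragraph_count'])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(csv_text)
```

A list of flat dictionaries like this maps directly onto table rows, which is why it is a convenient intermediate format before loading the data into an analysis tool.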
