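Python BeautifulSoup Tips for Big Data Analysis
BeautifulSoup is one of the most popular HTML/XML parsing libraries for Python, and it is also a practical tool for extracting the data that feeds large-scale analysis. Below are a few BeautifulSoup techniques that are useful in that setting.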
1. Filter data with CSS selectors
BeautifulSoup supports CSS selectors through its select() method, which makes it easy to pick out data from HTML/XML. Example:
```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>pidancode.com</title>
</head>
<body>
<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="comment">
<h2 class="title">Comments</h2>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Select div tags with class "article"
article_div = soup.select('div.article')

# Select h2 tags with class "title"
title_h2 = soup.select('h2.title')

# Select the author div whose text contains "PidanCoder".
# :-soup-contains() replaces the deprecated :contains() pseudo-class.
author_div = soup.select('div.author:-soup-contains("PidanCoder")')

# Select all p tags
p_tags = soup.select('p')

print(article_div)
print(title_h2)
print(author_div)
print(p_tags)
```
Output:
```
[<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>]
[<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>, <h2 class="title">Comments</h2>]
[<div class="author">Author: PidanCoder</div>]
[<p>BeautifulSoup is a python library...</p>, <p>It is often used for web scraping...</p>, <p>Great article, thanks!</p>]
```
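Selectors usually serve as a first step toward pulling out individual values. The following is a minimal sketch, assuming a trimmed copy of the sample markup, of how select_one() and an attribute selector might be used to extract a link's text and URL. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (an assumption for this sketch)
html_doc = """
<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select_one() returns the first match, or None when nothing matches
link = soup.select_one("div.article h2.title a")
title = link.get_text(strip=True) if link else None
url = link.get("href") if link else None

# Attribute selector: every <a> whose href starts with "https://"
secure_links = [a["href"] for a in soup.select('a[href^="https://"]')]

print(title, url, secure_links)
```

Checking for None before reading attributes keeps the extraction from failing on pages where the expected element is missing.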
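2. Filter data with regular expressions
Some data cannot be matched by tag name or class alone and is easier to filter with a regular expression. BeautifulSoup's search methods accept patterns compiled with Python's re module as filters for tag names, attribute values, and strings. Example: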
```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>pidancode.com</title>
</head>
<body>
<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="comment">
<h2 class="title">Comments</h2>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Select all a tags whose href contains "pidancode.com"
a_tags = soup.find_all('a', href=re.compile(r'pidancode\.com'))

# Select all div tags whose string contains "Author:".
# string= replaces the older text= argument.
author_divs = soup.find_all('div', string=re.compile(r'Author:'))

print(a_tags)
print(author_divs)
```
Output:
```
[<a href="https://pidancode.com">pidancode.com</a>]
[<div class="author">Author: PidanCoder</div>, <div class="author">Author: John</div>]
```
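A regular expression can also capture part of the matched text instead of only selecting the element. The sketch below assumes a trimmed copy of the sample markup and an illustrative pattern named author_pattern. It searches text nodes directly by passing only string= to find_all(), then reads the author name from a capture group.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the sample document (an assumption for this sketch)
html_doc = """
<div class="article"><div class="author">Author: PidanCoder</div></div>
<div class="comment"><div class="author">Author: John</div></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Illustrative pattern: capture the word that follows "Author:"
author_pattern = re.compile(r"Author:\s*(\w+)")

authors = []
# With only string= given, find_all() returns the matching text nodes
for node in soup.find_all(string=author_pattern):
    match = author_pattern.search(node)
    if match:
        authors.append(match.group(1))

print(authors)
```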
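3. Process data in a loop
When there are many elements to handle, it is usually best to process them one at a time in a loop. find_all() and select() return list-like result sets that can be traversed directly with a for loop. Note that both methods collect every match before returning, so the loop itself does not reduce memory use. Example: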
```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>pidancode.com</title>
</head>
<body>
<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="comment">
<h2 class="title">Comments</h2>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Loop over the results of find_all()
for div in soup.find_all('div'):
    print(div)

# Loop over the results of select()
for p in soup.select('p'):
    print(p)
```
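Output (find_all('div') also returns the nested author and content divs, so each of them is printed again after its parent):
```
<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
<div class="comment">
<h2 class="title">Comments</h2>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
<p>Great article, thanks!</p>
```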
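If memory is the concern, one option is to keep less of the document in the parse tree in the first place. The sketch below assumes a trimmed copy of the sample markup and a hypothetical helper named iter_paragraph_text. It uses SoupStrainer with the parse_only argument so that only p tags are kept, and wraps the loop in a generator so the caller can consume one text value at a time.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Trimmed copy of the sample document (an assumption for this sketch)
html_doc = """
<div class="article">
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="comment">
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
"""


def iter_paragraph_text(markup):
    """Hypothetical helper: yield the text of each <p>, one at a time."""
    # parse_only tells the parser to keep only <p> tags and their contents,
    # so the surrounding divs never enter the tree
    only_p = SoupStrainer("p")
    soup = BeautifulSoup(markup, "html.parser", parse_only=only_p)
    for p in soup.find_all("p"):
        yield p.get_text(strip=True)


paragraphs = iter_paragraph_text(html_doc)
first = next(paragraphs)   # pull a single item
rest = list(paragraphs)    # consume whatever remains

print(first)
print(rest)
```

The saving here comes from the smaller parse tree. The generator only changes how the caller consumes the results, since find_all() still builds its full list of matches internally.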
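4. Store data in dictionaries and lists
In a data analysis workflow, the filtered data usually needs to be stored in dictionaries or lists so that it can be processed further. Example: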
```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>pidancode.com</title>
</head>
<body>
<div class="article">
<h2 class="title"><a href="https://pidancode.com">pidancode.com</a></h2>
<div class="author">Author: PidanCoder</div>
<div class="content">
<p>BeautifulSoup is a python library...</p>
<p>It is often used for web scraping...</p>
</div>
</div>
<div class="comment">
<h2 class="title">Comments</h2>
<div class="author">Author: John</div>
<div class="content">
<p>Great article, thanks!</p>
</div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Store each top-level div as a dict and collect the dicts in a list.
# recursive=False limits the search to direct children of <body>, so the
# nested author and content divs do not produce entries of their own.
data_list = []
for div in soup.body.find_all('div', recursive=False):
    data = {}
    data['class'] = div['class'][0]
    # Join the text fragments with spaces and drop surrounding whitespace
    data['text'] = div.get_text(' ', strip=True)
    data_list.append(data)
print(data_list)

# Store the text of every p tag in a list
p_list = []
for p in soup.select('p'):
    p_list.append(p.get_text())
print(p_list)
```
Output:
```
[{'class': 'article', 'text': 'pidancode.com Author: PidanCoder BeautifulSoup is a python library... It is often used for web scraping...'}, {'class': 'comment', 'text': 'Comments Author: John Great article, thanks!'}]
['BeautifulSoup is a python library...', 'It is often used for web scraping...', 'Great article, thanks!']
```
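Once the records are in a list of dictionaries, they can be handed to other tools. The sketch below uses only the standard library and a small hand-written records list (sample data for illustration, shaped like the dictionaries built in this section) to show how such a list might be serialized to JSON and CSV.

```python
import csv
import io
import json

# Sample records for illustration, shaped like {'class': ..., 'text': ...}
records = [
    {"class": "article", "text": "BeautifulSoup is a python library..."},
    {"class": "comment", "text": "Great article, thanks!"},
]

# JSON: one string holding the whole list
json_text = json.dumps(records, ensure_ascii=False, indent=2)

# CSV: written to an in-memory buffer here; a file opened with
# newline="" could be used in the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["class", "text"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

csv.DictWriter quotes fields that contain commas, so text such as "Great article, thanks!" stays in a single column.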