Python BeautifulSoup CSS Selectors

2023-04-17 00:00:00 python beautifulsoup selectors

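BeautifulSoup is a powerful Python library for extracting data from HTML and XML. It supports CSS selectors through its select() and select_one() methods, which makes finding and working with elements simpler and more efficient.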

The examples below show how to use CSS selectors with BeautifulSoup, with a short explanation of each:

Finding elements by tag name:

from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Find elements by tag name
title = soup.select_one('title')
print(title.text)

h1 = soup.select_one('h1')
print(h1.text)

p = soup.select_one('p')
print(p.text)

Output:

pidancode - 皮蛋编程
Welcome to pidancode
Learn Python and web development
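
The examples above assume every selector finds a match. As a minimal sketch of what happens otherwise: select_one() returns None when nothing matches, so calling .text on the result would raise an AttributeError, while select() returns a list that may be empty. The shortened HTML string here is only an illustration.

```python
from bs4 import BeautifulSoup

html_doc = '<html><body><h1 class="title">Welcome to pidancode</h1></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')

# select_one returns None when no element matches the selector
missing = soup.select_one('h2')
missing_text = missing.get_text() if missing is not None else ''

# select returns a list of all matches, which may be empty
all_h1 = soup.select('h1')
all_h2 = soup.select('h2')

print(repr(missing_text), len(all_h1), len(all_h2))
```

Checking for None before reading .text or an attribute keeps a scraper from crashing when a page is missing an expected element.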

Finding elements by class name:

from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
<a class="link" href="http://pidancode.com">pidancode.com</a>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Find elements by class name
title = soup.select_one('.title')
print(title.text)

description = soup.select_one('.description')
print(description.text)

link = soup.select_one('.link')
print(link['href'])

Output:

Welcome to pidancode
Learn Python and web development
http://pidancode.com
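
Class selectors can also be combined with a tag name or with further classes to narrow a match. The sketch below uses a made-up fragment (the "intro" class and the extra div are assumptions added for illustration) to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two elements share the "description" class
html_doc = '''
<p class="description intro">Learn Python and web development</p>
<div class="description">Another description</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# .description matches every element carrying that class
any_description = soup.select('.description')

# p.description matches only <p> elements with that class
p_description = soup.select('p.description')

# .description.intro matches elements that carry both classes
both_classes = soup.select('.description.intro')

print(len(any_description), len(p_description), len(both_classes))
```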

Finding elements by id:

from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<h1 id="title">Welcome to pidancode</h1>
<p id="description">Learn Python and web development</p>
<a id="link" href="http://pidancode.com">pidancode.com</a>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Find elements by id
title = soup.select_one('#title')
print(title.text)

description = soup.select_one('#description')
print(description.text)

link = soup.select_one('#link')
print(link['href'])

Output:

Welcome to pidancode
Learn Python and web development
http://pidancode.com
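
For comparison, an id lookup can also be written with BeautifulSoup's find() method. The following sketch, using a shortened version of the document above, suggests that the two approaches reach the same element; which one to use is largely a matter of style.

```python
from bs4 import BeautifulSoup

html_doc = '<body><a id="link" href="http://pidancode.com">pidancode.com</a></body>'
soup = BeautifulSoup(html_doc, 'html.parser')

# CSS selector lookup by id
by_selector = soup.select_one('#link')

# The same lookup with the keyword-argument form of find()
by_find = soup.find(id='link')

# A tag name can be prepended to require a specific element type
by_tag_and_id = soup.select_one('a#link')

print(by_selector['href'], by_find['href'], by_tag_and_id['href'])
```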

Combining selectors:

from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<section>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
</section>
<section>
<h2 class="title">Do more with Python</h2>
<a class="link" href="http://pidancode.com">pidancode.com</a>
</section>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Combine selectors
section1_title = soup.select_one('section:nth-of-type(1) .title')
print(section1_title.text)

section2_link = soup.select_one('section:nth-of-type(2) .link')
print(section2_link['href'])

Output:

Welcome to pidancode
http://pidancode.com
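
The example above relies on the descendant combinator (a space between selectors). As a rough sketch on the same two-section markup, the child (>), adjacent sibling (+), and general sibling (~) combinators can be used in a similar way.

```python
from bs4 import BeautifulSoup

html_doc = '''
<body>
<section>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
</section>
<section>
<h2 class="title">Do more with Python</h2>
<a class="link" href="http://pidancode.com">pidancode.com</a>
</section>
</body>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Child combinator: .title elements that are direct children of a <section>
direct_titles = [t.get_text() for t in soup.select('section > .title')]

# Adjacent sibling combinator: the <p> immediately following an <h1>
after_h1 = soup.select_one('h1 + p').get_text()

# General sibling combinator: any <a> that follows an <h2> under the same parent
after_h2 = soup.select_one('h2 ~ a')['href']

print(direct_titles)
print(after_h1)
print(after_h2)
```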

Iterating over matches:

from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<ul class="list">
<li>Python</li>
<li>HTML</li>
<li>CSS</li>
</ul>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Iterate over every matching element
for item in soup.select('.list li'):
    print(item.text)

Output:

Python
HTML
CSS
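
Iteration is often used to collect attributes as well as text. The sketch below assumes a variant of the list in which some items contain links; the href paths are hypothetical and exist only for illustration. The attribute selector a[href] skips anchors that have no href, so indexing with ['href'] is safe inside the loop.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the list above; the href values are made up
html_doc = '''
<ul class="list">
<li><a href="/python">Python</a></li>
<li><a href="/html">HTML</a></li>
<li><a>CSS</a></li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Collect (text, href) pairs for anchors inside .list that have an href attribute
links = [(a.get_text(), a['href']) for a in soup.select('.list a[href]')]

for text, href in links:
    print(text, href)
```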

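Summary:
1. Where possible, prefer class and id selectors over bare tag names: an id should be unique within a page and a class usually narrows the match to a few elements, whereas a tag name typically matches many.
2. When combining selectors, nth-of-type picks the n-th element of a given type among its siblings, which is useful for targeting one of several repeated blocks.
3. To process several matches, loop over the list returned by select() with a for loop.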
