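Python BeautifulSoup CSS Selectors
BeautifulSoup is a Python library for extracting data from HTML and XML documents. It supports CSS selectors through the select() and select_one() methods, which makes locating elements simpler and more concise.
The examples below show how to use CSS selectors with BeautifulSoup: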
- Finding elements by tag name:
from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Find elements by tag name; select_one() returns the first match
title = soup.select_one('title')
print(title.text)
h1 = soup.select_one('h1')
print(h1.text)
p = soup.select_one('p')
print(p.text)
Output:
pidancode - 皮蛋编程
Welcome to pidancode
Learn Python and web development
- Finding elements by class name:
from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
<a class="link" href="http://pidancode.com">pidancode.com</a>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Find elements by class name (prefix the name with a dot)
title = soup.select_one('.title')
print(title.text)
description = soup.select_one('.description')
print(description.text)
link = soup.select_one('.link')
print(link['href'])  # read an attribute with dictionary-style access
Output:
Welcome to pidancode
Learn Python and web development
http://pidancode.com
- Finding elements by id:
from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<h1 id="title">Welcome to pidancode</h1>
<p id="description">Learn Python and web development</p>
<a id="link" href="http://pidancode.com">pidancode.com</a>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Find elements by id (prefix the id with a hash sign)
title = soup.select_one('#title')
print(title.text)
description = soup.select_one('#description')
print(description.text)
link = soup.select_one('#link')
print(link['href'])
Output:
Welcome to pidancode
Learn Python and web development
http://pidancode.com
- Combining selectors:
from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<section>
<h1 class="title">Welcome to pidancode</h1>
<p class="description">Learn Python and web development</p>
</section>
<section>
<h2 class="title">Do more with Python</h2>
<a class="link" href="http://pidancode.com">pidancode.com</a>
</section>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Combine a positional pseudo-class with a descendant class selector
section1_title = soup.select_one('section:nth-of-type(1) .title')
print(section1_title.text)
section2_link = soup.select_one('section:nth-of-type(2) .link')
print(section2_link['href'])
Output:
Welcome to pidancode
http://pidancode.com
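Selectors can be combined in other ways as well. The following is a minimal sketch, not taken from the original examples, that contrasts the descendant combinator (a space) with the child combinator (>), and also shows a tag-plus-class selector and an attribute selector. The markup is a hypothetical variation of the document above in which the description paragraph is wrapped in an extra div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> is nested one level deeper than in the example above
html_doc = '''
<body>
<section>
<h1 class="title">Welcome to pidancode</h1>
<div><p class="description">Learn Python and web development</p></div>
</section>
<section>
<h2 class="title">Do more with Python</h2>
<a class="link" href="http://pidancode.com">pidancode.com</a>
</section>
</body>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Descendant combinator (space): matches .description at any depth inside a section
descendant = soup.select('section .description')

# Child combinator (>): matches only direct children of a section,
# so the <p> wrapped in a <div> is not matched
child = soup.select('section > .description')

# Tag name plus class: only <h2> elements that carry the "title" class
h2_title = soup.select_one('h2.title')

# Attribute selector: <a> elements whose href starts with "http://"
links = soup.select('a[href^="http://"]')

print(len(descendant), len(child))
print(h2_title.text)
print(links[0]['href'])
```

The child combinator is the stricter of the two, which can help when the same class name is reused at several nesting levels.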
- Iterating over matched elements:
from bs4 import BeautifulSoup

html_doc = '''
<html>
<head>
<title>pidancode - 皮蛋编程</title>
</head>
<body>
<ul class="list">
<li>Python</li>
<li>HTML</li>
<li>CSS</li>
</ul>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# select() returns every match, so loop over the results
for item in soup.select('.list li'):
    print(item.text)
Output:
Python
HTML
CSS
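Summary:
1. When you need a specific element, prefer id and class selectors over bare tag names: an id should be unique within a page and a class name usually narrows the match, whereas a tag name such as p typically matches many elements.
2. In combined selectors, :nth-of-type(n) selects the n-th element of a given tag type among its siblings, counting from 1, which is handy for picking one of several repeated blocks.
3. select() returns a list of all matching elements, so a for loop can process them one by one; select_one() returns only the first match.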
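The three points above can be illustrated with a short sketch. The markup and variable names here are hypothetical and chosen only for illustration. It also shows one detail worth checking in real code: select_one() returns None when nothing matches, so accessing .text on the result without a check would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html_doc = '''
<div>
<h1 class="title">Welcome to pidancode</h1>
<p>first</p>
<p id="description" class="text">second</p>
<p class="text">third</p>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# 1. An id targets one specific element, while the bare tag name p matches three
by_id = soup.select_one('#description')
all_paragraphs = soup.select('p')

# 2. p:nth-of-type(2) is the second <p> among its siblings;
#    the <h1> before it does not count because it is a different tag type
second_p = soup.select_one('p:nth-of-type(2)')

# 3. select() returns a list, so it works with loops and comprehensions
texts = [p.get_text(strip=True) for p in soup.select('p.text')]

# select_one() returns None when nothing matches, so guard before using the result
missing = soup.select_one('.no-such-class')
missing_text = missing.text if missing is not None else ''

print(by_id.text, len(all_paragraphs))
print(second_p.text)
print(texts)
```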