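Python BeautifulSoup Data Extraction Tips
- Getting a tag's content
Call find() on the soup object from the bs4 library to locate a tag, then read its .string attribute to get the text. For example, to get the content of a <p> tag: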
from bs4 import BeautifulSoup

html = '<p>pidancode.com is a website about Python programming.</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')  # first <p> tag in the document
content = p.string  # text of the tag's single child string
print(content)
Output:
pidancode.com is a website about Python programming.
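One caveat worth illustrating: .string only returns text when the tag has exactly one child. The sketch below uses a hypothetical variant of the example HTML with a nested <b> tag (an assumption, not part of the original page) to show .string returning None and get_text() as a fallback that joins all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the example: the <p> now contains a nested <b> tag.
html = '<p>pidancode.com is a <b>website</b> about Python programming.</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

string_value = p.string    # None, because <p> has more than one child
text_value = p.get_text()  # concatenates the text of all descendants

print(string_value)
print(text_value)
```

When the markup may contain nested tags, get_text() is usually the safer choice.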
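- Getting a tag's attribute value
A tag returned by find() supports dictionary-style access to its attributes, and its attrs attribute holds all of them as a dict. For example, to get the href attribute of an <a> tag: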
from bs4 import BeautifulSoup

html = '<a href="https://www.pidancode.com">pidancode.com</a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')
href = a['href']  # raises KeyError if the attribute is missing
print(href)
Output:
https://www.pidancode.com
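Indexing with a['href'] raises KeyError when the attribute is absent. As a minimal sketch, the following shows the attrs dict alongside get() and has_attr(), which tolerate missing attributes. The title and target attribute names are only illustrative and do not appear in the original example.

```python
from bs4 import BeautifulSoup

html = '<a href="https://www.pidancode.com">pidancode.com</a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

all_attrs = a.attrs                # every attribute as a dict
title = a.get('title')             # None, because the tag has no title attribute
target = a.get('target', '_self')  # fallback value when the attribute is absent
has_href = a.has_attr('href')      # True

print(all_attrs)
print(title, target, has_href)
```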
- Getting the content of multiple tags
Use find_all() on the soup object to get every matching tag. For example, to get the content of all <p> tags:
from bs4 import BeautifulSoup

html = '''
<html>
<body>
<p>pidancode.com is a website about Python programming.</p>
<p>皮蛋编程是一个Python编程网站。</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
ps = soup.find_all('p')  # list of every <p> tag in the document
for p in ps:
    content = p.string
    print(content)
Output:
pidancode.com is a website about Python programming.
皮蛋编程是一个Python编程网站。
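The loop above can also be written as a list comprehension, and find_all() accepts a limit argument to cap the number of matches. This is a small sketch of both, reusing the same sample HTML.

```python
from bs4 import BeautifulSoup

html = '''
<html>
<body>
<p>pidancode.com is a website about Python programming.</p>
<p>皮蛋编程是一个Python编程网站。</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

# Collect the stripped text of every <p> tag in one pass.
texts = [p.get_text(strip=True) for p in soup.find_all('p')]

# Stop searching after the first match.
first_only = soup.find_all('p', limit=1)

print(texts)
print(len(first_only))
```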
- Getting the parent tag
Use the parent attribute of a tag to get its parent tag. For example, to get the parent of a <p> tag:
from bs4 import BeautifulSoup

html = '<div><p>pidancode.com is a website about Python programming.</p></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')
parent = p.parent  # the enclosing <div> tag
print(parent)
Output:
<div><p>pidancode.com is a website about Python programming.</p></div>
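When the wanted ancestor is more than one level up, find_parent() and the parents iterator may be more convenient than chaining parent. The sketch below assumes a hypothetical deeper nesting (a <section> inside a <div id="outer">) that is not in the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> sits two levels below the <div>.
html = '<div id="outer"><section><p>pidancode.com</p></section></div>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

direct_parent = p.parent.name     # name of the immediate parent tag
outer = p.find_parent('div')      # nearest ancestor that is a <div>
ancestor_names = [tag.name for tag in p.parents]  # walks up to the document root

print(direct_parent)
print(outer['id'])
print(ancestor_names[:2])
```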
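- Getting child tags
Use the contents attribute of a tag to get a list of its direct children. The list also includes text nodes, such as the whitespace between tags, so the example keeps only the children that are Tag objects. For example, to get all child tags of a <div> tag:
from bs4 import BeautifulSoup, Tag

html = '''<div>
<a href="https://www.pidancode.com">pidancode.com</a>
<p>皮蛋编程是一个Python编程网站。</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')
children = div.contents  # direct children, including whitespace text nodes
for child in children:
    if isinstance(child, Tag):  # skip text nodes
        print(child)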
Output:
<a href="https://www.pidancode.com">pidancode.com</a>
<p>皮蛋编程是一个Python编程网站。</p>
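An alternative way to reach only the direct child tags, sketched here under the same sample HTML, is find_all() with recursive=False. It skips text nodes and does not descend into grandchildren.

```python
from bs4 import BeautifulSoup

html = '''<div>
<a href="https://www.pidancode.com">pidancode.com</a>
<p>皮蛋编程是一个Python编程网站。</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# True matches any tag; recursive=False restricts the search to direct children.
direct_tags = div.find_all(True, recursive=False)
child_names = [tag.name for tag in direct_tags]

# contents counts the whitespace text nodes between the tags as well.
node_count = len(div.contents)

print(child_names)
print(node_count)
```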
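- Getting sibling tags
Use the next_sibling and previous_sibling attributes to get the nodes adjacent to a tag. Whitespace between tags counts as a text node, so the example calls next_sibling twice to step over it. For example, to get the content of the <p> tag that follows the first <p> tag: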
from bs4 import BeautifulSoup

html = '''<div>
<p>pidancode.com is a website about Python programming.</p>
<p>皮蛋编程是一个Python编程网站。</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
p1 = soup.find('p')
# The first next_sibling is the whitespace text node; the second is the next <p>.
p2 = p1.next_sibling.next_sibling
content = p2.string
print(content)
Output:
皮蛋编程是一个Python编程网站。
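Chaining next_sibling depends on exactly how much whitespace sits between the tags. A possibly more robust sketch uses find_next_sibling() and find_previous_sibling(), which skip text nodes and return the nearest matching tag.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<div>
<p>pidancode.com is a website about Python programming.</p>
<p>皮蛋编程是一个Python编程网站。</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
p1 = soup.find('p')

# The immediate sibling is a text node, not a tag.
is_text_node = isinstance(p1.next_sibling, NavigableString)

p2 = p1.find_next_sibling('p')        # nearest following <p>, text nodes skipped
back = p2.find_previous_sibling('p')  # nearest preceding <p>, which is p1 again

print(is_text_node)
print(p2.string)
print(back is p1)
```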
- Selecting with CSS selectors
Use the select() method to find tags with a CSS selector. For example, to get the href attribute of every <a> tag:
from bs4 import BeautifulSoup

html = '''<div>
<a href="https://www.pidancode.com">pidancode.com</a>
<a href="https://www.baidu.com">百度</a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# "as" is a reserved keyword in Python, so the result list is named links.
links = soup.select('a')
for a in links:
    href = a['href']
    print(href)
Output:
https://www.pidancode.com
https://www.baidu.com
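Selectors can be more specific than a bare tag name. The sketch below adds a hypothetical third <a> tag without an href (an assumption for illustration) and uses an attribute selector so that a['href'] cannot raise KeyError. select_one() returns only the first match.

```python
from bs4 import BeautifulSoup

# The third <a> tag is a hypothetical addition with no href attribute.
html = '''<div>
<a href="https://www.pidancode.com">pidancode.com</a>
<a href="https://www.baidu.com">百度</a>
<a>no link here</a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# Match only <a> tags that are direct children of <div> and have an href.
hrefs = [a['href'] for a in soup.select('div > a[href]')]

# select_one() returns the first matching tag, or None when nothing matches.
first_text = soup.select_one('a').get_text()

print(hrefs)
print(first_text)
```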