BeautifulSoup: counting tags without parsing deep into them
Problem description
I thought about the following while writing an answer to this question.
Suppose I have a deeply nested XML file like this (but much more nested and much longer):
<section name="1">
<subsection name="foo">
<subsubsection name="bar">
<deeper name="hey">
<much_deeper name="yo">
<li>Some content</li>
</much_deeper>
</deeper>
</subsubsection>
</subsection>
</section>
<section name="2">
... and so forth
</section>
The problem with

len(soup.find_all("section"))

is that while doing find_all("section"), BeautifulSoup keeps searching deep into a tag that I know won't contain any other section tag.
So, two questions:
- Is there a way to make BeautifulSoup not search recursively into an already found tag?
- If the answer to 1 is yes, will it be more efficient, or is it the same internal process?
Solution
BeautifulSoup cannot give you just a count of the tags it finds.
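To see why, note that find_all always materializes a full list of Tag objects; counting means taking len() of that list. A minimal sketch (the sample markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b></b></a><a></a>", "html.parser")

# find_all builds and returns the complete list of matching Tag
# objects; there is no count-only API, so len() over that list is
# the idiomatic way to count matches.
tags = soup.find_all("a")
print(type(tags[0]).__name__)  # -> Tag
print(len(tags))               # -> 2
```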
What you can improve, though, is to stop BeautifulSoup from searching for sections inside other sections by passing recursive=False:
len(soup.find_all("section", recursive=False))
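As a runnable sketch of this, the snippet below wraps the sections in a hypothetical <doc> root element (invented for illustration) and searches only its direct children:

```python
from bs4 import BeautifulSoup

xml = """
<doc>
  <section name="1">
    <subsection name="foo"><li>Some content</li></subsection>
  </section>
  <section name="2">
    <subsection name="bar"></subsection>
  </section>
</doc>
"""

soup = BeautifulSoup(xml, "html.parser")
# recursive=False restricts find_all to the direct children of the
# element it is called on, so nested tags are never descended into.
sections = soup.find("doc").find_all("section", recursive=False)
print(len(sections))  # -> 2
```

Note that recursive=False applies to the element you call find_all on, so you need to call it on the parent of the sections (here the <doc> wrapper), not on the soup object itself.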
Aside from that improvement, lxml would do the job faster:
tree.xpath('count(//section)')
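A minimal sketch of the lxml approach, again using an invented <doc> wrapper for the sample markup:

```python
from lxml import etree

xml = b"""
<doc>
  <section name="1">
    <subsection name="foo"><li>Some content</li></subsection>
  </section>
  <section name="2"></section>
</doc>
"""

tree = etree.fromstring(xml)
# XPath count() is evaluated inside libxml2's C code, so no Python
# objects are built just to count the matches.
count = tree.xpath("count(//section)")
print(count)  # -> 2.0 (XPath count() returns a float)
```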