An Introduction to XPath Operators in Python

2023-04-17 00:00:00 python operators introduction

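XPath is a query language for locating information in XML documents. It can select any kind of node in a document, including elements, attributes, text, and namespace nodes.

XPath provides a number of operators for searching and filtering XML elements. The most commonly used ones are described below: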

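/ (slash)

The slash is one of the most frequently used XPath operators. It separates the steps of a path, and each step selects direct children of the nodes matched by the previous step. For example, to find the h1 heading inside the element whose id is "pidancode", you can use the following XPath: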

//*[@id="pidancode"]/div/div[2]/h1

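This means: find the element whose id is "pidancode", take its direct child div elements, then the second div child of each of those (div[2]), and finally the h1 children of that div.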
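As a minimal sketch of how the child steps behave, the query can be run with lxml against a small document. The markup below is invented for illustration; the real structure of the page is not given in this article.

```python
from lxml import etree

# Hypothetical markup, made up for this example only.
doc = etree.fromstring("""
<html>
  <body>
    <div id="pidancode">
      <div>
        <div><h1>first inner div</h1></div>
        <div><h1>second inner div</h1></div>
      </div>
      <section>
        <div><div/><div><h1>inside a section</h1></div></div>
      </section>
    </div>
  </body>
</html>
""")

# Each "/" moves exactly one level down; div[2] keeps the second div child.
headings = doc.xpath('//*[@id="pidancode"]/div/div[2]/h1/text()')
print(headings)  # ['second inner div']
```

The h1 under section is not returned, because section is not a div and the step after the id test only accepts direct div children.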

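// (double slash)

The double slash resembles the single slash, but instead of stopping at direct children it matches descendants at any depth. For example, to find every h1 element inside the element whose id is "pidancode", you can use the following XPath: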

//*[@id="pidancode"]//h1

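This returns all h1 elements below the element with id "pidancode", whether they are its direct children or deeper descendants (children of children, and so on).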
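The difference between the two operators is easiest to see side by side. The following is a small sketch with lxml; the sample markup is an assumption made for the example.

```python
from lxml import etree

# Hypothetical markup, made up for this example only.
doc = etree.fromstring("""
<body>
  <h1>outside</h1>
  <div id="pidancode">
    <h1>direct child</h1>
    <div><section><h1>nested deeper</h1></section></div>
  </div>
</body>
""")

# "/" only looks one level below the matched element.
direct_only = doc.xpath('//*[@id="pidancode"]/h1/text()')
# "//" looks at every level below the matched element.
any_depth = doc.xpath('//*[@id="pidancode"]//h1/text()')

print(direct_only)  # ['direct child']
print(any_depth)    # ['direct child', 'nested deeper']
```

Neither query returns the h1 with the text "outside", since it is not a descendant of the element with id "pidancode".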

@ (at sign)

The at sign refers to an attribute of an element. For example, to find the div elements on pidancode.com that contain the text "皮蛋编程", you can use the following XPath:

//*[@id="pidancode"]/div[@class="header"]/div[contains(text(), "皮蛋编程")]

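Here @id and @class test attribute values. The query returns the div elements that contain the text "皮蛋编程" and are direct children of a div whose class attribute equals "header", which in turn is a direct child of the element with id "pidancode".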
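A short lxml sketch of both uses of @: as a test inside a predicate, and as the final step of a path to read attribute values. The markup is invented for the example.

```python
from lxml import etree

# Hypothetical markup, made up for this example only.
doc = etree.fromstring("""
<div id="pidancode">
  <div class="header">
    <div>皮蛋编程 tutorials</div>
    <div>something else</div>
  </div>
  <div class="footer">
    <div>皮蛋编程 footer</div>
  </div>
</div>
""")

# @class inside a predicate filters elements by attribute value.
matches = doc.xpath(
    '//*[@id="pidancode"]/div[@class="header"]/div[contains(text(), "皮蛋编程")]'
)
texts = [el.text for el in matches]
print(texts)    # ['皮蛋编程 tutorials']

# @class as the last step returns the attribute values themselves.
classes = doc.xpath('//*[@id="pidancode"]/div/@class')
print(classes)  # ['header', 'footer']
```

The div in the footer also contains "皮蛋编程", but it is excluded because its parent's class is not "header".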

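| (union operator)

The union operator combines the results of several XPath expressions into a single node set. The expressions are separated by |, not by commas. For example, to find the h1 and h2 elements of the pidancode.com site, you can use the following XPath: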

//*[@id="pidancode"]//h1 | //*[@id="pidancode"]//h2

This returns every h1 element and every h2 element below the element with id "pidancode".
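The sketch below runs the union query with lxml on invented markup. It also checks that a node matched by both sides of | appears only once in the result, which follows from the result being a node set.

```python
from lxml import etree

# Hypothetical markup, made up for this example only.
doc = etree.fromstring("""
<div id="pidancode">
  <h1>Title</h1>
  <p>intro</p>
  <h2>Section A</h2>
  <div><h2>Section B</h2></div>
  <h3>ignored</h3>
</div>
""")

headings = doc.xpath('//*[@id="pidancode"]//h1 | //*[@id="pidancode"]//h2')
tags = sorted(el.tag for el in headings)
titles = {el.text for el in headings}
print(tags)  # ['h1', 'h2', 'h2']

# Both sides match the same two h2 elements; the union holds each once.
both = doc.xpath('//h2 | //div/h2')
print(len(both))  # 2
```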

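[] (predicate)

A predicate is a condition in square brackets that filters the nodes selected by a step. For example, to get the titles of all links in the right-hand sidebar of pidancode.com, you can use the following XPath: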

//*[@id="pidancode"]//div[@id="secondary"]//a/text()

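This query returns the text of every a element in the sidebar; the predicate [@id="secondary"] is what restricts the search to the sidebar div. To keep only the titles of links that point to ".com" sites, add a second predicate: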

//*[@id="pidancode"]//div[@id="secondary"]//a[contains(@href, ".com")]/text()

The contains function used here tests whether the value of the href attribute contains the substring ".com".
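The two sidebar queries can be compared on a small document. This is a sketch with lxml; the links and URLs are assumptions invented for the example, and the last query adds a positional predicate to show that predicates can also select by position.

```python
from lxml import etree

# Hypothetical sidebar markup, made up for this example only.
doc = etree.fromstring("""
<div id="pidancode">
  <div id="secondary">
    <a href="https://pidancode.com/python">Python notes</a>
    <a href="https://example.org/docs">Docs</a>
    <a href="https://example.com/blog">Blog</a>
  </div>
</div>
""")

base = '//*[@id="pidancode"]//div[@id="secondary"]//a'

# No predicate on the last step: every link title in the sidebar.
all_titles = doc.xpath(base + '/text()')
# Predicate on the last step: only links whose href contains ".com".
com_titles = doc.xpath(base + '[contains(@href, ".com")]/text()')
# Positional predicate: only the first link of the whole result.
first_title = doc.xpath('(' + base + ')[1]/text()')

print(all_titles)   # ['Python notes', 'Docs', 'Blog']
print(com_titles)   # ['Python notes', 'Blog']
print(first_title)  # ['Python notes']
```

Because contains is a plain substring test, it would also match a URL such as "https://a.community.org". A stricter check on the host name would need additional string handling.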

Those are the commonly used XPath operators. To demonstrate them, here is a short Python program that uses the lxml library to parse an XML document and run XPath queries:

from lxml import etree

xml = """
<library>
    <book>
        <title>The Hobbit</title>
        <author>J.R.R. Tolkien</author>
        <year>1937</year>
    </book>
    <book>
        <title>The Lord of the Rings</title>
        <author>J.R.R. Tolkien</author>
        <year>1954</year>
    </book>
</library>
"""

root = etree.fromstring(xml)

# Select the titles of all books
titles = root.xpath("//book/title/text()")
print(titles)

# Select all books whose author is J.R.R. Tolkien
books = root.xpath("//book[author='J.R.R. Tolkien']")
for book in books:
    title = book.xpath("title/text()")[0]
    year = book.xpath("year/text()")[0]
    print(f"{title} ({year})")

Output:

['The Hobbit', 'The Lord of the Rings']
The Hobbit (1937)
The Lord of the Rings (1954)
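The loop above indexes [0] into each result list, which raises IndexError if a book has no title or year element. One possible variant, sketched here under the assumption that missing fields should be tolerated, uses the XPath string() function, which yields an empty string when the node is absent. The sample data with a missing year is invented for the example.

```python
from lxml import etree

# Hypothetical data: the second book deliberately has no <year>.
root = etree.fromstring("""
<library>
    <book><title>The Hobbit</title><year>1937</year></book>
    <book><title>Untitled draft</title></book>
</library>
""")

lines = []
for book in root.xpath("//book"):
    # string(...) returns "" for a missing node instead of an empty list.
    title = book.xpath("string(title)")
    year = book.xpath("string(year)") or "year unknown"
    lines.append(f"{title} ({year})")

print(lines)  # ['The Hobbit (1937)', 'Untitled draft (year unknown)']
```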
