How do I get the opening and closing tags from an HTML string in Beautiful Soup?

2022-01-18 00:00:00 python beautifulsoup tags

Problem description

I am writing a Python script using Beautiful Soup, in which I have to get an opening tag from a string containing some HTML code.

This is my string:

string = "<p>...</p>"

I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but cannot find a solution. Can anyone advise me on that?


Solution

There is a way to do this with BeautifulSoup and a simple regular expression:

  • Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
  • Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g., soupInnerParagraph. (Moving the contents does not delete them.)
  • soupParagraph will then hold only the opening and closing tags.
  • Convert soupParagraph to HTML text format and store it in a string variable.
  • To get the opening tag, use a regular expression to remove the closing tag from the string variable.
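The steps above can be sketched for the <p> case from the question. This is a minimal sketch, not a definitive implementation: the sample HTML string and its class attribute are made up for illustration, and the variable names follow the ones used in the steps.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample input; the question elides the real paragraph content.
html = '<p class="intro">Hello <b>world</b></p>'
soupParagraph = BeautifulSoup(html, "html.parser").p

# Move the children out of <p> into a separate, initially empty object.
# .append re-parents each child, so soupParagraph.contents shrinks each pass.
soupInnerParagraph = BeautifulSoup("", "html.parser")
while soupParagraph.contents:
    soupInnerParagraph.append(soupParagraph.contents[0])

# Only the opening and closing tags remain in soupParagraph.
tags_only = soupParagraph.decode()

# Remove the closing tag to leave the opening tag; the rest is the closing tag.
opening_tag = re.sub(r"\s*</p\s*>\s*$", "", tags_only)
closing_tag = tags_only[len(opening_tag):]
```

After this runs, soupInnerParagraph still holds the moved contents, so nothing from the original paragraph is lost.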

In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.

A closing tag is simple. It does not have attributes defined for it, and a comment is not allowed within it.

Can I have attributes on closing tags?

HTML comments inside the opening tag of an element
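Because a closing tag carries no attributes, it could also be built directly from the tag's name rather than extracted from text. The snippet below is a small sketch of that idea; the sample HTML is made up for illustration, and .name is the standard attribute Beautiful Soup exposes on a Tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input for illustration.
tag = BeautifulSoup('<p id="x">text</p>', "html.parser").p

# A closing tag is just the element name wrapped in "</" and ">".
closing_tag = f"</{tag.name}>"
```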

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.

import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for i in range(0, len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so the next element to move is always at index 0.
    bodyInnerHtml.append(bodyContentsList[0])

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*</body\s*>\s*$)"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR.  The expected HTML </body> tag was not found.")
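If the original tree must stay intact, one possible variant works on a detached copy and avoids the regular expression. This is a different technique from the answer's code, offered only as a sketch: split_tags is a hypothetical helper name, the sample HTML is made up, and the approach assumes the serialized tag ends with a plain </name> closing tag.

```python
import copy
from bs4 import BeautifulSoup

def split_tags(tag):
    """Return (opening_tag, closing_tag) for a bs4 Tag without modifying it."""
    shell = copy.copy(tag)  # detached copy; the original tree is untouched
    shell.clear()           # drop all children, keeping name and attributes
    text = shell.decode()
    closing = f"</{tag.name}>"
    if text.endswith(closing):
        return text[:-len(closing)], closing
    # Void elements such as <br/> serialize without a closing tag.
    return text, ""

# Hypothetical sample input for illustration.
soup = BeautifulSoup('<body class="main" id="top"><p>Hi</p></body>', "html.parser")
opening_tag, closing_tag = split_tags(soup.body)
```

Copying costs extra memory for large elements, so moving the contents as in the code above may be preferable when the original tree is no longer needed.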
