How do I get the opening and closing tags from an HTML string in Beautiful Soup?

2022-01-18 00:00:00 python beautifulsoup tags

Problem description

I am writing a Python script using Beautiful Soup, in which I have to get an opening tag from a string containing some HTML code.

Here is my string:

string = ...

I want to get the opening tag in a variable called opening_tag and the closing tag in a variable called closing_tag. I have searched the documentation but cannot find a solution. Can anyone advise me on this?

Solution

There is a way to do this with BeautifulSoup and a simple regular expression:

1. Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.

2. Move the contents between the opening and closing tags to another BeautifulSoup object, e.g., soupInnerParagraph. (Because the contents are moved, they are not deleted.) After that, soupParagraph holds only the opening and closing tags.

3. Convert soupParagraph to HTML text and store the result in a string variable.

4. To get the opening tag, use a regular expression to remove the closing tag from that string variable.
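The steps above could be sketched roughly as follows for a paragraph. The sample HTML string and the variable name tagsOnly are assumptions made for illustration, since the string from the question is not shown. Taking the closing tag as whatever remains after the opening tag is also an addition, not part of the steps themselves.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample input; the string from the question is not shown.
html = '<p class="intro" id="first">Hello <b>world</b></p>'

# Step 1: parse the string and take the <p> element.
soupParagraph = BeautifulSoup(html, "html.parser").p

# Step 2: move the children into a separate, empty soup.
# Iterate over a copy, because append() removes each child from soupParagraph.
soupInnerParagraph = BeautifulSoup("", "html.parser")
for child in list(soupParagraph.contents):
    soupInnerParagraph.append(child)

# Step 3: the paragraph is now empty, so this is just "<p ...></p>".
tagsOnly = soupParagraph.decode()

# Step 4: strip the closing tag to leave the opening tag.
opening_tag = re.sub(r"\s*</p\s*>\s*$", "", tagsOnly)

# Assumption: whatever follows the opening tag is the closing tag.
closing_tag = tagsOnly[len(opening_tag):].strip()

print(opening_tag)
print(closing_tag)
```

Note that the opening tag comes back as BeautifulSoup serializes it, which may differ from the source text in details such as quoting or whitespace.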

In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.

A closing tag is simple. It has no attributes defined for it, and a comment is not allowed within it.

Can I have attributes on closing tags?

HTML comments inside the opening tag of an element
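Because a closing tag carries no attributes or comments, a short pattern is enough to locate it at the end of the serialized element. The following is a minimal sketch using only the standard library; the helper name split_empty_element is hypothetical, and it assumes the input is the text of an element whose contents have already been moved out.

```python
import re


def split_empty_element(markup: str, tag_name: str):
    """Split text such as '<p class="a"></p>' into (opening_tag, closing_tag).

    Hypothetical helper: assumes the element has no contents left, so the
    only closing tag of interest is the one at the very end of the text.
    """
    pattern = re.compile(
        r"\s*(</" + re.escape(tag_name) + r"\s*>)\s*$",
        re.IGNORECASE,
    )
    match = pattern.search(markup)
    if match is None:
        raise ValueError(f"closing </{tag_name}> tag not found")
    # Everything before the match is the opening tag.
    return markup[:match.start()], match.group(1)


opening, closing = split_empty_element('<p class="a"></p>', "p")
print(opening)
print(closing)
```

Raising an exception when the closing tag is missing is a design choice for this sketch; printing an error message would work just as well.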

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.

import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for i in range(0, len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so index 0 is always the next element still left in body
    bodyInnerHtml.append(bodyContentsList[0])

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')

# Extract the opening tag, by removing the closing tag
regex = r"(\s*</body\s*>\s*$)"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR. The expected HTML </body> tag was not found.")
