How do I get the opening and closing tags from an HTML string in Beautiful Soup?
Problem description
I am writing a Python script using Beautiful Soup, where I have to get an opening tag from a string containing some HTML code.
This is my string:
string = "<p>...</p>"
I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but can't seem to find a solution. Can anyone advise me on this?
Solution
There is a way to do this with BeautifulSoup and a simple regular expression:
Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g., soupInnerParagraph. (Moving the contents does not delete them.)
soupParagraph then holds only the opening and closing tags.
Convert soupParagraph to HTML text and store it in a string variable.
To get the opening tag, use a regular expression to remove the closing tag from that string variable.
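The steps above can be sketched for the <p> case roughly as follows. This is a minimal sketch, not the answer's tested code: the helper name split_tags, its name parameter, and the sample HTML are assumptions made for illustration, and the closing tag is simply rebuilt from the element name.

```python
import re
from bs4 import BeautifulSoup


def split_tags(html, name="p"):
    """Return (opening_tag, closing_tag) for the first <name> element in html.

    split_tags is a hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(name)
    if element is None:
        raise ValueError(f"no <{name}> element found")

    # Move the children into a separate soup so only the tags remain.
    # Iterating over a copy is needed because .append() removes each
    # child from element.contents as it goes.
    inner = BeautifulSoup("", "html.parser")
    for child in list(element.contents):
        inner.append(child)

    # Serialize the now-empty element, then strip the closing tag.
    tags_only = element.decode(formatter="html")
    closing_pattern = rf"\s*</{re.escape(name)}\s*>\s*$"
    opening_tag, count = re.subn(closing_pattern, "", tags_only)
    if count != 1:
        raise ValueError(f"closing </{name}> tag not found")

    closing_tag = f"</{name}>"
    return opening_tag, closing_tag


opening_tag, closing_tag = split_tags('<p class="intro">Hello <b>world</b></p>')
```

Raising ValueError instead of printing an error is a design choice of this sketch, so that a caller cannot silently continue with a wrong opening tag.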
In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.
A closing tag is simple: it has no attributes defined for it, and a comment is not allowed within it.
Can I have attributes on closing tags?
HTML comment inside an element's opening tag
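Because a closing tag carries nothing beyond the element name, one possible shortcut for the closing_tag half of the question is to rebuild it from the parsed element's name rather than extract it from text. This is an assumption-based sketch with a made-up sample string, not something the answer itself proposes.

```python
from bs4 import BeautifulSoup

# Sample input is hypothetical; any element found by the parser works the same way.
soup = BeautifulSoup('<p id="x">text</p>', "html.parser")
element = soup.find("p")

# A closing tag has no attributes, so the element name is all that is needed.
closing_tag = f"</{element.name}>"
```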
This code gets the opening tag from a <body...> ... </body> section. The code has been tested.
import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for i in range(0, len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml, so the
    # next element to move is always at index 0.
    bodyInnerHtml.append(bodyContentsList[0])
# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*</body\s*>\s*$)"
substitution = ""
# count and flags are passed by keyword; passing them positionally is deprecated in recent Python versions
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR. The expected HTML </body> tag was not found.")