Extracting a Web Page Title with Python Regular Expressions

2023-04-03 00:00:00 Web page title extraction

In HTML, a page's title is normally contained in the <title> tag. We can use Python's regular expressions to match the <title> tag and extract its text, which gives us the page title.

Here is a simple example that extracts the title:

```python
import re

# Sample HTML document to search
html = """
<html>
<head>
    <title>pidancode.com - 皮蛋编程</title>
</head>
<body>
    <h1>Hello, World!</h1>
    Welcome to pidancode.com.
</body>
</html>
"""

# Non-greedy group captures the text between <title> and </title>
pattern = r'<title>(.*?)</title>'
title = re.search(pattern, html)
if title:
    print(title.group(1))
```

In the code above, we first define a regular expression pattern, pattern, that matches the <title> tag and the text it contains. The non-greedy group (.*?) captures as little text as possible, so the match stops at the first closing </title>. We then call re.search() to find the first match in the HTML, and use group(1) to get the text captured by the first pair of parentheses.

Running the code prints:

```
pidancode.com - 皮蛋编程
```

In practice, we can also use re.findall(), which returns a list of every captured title string in the HTML. Keep in mind that pages differ in their HTML structure. The pattern above will not match a tag written in uppercase, a tag that carries attributes, or a title that spans several lines, because . does not match newlines by default. Adjust the pattern to the pages you are working with.
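The two closing points, collecting every match with re.findall() and adapting the pattern to different HTML structures, can be sketched as follows. This is a minimal sketch, not a definitive implementation: the names TITLE_RE, extract_titles and extract_title are hypothetical helpers introduced here, and the choice of flags assumes the variations you need to handle are tag case, tag attributes, and titles that span several lines.

```python
import re

# Hypothetical pattern (not from the original article):
#   <title\b[^>]*>  allows attributes on the opening tag
#   re.IGNORECASE   also matches <TITLE> or <Title>
#   re.DOTALL       lets . match newlines, for titles spanning several lines
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def extract_titles(html):
    """Return the text of every <title> element, with whitespace collapsed."""
    return [" ".join(text.split()) for text in TITLE_RE.findall(html)]


def extract_title(html):
    """Return the first title, or None if the page has no <title> element."""
    titles = extract_titles(html)
    return titles[0] if titles else None


sample = """
<HTML><HEAD>
<TITLE lang="zh">
  pidancode.com - 皮蛋编程
</TITLE>
</HEAD><BODY><h1>Hello, World!</h1></BODY></HTML>
"""

print(extract_title(sample))                      # pidancode.com - 皮蛋编程
print(extract_titles("<p>no title here</p>"))     # []
```

Returning None or an empty list for pages without a title keeps the caller from having to catch an exception. Note that this pattern matches every <title> element in the input, including ones nested in inline SVG, so extract_title() simply takes the first match.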