Parsing raw HTTP headers

2022-01-17 00:00:00 python http-headers

Question

I have a string of raw HTTP and I would like to represent its fields in an object. Is there any way to parse the individual headers from an HTTP string?

'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1
Host: www.google.com
Connection: keep-alive
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13
Accept-Encoding: gzip,deflate,sdch
Avail-Dictionary: GeNLY2f-
Accept-Language: en-US,en;q=0.8

[...]'


Solution

Update: It's 2019, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now at the bottom of the answer.

There are excellent tools in the Standard Library both for parsing RFC 822-style headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big bytes string, even though we break it across several lines for readability) that we can feed to the examples below:

request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)

As @TryPyPy points out, you can use Python's email message library to parse the headers. We should add that the resulting Message object acts like a dictionary of headers once you have created it:

from email.parser import BytesParser
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)

print(len(headers))     # -> "3"
print(headers.keys())   # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host'])  # -> "cm.bell-labs.com"
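
To illustrate the dictionary-like behaviour, here is a small self-contained sketch. The header bytes, including the repeated `Accept` header, are invented for illustration; `get` and `get_all` are standard methods of `email.message.Message`.

```python
from email.parser import BytesParser

# Hypothetical header block (the request line has already been removed).
raw_headers = (
    b'Host: cm.bell-labs.com\r\n'
    b'Accept: text/html\r\n'
    b'Accept: text/plain\r\n'
    b'\r\n'
)
headers = BytesParser().parsebytes(raw_headers)

# Header names are matched case-insensitively.
host = headers['HOST']

# get() returns a default instead of None when the header is absent.
missing = headers.get('X-Missing', 'n/a')

# A repeated header keeps every value; get_all() returns them in order.
all_accept = headers.get_all('Accept')

print(host)
print(missing)
print(all_accept)
```

Unlike a plain dict, the Message keeps duplicate headers, so `len(headers)` counts each occurrence separately.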

But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.

The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Its documentation is a bit obscure (a problem with the whole suite of HTTP and URL tools in the Standard Library), but all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).

So here is our specialization of the Standard Library class:

from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message

Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to call it properly, but what can you do? Here is how you would use this simple class:

# Using this new class is really easy!

request = HTTPRequest(request_text)

print(request.error_code)       # None  (check this first)
print(request.command)          # "GET"
print(request.path)             # "/who/ken/trust.html"
print(request.request_version)  # "HTTP/1.1"
print(len(request.headers))     # 3
print(request.headers.keys())   # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host'])  # "cm.bell-labs.com"
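
As a further sketch, the same class can be applied to the request from the question, and the query string in `request.path` can then be split with `urllib.parse`. The request below is a shortened copy of the question's example, the variable names are illustrative, and the class is repeated only so that the snippet runs on its own.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO
from urllib.parse import urlsplit, parse_qs

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    # Optional parameters mirror the Standard Library's send_error().
    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message

# Shortened version of the request shown in the question.
question_request = (
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Connection: keep-alive\r\n'
    b'Accept-Encoding: gzip,deflate,sdch\r\n'
    b'\r\n'
)
request = HTTPRequest(question_request)

# request.path still contains the query string; split it off and decode it.
url = urlsplit(request.path)
query = parse_qs(url.query)

print(url.path)
print(query)
print(request.headers['Accept-Encoding'])
```

Note that `parse_qs` maps each parameter name to a list of values, because a query string may repeat a name.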

If there is an error during parsing, the error_code will not be None:

# Parsing can result in an error code and message

request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')

print(request.error_code)     # 400  (shown as HTTPStatus.BAD_REQUEST before Python 3.11)
print(request.error_message)  # "Bad request syntax ('GET')"
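
If you would rather get an exception than remember to check error_code, one possible approach is a small wrapper. This is only a sketch: `parse_or_raise` is a hypothetical helper name, not part of the Standard Library, and the class is repeated so that the snippet runs on its own. It also covers the case where the request line is empty, in which parse_request() stops without reporting an error and leaves command set to None.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    # Optional parameters mirror the Standard Library's send_error().
    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message

def parse_or_raise(request_bytes):
    """Hypothetical helper: return the parsed request or raise ValueError."""
    request = HTTPRequest(request_bytes)
    if request.error_code is not None:
        raise ValueError(f'{int(request.error_code)} {request.error_message}')
    if request.command is None:
        # An empty request line is rejected without any error code.
        raise ValueError('empty request line')
    return request

ok = parse_or_raise(b'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n')
print(ok.command, ok.path)

# Collect the messages produced by two malformed inputs.
errors = []
for bad in (b'GET\r\nHeader: Value\r\n\r\n', b''):
    try:
        parse_or_raise(bad)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```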

I prefer using the Standard Library like this because I suspect that its authors have already encountered and resolved the edge cases that might bite me if I tried re-implementing an Internet specification myself with regular expressions.

Here is the original Python 2 code for this answer, from when I first wrote it:

request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
    )

And:

# Ignore the request line and parse only the headers

from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))

print len(headers)     # -> "3"
print headers.keys()   # -> ['accept-charset', 'host', 'accept']
print headers['Host']  # -> "cm.bell-labs.com"

And:

from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message

And:

# Using this new class is really easy!

request = HTTPRequest(request_text)

print request.error_code       # None  (check this first)
print request.command          # "GET"
print request.path             # "/who/ken/trust.html"
print request.request_version  # "HTTP/1.1"
print len(request.headers)     # 3
print request.headers.keys()   # ['accept-charset', 'host', 'accept']
print request.headers['host']  # "cm.bell-labs.com"

And:

# Parsing can result in an error code and message

request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')

print request.error_code     # 400
print request.error_message  # "Bad request syntax ('GET')"
