How to get string objects instead of Unicode from JSON?

Problem description

I'm using Python 2 to parse JSON from ASCII encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.

Is it possible to get string objects instead of Unicode ones?

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution for today is to use a recent version of Python — i.e. Python 3 and forward.


Solution

A solution with object_hook

[edit]: Updated for Python 2.7 and 3.x compatibility.

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts=False):
    # if this is already a str (a byte string on Python 2), return it unchanged
    if isinstance(data, str):
        return data

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items() # changed to .items() for python 2.7/3
        }

    # python 3 compatible duck-typing
    # if this is a unicode string, return its string representation
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')

    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

  • A copy of the entire decoded structure gets created in memory
  • If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth
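
To make the contrast concrete, here is a minimal sketch of the "decode first, then recurse" style of conversion described above. It is not Mark Amery's actual code; `byteify_recursive` and `text_type` are hypothetical names chosen for illustration. Note that it builds a second, converted copy of the structure and recurses once per nesting level.

```python
import json

# The text type: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")

def byteify_recursive(value):
    """Hypothetical helper: walk a decoded value and UTF-8 encode every text string."""
    if isinstance(value, dict):
        # Builds a brand-new dict, so the whole structure is copied.
        return {byteify_recursive(k): byteify_recursive(v) for k, v in value.items()}
    if isinstance(value, list):
        return [byteify_recursive(item) for item in value]
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value

decoded = json.loads('{"foo": ["bar", {"baz": 1}]}')
converted = byteify_recursive(decoded)
```

On Python 2 the encoded values are `str` objects; on Python 3 the same code yields `bytes`, which is rarely what you want there.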

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
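
A small sketch can show the order in which the decoder hands dicts to object_hook. `recording_hook` and `seen` are hypothetical names used only for this illustration; the hook returns each dict unchanged and just records its keys.

```python
import json

seen = []

def recording_hook(obj):
    # Called by the decoder each time it finishes decoding a JSON object (a dict).
    seen.append(sorted(obj.keys()))
    return obj

json.loads('{"outer": {"middle": {"inner": 1}}}', object_hook=recording_hook)
# seen is now [['inner'], ['middle'], ['outer']]
```

The innermost dict arrives first, so by the time an outer dict reaches the hook, the dicts nested inside it have already been processed.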

Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which is set to True on every call except when object_hook passes _byteify a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
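
The following sketch illustrates why that final call is needed: object_hook only fires for JSON objects, so lists and strings that are not inside any dict never reach it. `counting_hook` and `calls` are hypothetical names for this illustration.

```python
import json

calls = []

def counting_hook(obj):
    # Record every dict the decoder passes to the hook.
    calls.append(obj)
    return obj

# No JSON object anywhere in these texts, so the hook is never invoked.
json.loads('["a", ["b"]]', object_hook=counting_hook)
json.loads('"just a string"', object_hook=counting_hook)
calls_without_dicts = list(calls)

# A dict inside a list does reach the hook; the enclosing list does not.
json.loads('[{"k": "v"}]', object_hook=counting_hook)
```

After the first two calls `calls_without_dicts` is empty, and after the third `calls` holds only the inner dict. Any strings sitting directly in top-level lists would therefore stay unconverted without the outer _byteify call.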
