How to get string objects instead of Unicode from JSON?
Problem description
I'm using Python 2 to parse JSON from ASCII-encoded text files.
When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects. I can neither change nor update those libraries.
Is it possible to get string objects instead of Unicode ones?
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
Update
This question was asked a long time ago, when I was stuck with Python 2. An easy and clean solution today is to use a recent version of Python, i.e. Python 3 or later.
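As a quick check of the update above, here is a minimal sketch, assuming a Python 3 interpreter, showing that json.loads already returns str values there, so no conversion step is needed:

```python
import json

# On Python 3, strings decoded from JSON are already of type str.
new_list = json.loads('["a", "b"]')
all_str = all(isinstance(item, str) for item in new_list)
print(new_list, all_str)
```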
Solution
A solution with object_hook
[edit]: Updated for Python 2.7 and 3.x compatibility.
import json


def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )


def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )


def _byteify(data, ignore_dicts=False):
    if isinstance(data, str):
        return data
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [_byteify(item, ignore_dicts=True) for item in data]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items()  # changed to .items() for python 2.7/3
        }
    # python 3 compatible duck-typing
    # if this is a unicode string, return its string representation
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')
    # if it's anything else, return it in its original form
    return data
Example usage:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
How does this work and why would I use it?
Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?
Purely for performance. Mark's answer first decodes the JSON text fully into unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:
- A copy of the entire decoded structure gets created in memory
- If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth
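To make the first effect concrete, here is a minimal sketch of the "decode first, convert afterwards" approach. It is not Mark's actual code: byteify_after_decoding is a hypothetical helper written for illustration, and it uses type(u"") so that it targets unicode on Python 2 and str on Python 3.

```python
import json

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def byteify_after_decoding(value):
    # Hypothetical helper: walk an already-decoded value and build a new
    # structure in which every text string is encoded to UTF-8 bytes.
    if isinstance(value, dict):
        return {
            byteify_after_decoding(key): byteify_after_decoding(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [byteify_after_decoding(item) for item in value]
    if isinstance(value, TEXT_TYPE):
        return value.encode("utf-8")
    return value


decoded = json.loads('{"foo": ["bar", {"baz": 1}]}')
converted = byteify_after_decoding(decoded)
```

Both decoded and converted are alive at the same time here, and the helper recurses once per nesting level, which is where the two drawbacks listed above come from.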
This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders.
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
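The decoding order described here can be observed with a small sketch. The record_hook function and the seen list are hypothetical names introduced only for this illustration; the hook simply records the keys of each dict it receives and returns the dict unchanged.

```python
import json

seen = []


def record_hook(obj):
    # Hypothetical hook: note which dict was just decoded, then pass it through.
    seen.append(sorted(obj.keys()))
    return obj


json.loads('{"outer": {"inner": {"leaf": 1}}}', object_hook=record_hook)
print(seen)
```

The innermost dict reaches the hook first and the outermost last, so by the time a parent dict is handed to the hook, its child dicts have already been processed.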
Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.
Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
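To illustrate why that final call is needed, the sketch below uses a hypothetical counting_hook to show that object_hook is never invoked when the JSON text contains no object at all, so any strings in such a result would be left untouched by the hook alone.

```python
import json

calls = []


def counting_hook(obj):
    # Hypothetical hook: record every dict the decoder hands over.
    calls.append(obj)
    return obj


# The top-level value is a list of strings and lists, with no dict anywhere.
result = json.loads('["a", ["b"]]', object_hook=counting_hook)
print(result, len(calls))
```

Because calls stays empty, the strings inside result can only be converted by a separate pass over the returned value.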