Why is JSON serialization so much faster than YAML serialization in Python?
Question
I have code that relies heavily on yaml for cross-language serialization and while working on speeding some stuff up I noticed that yaml was insanely slow compared to other serialization methods (e.g., pickle, json).
So what really blows my mind is that json is so much faster than yaml when the output is nearly identical.
>>> import yaml, cjson; d={'foo': {'bar': 1}}
>>> yaml.dump(d, Dumper=yaml.SafeDumper)
'foo: {bar: 1}\n'
>>> cjson.encode(d)
'{"foo": {"bar": 1}}'
>>> from timeit import timeit
>>> timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
44.506911039352417
>>> timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
16.852826118469238
>>> timeit("cjson.encode(d)", setup="import cjson; d={'foo': {'bar': 1}}", number=10000)
0.073784112930297852
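(cjson is a long-unmaintained Python 2-era extension; as a rough modern stand-in, the same comparison can be reproduced with the stdlib json module. This is a sketch, not the original benchmark; the absolute numbers and ratio will differ by machine and PyYAML version, but the gap remains large.)

```python
import json
import timeit

import yaml  # PyYAML; third-party dependency assumed installed

d = {'foo': {'bar': 1}}

# Stdlib json produces essentially the same output cjson did.
print(json.dumps(d))  # {"foo": {"bar": 1}}

# Time both serializers over the same tiny input.
n = 2000
t_yaml = timeit.timeit(lambda: yaml.dump(d, Dumper=yaml.SafeDumper), number=n)
t_json = timeit.timeit(lambda: json.dumps(d), number=n)
print(f"yaml: {t_yaml:.4f}s  json: {t_json:.4f}s  ratio: {t_yaml / t_json:.0f}x")
```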
PyYaml's CSafeDumper and cjson are both written in C so it's not like this is a C vs Python speed issue. I've even added some random data to it to see if cjson is doing any caching, but it's still way faster than PyYaml. I realize that yaml is a superset of json, but how could the yaml serializer be 2 orders of magnitude slower with such simple input?
Answer
In general, it's not the complexity of the output that determines the speed of parsing, but the complexity of the accepted input. The JSON grammar is very concise; YAML parsers are comparatively complex, which leads to increased overhead.
JSON’s foremost design goal is simplicity and universality. Thus, JSON is trivial to generate and parse, at the cost of reduced human readability. It also uses a lowest common denominator information model, ensuring any JSON data can be easily processed by every modern programming environment.
In contrast, YAML’s foremost design goals are human readability and support for serializing arbitrary native data structures. Thus, YAML allows for extremely readable files, but is more complex to generate and parse. In addition, YAML ventures beyond the lowest common denominator data types, requiring more complex processing when crossing between different programming environments.
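One concrete example of YAML going beyond the lowest common denominator: YAML 1.1 resolvers turn unquoted timestamps into native date objects, whereas JSON's information model has no date type at all, so dates survive only as strings.

```python
import datetime
import json

import yaml  # PyYAML, assumed installed

# PyYAML's SafeLoader resolves an unquoted ISO date to a datetime.date.
data = yaml.safe_load("when: 2009-01-01")
print(type(data['when']))  # <class 'datetime.date'>

# JSON has no date type; a date round-trips only as a plain string.
print(json.dumps({"when": "2009-01-01"}))  # {"when": "2009-01-01"}
```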
I'm not a YAML parser implementor, so I can't speak specifically to the orders of magnitude without some profiling data and a big corpus of examples. In any case, be sure to test over a large body of inputs before feeling confident in benchmark numbers.
Update: Whoops, misread the question. :-( Serialization can still be blazingly fast despite the large input grammar; however, browsing the source, it looks like PyYAML's Python-level serialization constructs a representation graph whereas simplejson encodes builtin Python datatypes directly into text chunks.
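The representation-graph step mentioned above can be observed directly by poking at PyYAML's internals (an implementation detail that may change across versions): the representer turns native Python data into a tree of Node objects, which a separate serializer/emitter then walks to produce text.

```python
import yaml  # PyYAML, assumed installed

d = {'foo': {'bar': 1}}

# PyYAML first represents native data as a graph of Node objects;
# only afterwards does an emitter walk that graph to write text.
rep = yaml.representer.SafeRepresenter()
node = rep.represent_data(d)
print(type(node).__name__)  # MappingNode

# node.value is a list of (key_node, value_node) pairs.
key_node, value_node = node.value[0]
print(key_node.value, type(value_node).__name__)  # foo MappingNode

# json.dumps, by contrast, walks the dict once and writes text directly,
# with no intermediate object graph.
```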