Why does PyYAML use generators to construct objects?

2022-01-14 00:00:00 python pyyaml yaml ruamel.yaml

Question

I've been reading the PyYAML source code to understand how to define a proper constructor function that I can register with add_constructor. I have a pretty good understanding of how that code works now, but I still don't understand why the default YAML constructors in SafeConstructor are generators. For example, the method construct_yaml_map of SafeConstructor:

def construct_yaml_map(self, node):
    data = {}
    yield data
    value = self.construct_mapping(node)
    data.update(value)

I understand how the generator is used in BaseConstructor.construct_object, shown below, to stub out an object first and populate it with data from the node afterwards. The population is deferred only when deep=False is passed to construct_mapping; otherwise the generator is exhausted immediately:

    if isinstance(data, types.GeneratorType):
        generator = data
        data = generator.next()
        if self.deep_construct:
            for dummy in generator:
                pass
        else:
            self.state_generators.append(generator)

And I understand how the deferred data is then filled in by BaseConstructor.construct_document in the case where deep=False for construct_mapping:

def construct_document(self, node):
    data = self.construct_object(node)
    while self.state_generators:
        state_generators = self.state_generators
        self.state_generators = []
        for generator in state_generators:
            for dummy in generator:
                pass
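
The difference between exhausting the generator immediately and deferring it can be sketched outside PyYAML. This is only a toy model of the two excerpts above: run_constructor, list_constructor and the pending list are hypothetical names, not part of the library.

```python
import types


def run_constructor(constructor, deep, pending):
    """Toy stand-in for the generator handling quoted above (hypothetical helper)."""
    data = constructor()
    if isinstance(data, types.GeneratorType):
        generator = data
        data = next(generator)          # step 1: obtain the empty stub
        if deep:
            for _ in generator:         # deep: fill the stub right away
                pass
        else:
            pending.append(generator)   # not deep: fill the stub later
    return data


def list_constructor():
    """Generator-style constructor: yield the stub, then populate it."""
    data = []
    yield data
    data.extend([1, 2, 3])


pending = []
eager = run_constructor(list_constructor, deep=True, pending=pending)
lazy = run_constructor(list_constructor, deep=False, pending=pending)
lazy_before = list(lazy)    # snapshot taken while the stub is still empty

# Roughly what construct_document does at the end: drain deferred generators.
while pending:
    current, pending = pending, []
    for generator in current:
        for _ in generator:
            pass
```

In this sketch eager is populated as soon as run_constructor returns, while lazy stays an empty list until the final loop runs. In both cases the caller holds the same list object before and after it is filled.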

What I don't understand is the benefit of stubbing out the data objects and then working down through them by iterating over the generators in construct_document. Is this required to support something in the YAML spec, or does it provide a performance benefit?

This answer on another question was somewhat helpful, but I don't understand why that answer does this:

def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    yield instance
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)

instead of this:

def foo_constructor(loader, node):
    state = loader.construct_mapping(node, deep=True)
    return Foo(**state)

I've tested that the latter form works for the examples posted in that other answer, but perhaps I am missing some edge case.

I am using version 3.10 of PyYAML, but it looks like the code in question is the same in the latest version (3.12) of PyYAML.


Answer

In YAML you can have anchors and aliases. With those you can create self-referential structures, directly or indirectly.

If YAML did not have this possibility of self-reference, you could just construct all the children first and then create the parent structure in one go. But because of self-references, you might not yet have the child that you need to "fill out" the structure you are creating. By using the two-step process of the generator (I call it two-step because it has only one yield before you reach the end of the method), you can create an object partially and then fill it out with a self-reference, because the object already exists (i.e. its place in memory is defined).

The benefit is not speed; the two-step process exists purely to make self-reference possible.
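
The mechanism can be modelled without any YAML library. The following is a minimal sketch under stated assumptions, not PyYAML's implementation: ToyLoader, two_step_map, one_step_map and make_cyclic_node are hypothetical names, nodes are plain dicts, and the deep flag is left out. A node that lists itself as its own child plays the role of an anchored mapping containing an alias to itself.

```python
import types


class ToyLoader:
    """Hypothetical, stripped-down model of two-step construction."""

    def __init__(self):
        self.constructed = {}     # id(node) -> object (possibly still a stub)
        self.in_progress = set()  # nodes whose constructor has not handed over an object
        self.pending = []         # generators paused right after yielding a stub

    def construct_object(self, node):
        key = id(node)
        if key in self.constructed:       # alias to a node that already has an object
            return self.constructed[key]
        if key in self.in_progress:       # alias reached before any object exists
            raise ValueError("found unconstructable recursive node")
        self.in_progress.add(key)
        data = node["constructor"](self, node)
        if isinstance(data, types.GeneratorType):
            generator = data
            data = next(generator)        # step 1: take the stub
            self.pending.append(generator)
        self.constructed[key] = data
        self.in_progress.discard(key)
        return data

    def construct_document(self, node):
        data = self.construct_object(node)
        while self.pending:               # step 2: fill in the stubs
            current, self.pending = self.pending, []
            for generator in current:
                for _ in generator:
                    pass
        return data


def two_step_map(loader, node):
    data = {}
    yield data                            # the object exists from here on
    for name, child in node["children"].items():
        data[name] = loader.construct_object(child)


def one_step_map(loader, node):
    # Children first, parent afterwards: there is nothing to refer back to.
    return {name: loader.construct_object(child)
            for name, child in node["children"].items()}


def make_cyclic_node(constructor):
    node = {"constructor": constructor, "children": {}}
    node["children"]["me"] = node         # stands in for an alias to the node itself
    return node


good = ToyLoader().construct_document(make_cyclic_node(two_step_map))

error = None
try:
    ToyLoader().construct_document(make_cyclic_node(one_step_map))
except ValueError as exc:
    error = str(exc)
```

In this model the two-step constructor yields a dict whose "me" entry is the dict itself, while the one-step constructor fails because the alias is reached before any object has been registered for the node.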

If you simplify the example from the answer you refer to a bit, the following loads:

import sys
import ruamel.yaml as yaml


class Foo(object):
    def __init__(self, s, l=None, d=None):
        self.s = s
        self.l1, self.l2 = l
        self.d = d


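def foo_constructor(loader, node):
    # Step 1: create an uninitialized instance and hand it to the loader,
    # so that aliases such as *fooref have an object to point at.
    instance = Foo.__new__(Foo)
    yield instance
    # Step 2: construct the children (which may refer back to instance),
    # then initialize the instance with them.
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)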

yaml.add_constructor(u'!Foo', foo_constructor)

x = yaml.load('''
&fooref
!Foo
s: *fooref
l: [1, 2]
d: {try: this}
''', Loader=yaml.Loader)

yaml.dump(x, sys.stdout)

But if you change foo_constructor() to:

def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)
    return instance

(yield removed, final return added), you get a ConstructorError with the message:

found unconstructable recursive node 
  in "<unicode string>", line 2, column 1:
    &fooref

PyYAML should give a similar message. Inspect the traceback of that error and you can see where in the source code ruamel.yaml/PyYAML tries to resolve the alias.
