为什么 Python 集不保留插入顺序?
问题描述
我最近惊讶地发现,虽然字典在 Python 3.7+ 中可以保证保留插入顺序,但集合却不是:
>>>d = {'a':1,'b':2,'c':3}>>>d{'a':1,'b':2,'c':3}>>>d['d'] = 4>>>d{'a':1,'b':2,'c':3,'d':4}
>>>s = {'a', 'b', 'c'}>>>s{'b','a','c'}>>>s.add('d')>>>s{'d','b','a','c'}
造成这种差异的原因是什么?导致 Python 团队更改 dict 实现的相同效率改进是否也不适用于集合?
我不是在寻找指向有序集合实现的指针或使用字典作为集合替代的方法.我只是想知道为什么 Python 团队没有在为 dicts 做的同时让内置集合保持顺序.
解决方案Sets 和 dicts 针对不同的用例进行了优化.集合的主要用途是快速成员资格测试,它与顺序无关. 对于 dicts,查找成本是最关键的操作,并且密钥更有可能出现.使用集合,预先不知道元素的存在或不存在,因此集合实现需要针对找到和未找到的情况进行优化.此外,对联合和交集等常见集合操作的一些优化使得很难在不降低性能的情况下保持集合顺序.
虽然这两种数据结构都是基于散列的,但一个常见的误解是,集合只是作为具有空值的 dicts 实现的.甚至在 CPython 3.6 中的紧凑 dict 实现之前,set 和 dict 实现已经存在显着差异,几乎没有代码重用.例如,dicts 使用随机探测,但 set 使用线性探测和开放寻址的组合,以提高缓存局部性.初始线性探测(在 CPython 中默认 9 个步骤)将检查一系列相邻的键/哈希对,通过降低哈希冲突处理的成本来提高性能 - 连续的内存访问比分散的探测更便宜.
dictobject.c
- master,v3.5.9setobject.c
- master,v3.5.9- issue18771 - 变更集以降低 Python 3.4 中集合对象的哈希冲突成本.
理论上可能将 CPython 的 set 实现更改为类似于紧凑型 dict,但在实践中存在缺陷,著名的核心开发人员反对进行此类更改.p><块引用>
集合保持无序.(为什么?使用模式不同.另外,不同的实现.)
– Guido van Rossum
<块引用>集合使用不同的算法,无法修改以保留插入顺序.如果需要顺序,成套操作就会失去灵活性和优化.集合数学是根据无序集合定义的.简而言之,集合排序不是在不久的将来.
– 雷蒙德·赫廷格
关于是否为 3.7 压缩集合的详细讨论,以及为什么不这样做,可以在 python-dev 邮件列表中找到.
综上所述,要点是:不同的使用模式(插入排序dicts如**kwargs是有用,对集合来说不太有用),压缩集合的空间节省不太重要(因为只有键 + 哈希数组来致密,而不是键 + 哈希 + 值数组),以及前面提到的当前使用的集合的线性探测优化与紧凑的实现不兼容.
我将在下面复制 Raymond 的帖子,其中涵盖了最重要的几点.
<块引用><块引用>2016 年 9 月 14 日下午 3:50,Eric Snow 写道:
<块引用>那么,我也会对集合做同样的事情.
除非我有误解,否则雷蒙德反对制作类似的更改为设置.
没错.在人们面前,这里有一些关于这个主题的想法开始狂奔.
对于紧凑型 dict,空间节省是一项净赢,因为索引消耗了额外的空间,并且为键/值/哈希数组被改进的偏移量超过键/值/哈希数组的密度.但是对于套装,网络很多不太有利,因为我们仍然需要索引和过度分配但只能通过致密三个中的两个来抵消空间成本数组.换句话说,当你有浪费了键、值和散列的空间.如果你失去其中之一第三,它不再引人注目.
set 的使用模式与 dicts 不同.前者有更多的命中或未命中查找.后者往往缺少较少的密钥查找.此外,对 set-to-set 操作的一些优化很难在不影响的情况下保留集合顺序性能.
我寻求其他途径来提高布景性能.而不是压缩(这并没有太大的空间优势,并且产生了附加间接),我添加了线性探测以降低成本冲突并提高缓存性能.这种改进是与我提倡的压缩方法不兼容字典.
目前,对字典的排序副作用是无法保证的,所以现在开始坚持集合也是有序的还为时过早.文档已经链接到创建 OrderedSet 的配方(https://code.activestate.com/recipes/576694/ )但它看起来像吸收几乎为零.此外,现在 Eric Snow 给了我们一个快速的 OrderedDict,比以往任何时候都更容易从MutableSet 和 OrderedDict,但我再次没有观察到任何真实的兴趣,因为典型的成套数据分析并不真正需要或关心订购.同样,快速会员的主要用途测试与订单无关.
也就是说,我确实认为有向 PyPI 添加替代集实现的空间.特别是有一些有趣的可排序数据的特殊情况,其中集合到集合操作可以是通过比较整个范围的键来加速(参见https://code.activestate.com/recipes/230113-使用排序列表的集合实现为起点).IIRC,PyPI 已经有类似集合的绽放代码过滤器和杜鹃散列.
我知道让 Python 核心接受一个主要代码块是令人兴奋的,但这不应该打开闸门除非我们确定,否则对其他数据类型进行更重大的重写这是有保证的.
<块引用>
– 雷蒙德·海廷格
从 [Python-Dev] Python 3.6 dict 变得紧凑和获得私人版本;和关键字变得有序,2016 年 9 月.
I was surprised to discover recently that while dicts are guaranteed to preserve insertion order in Python 3.7+, sets are not:
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> d
{'a': 1, 'b': 2, 'c': 3}
>>> d['d'] = 4
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> s = {'a', 'b', 'c'}
>>> s
{'b', 'a', 'c'}
>>> s.add('d')
>>> s
{'d', 'b', 'a', 'c'}
What is the rationale for this difference? Do the same efficiency improvements that led the Python team to change the dict implementation not apply to sets as well?
I'm not looking for pointers to ordered-set implementations or ways to use dicts as stand-ins for sets. I'm just wondering why the Python team didn't make built-in sets preserve order at the same time they did so for dicts.
解决方案Sets and dicts are optimized for different use-cases. The primary use of a set is fast membership testing, which is order agnostic. For dicts, cost of the lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, and so the set implementation needs to optimize for both the found and not-found case. Also, some optimizations for common set operations such as union and intersection make it difficult to retain set ordering without degrading performance.
While both data structures are hash based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, dicts use randomized probing, but sets use a combination of linear probing and open addressing, to improve cache locality. The initial linear probe (default 9 steps in CPython) will check a series of adjacent key/hash pairs, improving performance by reducing the cost of hash collision handling - consecutive memory access is cheaper than scattered probes.
dictobject.c
- master, v3.5.9setobject.c
- master, v3.5.9- issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.
It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.
Sets remain unordered. (Why? The usage patterns are different. Also, different implementation.)
– Guido van Rossum
Sets use a different algorithm that isn't as amendable to retaining insertion order. Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.
– Raymond Hettinger
A detailed discussion about whether to compactify sets for 3.7, and why it was decided against, can be found in the python-dev mailing lists.
In summary, the main points are: different usage patterns (insertion ordering dicts such as **kwargs is useful, less so for sets), space savings for compacting sets are less significant (because there are only key + hash arrays to densify, as opposed to key + hash + value arrays), and the aforementioned linear probing optimization which sets currently use is incompatible with a compact implementation.
I will reproduce Raymond's post below which covers the most important points.
On Sep 14, 2016, at 3:50 PM, Eric Snow wrote:
Then, I'll do same to sets.
Unless I've misunderstood, Raymond was opposed to making a similar change to set.
That's right. Here are a few thoughts on the subject before people starting running wild.
For the compact dict, the space savings was a net win with the additional space consumed by the indices and the overallocation for the key/value/hash arrays being more than offset by the improved density of key/value/hash arrays. However for sets, the net was much less favorable because we still need the indices and overallocation but can only offset the space cost by densifying only two of the three arrays. In other words, compacting makes more sense when you have wasted space for keys, values, and hashes. If you lose one of those three, it stops being compelling.
The use pattern for sets is different from dicts. The former has more hit or miss lookups. The latter tends to have fewer missing key lookups. Also, some of the optimizations for the set-to-set operations make it difficult to retain set ordering without impacting performance.
I pursued alternative path to improve set performance. Instead of compacting (which wasn't much of space win and incurred the cost of an additional indirection), I added linear probing to reduce the cost of collisions and improve cache performance. This improvement is incompatible with the compacting approach I advocated for dictionaries.
For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting the sets become ordered as well. The docs already link to a recipe for creating an OrderedSet ( https://code.activestate.com/recipes/576694/ ) but it seems like the uptake has been nearly zero. Also, now that Eric Snow has given us a fast OrderedDict, it is easier than ever to build an OrderedSet from MutableSet and OrderedDict, but again I haven't observed any real interest because typical set-to-set data analytics don't really need or care about ordering. Likewise, the primary use of fast membership testings is order agnostic.
That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting special cases for orderable data where set-to-set operations can be sped-up by comparing entire ranges of keys (see https://code.activestate.com/recipes/230113-implementation-of-sets-using-sorted-lists for a starting point). IIRC, PyPI already has code for set-like bloom filters and cuckoo hashing.
I understanding that it is exciting to have a major block of code accepted into the Python core but that shouldn't open to floodgates to engaging in more major rewrites of other datatypes unless we're sure that it is warranted.
– Raymond Hettinger
From [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered, Sept 2016.
相关文章