在迭代其元素时更新集合

2022-01-17 00:00:00 python iteration set python-internals

问题描述

当我尝试在迭代其元素时更新集合时，它的行为应该是什么?

When I try to update a set while iterating over its elements, what should be its behavior?

我在各种场景中尝试过它，它不会对迭代开始后添加的元素进行迭代，也不会对迭代期间删除的元素进行迭代.如果我在迭代期间删除并放回任何元素，则正在考虑该元素.确切的行为是什么以及它是如何工作的?

I tried it over various scenarios and it does not iterate over elements added after iteration is started and also the elements removed during iteration. If I remove and put back any element during iteration, that element is being considered. What's the exact behavior and how does it work?

这会打印字符串的所有排列:

This prints all the permutations of a string:

def permutations(s): ans = [] def helper(created, remaining): if len(created) == len(s): ans.append(''.join(created)) return for ch in remaining: remaining.remove(ch) created.append(ch) helper(created, remaining) remaining.add(ch) created.pop() helper([], set(s)) return ans

这里的行为是不可预测的，有时会打印出 e 而有时却不会:

Here the behavior is unpredictable, sometimes e is printed while sometimes it's not:

ab = set(['b','c','d']) x = True for ch in ab: if x: ab.remove('c') ab.add('e') x = False print(ch)

在这里，我总是只看到一次 'c'.即使第一个字符是 'c':

Here I always see 'c' only once. Even when first character is 'c':

ab = set(['b','c','d']) x = True for ch in ab: if x: ab.remove('c') ab.add('c') x = False print(ch)

以及实现上述功能相同目标的另一种方法:

And an alternate way to achieve the same objective of the above function:

def permwdups(s): ans = [] def helper(created, remaining): if len(created) == len(s): ans.append(''.join(created)) return for ch in remaining: if (remaining[ch]<=0): continue remaining[ch] -=1 created.append(ch) helper(created, remaining) remaining[ch] +=1 created.pop() counts = {} for i in range(len(s)): if s[i] not in counts: counts[s[i]] = 1 else: counts[s[i]]+= 1 helper([], counts) print(len(set(ans))) return ans

解决方案

其实很简单，sets在CPython中实现为hash - item 表格:

It's actually very easy, sets are implemented in CPython as hash - item table:

hash | item ----------------- - | - ----------------- - | - ...

CPython 使用开放寻址，因此不会填充表中的所有行，它会根据项目的(截断)散列确定元素的位置，并在发生冲突时使用伪随机"位置确定.我将在此答案中忽略截断哈希冲突.

CPython uses open-addressing so not all rows in the table are filled and it determines the position of an element based on the (truncated) hash of the item with a "pseudo-randomized" position determination in case of collisions. I will ignore truncated-hash-collisions in this answer.

我还将忽略散列截断的细节，只使用整数，所有(除了一些例外)将它们的散列映射到实际值:

I'll also ignore the specifics of the hash-truncation and just work with integers, which all (with some exceptions) map their hash to the actual value:

>>> hash(1) 1 >>> hash(2) 2 >>> hash(20) 20

因此，当您创建具有值 1、2 和 3 的 set 时，您将拥有(大致)下表:

So when you create a set with the values 1, 2 and 3 you'll have (roughly) the following table:

hash | item ----------------- - | - ----------------- 1 | 1 ----------------- 2 | 2 ----------------- 3 | 3 ...

该集合从表的顶部迭代到表的末尾，并忽略空行".因此，如果您在不修改的情况下迭代该集合，您将获得数字 1、2 和 3:

The set is iterated from the top of the table to the end of the table and empty "rows" are ignored. So if you would iterate over that set without modifying it you would get the numbers 1, 2 and 3:

>>> for item in {1, 2, 3}: print(item) 1 2 3

基本上，迭代器从第 0 行开始，看到该行为空，然后转到包含项目 1 的第 1 行.该项目由迭代器返回.下一次迭代它在第 2 行并返回那里的值，即 2，然后它转到第 3 行并返回存储在那里的 3.接下来的迭代迭代器在第 4 行，它是空的，所以它去到第 5 行，它也是空的，然后到第 6 行，......直到它到达表的末尾并抛出一个 StopIteration 异常，表示迭代器完成.

Basically the iterator starts in row 0 and sees that the row is empty and goes to row 1 which contains the item 1. This item is returned by the iterator. The next iteration it's in row 2 and returns the value there, which is 2, then it goes to row 3 and returns the 3 that is stored there. The following iteration the iterator is in row 4 which is empty, so it goes to row 5 which is also empty, then to row 6, .... until it reaches the end of the table and throws a StopIteration exception, which signals that the iterator finished.

但是，如果您在迭代集合时更改集合，则会记住迭代器的当前位置(行).这意味着如果您在前一行中添加一个项目，迭代器将不会返回它，如果它是在后面的行中添加的，它将被返回(至少在迭代器存在之前它没有被再次删除).

However if you would change the set while iterating over it the current position (row) where the iterator is is remembered. That means if you add an item in a previous row the iterator won't return it, if it's added in a later row it will be returned (at least if it's not removed again before the iterator is there).

假设，您总是删除当前项目并添加一个整数，即 item + 1 到集合中.像这样的:

Assume, you always remove the current item and add an integer which is item + 1 to the set. Something like this:

s = {1} for item in s: print(item) s.discard(item) s.add(item+1)

迭代前的集合是这样的:

The set before the iteration looks like this:

hash | item ----------------- - | - ----------------- 1 | 1 ----------------- - | - ...

迭代器将从第 0 行开始，发现它是空的，然后转到包含值 1 的第 1 行，然后返回并打印该值.如果箭头指示迭代器的位置，它在第一次迭代中将如下所示:

The iterator will start in row 0 find it is empty and goes to row 1 which contains the value 1 which is then returned and printed. If the arrow would indicate the position of the iterator it would look like this in the first iteration:

hash | item ----------------- - | - ----------------- 1 | 1 <---------- ----------------- - | -

然后把1去掉，加上2:

hash | item ----------------- - | - ----------------- - | - <---------- ----------------- 2 | 2

所以在下一次迭代中，迭代器找到值 2 并返回它.然后两个相加，一个3相加:

So in the next iteration the iterator finds the value 2 and returns it. Then the two is added and a 3 is added:

hash | item ----------------- - | - ----------------- - | - ----------------- - | - <---------- ----------------- 3 | 3

等等.

直到它到达 7.此时发生了一些有趣的事情:8 的截断哈希意味着 8 将被放入第 0 行，但是第 0 行已经被迭代，因此它将停止<代码>7.实际值可能取决于 Python 版本和集合的添加/删除历史记录，例如仅更改 set.add 和 set.discard 操作的顺序将给出一个不同的结果(因为调整了集合的大小，所以最多为 15).

Until it reaches 7. At that point something interesting happens: The truncated hash of 8 means that the 8 will be put in row 0, however row 0 has already been iterated over so it will stop with 7. The actual value might depend on Python version and the add/removal history of the set, for example just changing the order of the set.add and set.discard operations will give a different result (goes up to 15 because the set is resized).

出于同样的原因，如果在每次迭代中添加 item - 1，迭代器只会显示 1，因为 0 将是(因为哈希 0)到第一行:

For the same reason the iterator would only display 1 if one would add the item - 1 in each iteration, because 0 would be (because of hash 0) to the first row:

s = {1} for item in s: print(item) s.discard(item) s.add(item-1) hash | item ----------------- - | - ----------------- 1 | 1 <---------- ----------------- - | - hash | item ----------------- 0 | 0 ----------------- - | - ----------------- - | - <----------

使用简单的 GIF 进行可视化:

Visualized with a simple GIF:

请注意，这些示例非常简单，如果 set 在迭代期间调整大小，它将根据新"截断散列重新分配存储的项目，并且还会删除留下的假人当您从集合中移除一个项目时.在这种情况下，您仍然可以(大致)预测会发生什么，但它会变得更加复杂.

Note that these examples are very simplistic, if the set resizes during the iteration it will re-distribute the stored items based on the "new" truncated hash and it will also remove dummies that are left behind when you remove an item from a set. In that case you still can (roughly) predict what will happen but it will become a lot more convoluted.

另一个但非常重要的事实是 Python(自 Python 3.4 起)为每个解释器随机化字符串的哈希值.这意味着每个 Python 会话将为字符串生成不同的哈希值.因此，如果您在迭代时添加/删除字符串，行为也将是随机的.

One additional but very important fact is that Python (since Python 3.4) randomizes the hash value of strings for each interpreter. That means that each Python session will produce different hash-values for strings. So if you add/remove strings while iterating the behavior will also be random.

假设你有这个脚本:

s = {'a'} for item in s: print(item) s.discard(item) s.add(item*2)

你从命令行多次运行它，结果会有所不同.

and you run it multiple times from the command line the result will be different.

例如我的第一次跑步:

'a' 'aa'

我的第二次/第三次/第四次跑步:

My second / third / fourth run:

'a'

我的第五次跑步:

'a' 'aa'

这是因为命令行中的脚本总是启动一个新的解释器会话.如果您在同一会话中多次运行该脚本，结果不会有所不同.例如:

That's because scripts from the command line always start a new interpreter session. If you run the script multiple times in the same session the results won't differ. For example:

>>> def fun(): ... s = {'a'} ... for item in s: ... print(item) ... s.discard(item) ... s.add(item*2) >>> for i in range(10): ... fun()

产生:

a aa a aa a aa a aa a aa a aa a aa a aa a aa a aa

但它也可以给出 10 倍的 a 或 10 倍的 a, aa 和 aaaa, ...

But it could also have given 10 times a or 10 times a, aa and aaaa, ...

总结一下:

如果项目放置在尚未被迭代的位置，则将显示在迭代期间添加到集合的值.位置取决于项目的截断哈希和碰撞策略.

A value added to a set during iteration will be displayed if the item is put in a position that hasn't been iterated over. The position depends on the truncated hash of the item and the collision strategy.

哈希的截断取决于集合的大小，而该大小取决于集合的添加/删除历史记录.

The truncation of the hash depends on the size of the set and that size depends on the add/removal history of the set.

对于字符串，我们无法预测位置，因为在最近的 Python 版本中，它们的哈希值是在每个会话的基础上随机分配的.

With strings one cannot predict the position because their hash values are randomized on a per-session-basis in recent Python versions.

最重要的是:避免在迭代时修改 set/list/dict/....它几乎总是会导致问题，即使没有，也会让任何阅读它的人感到困惑！尽管在少数非常罕见的情况下，在迭代列表时将元素添加到列表中是有意义的.这将需要非常具体的评论，否则它看起来像一个错误！特别是对于 set 和 dicts，您将依赖可能随时更改的实现细节！

And most importantly: Avoid modifying the set / list / dict / ... while iterating over it. It almost always leads to problems and even if it doesn't it will confuse anyone reading it! Although there are a few, very rare cases where adding elements to a list while iterating over it makes sense. That would require very specific comments alongside it, otherwise it would look like a mistake! Especially with sets and dicts you will rely on implementation details that might change at any point!

以防你好奇，我使用 Jupyter Notebook 中的 Cython 内省(有点脆弱，可能只适用于 Python 3.6，绝对不能用于生产代码)检查了集合的内部:

Just in case you're curious I inspected the internals of the set using (somewhat fragile and probably only works on Python 3.6 and definitely not usable in production-code) Cython introspection in a Jupyter Notebook:

%load_ext Cython %%cython from cpython cimport PyObject, PyTypeObject cimport cython cdef extern from "Python.h": ctypedef Py_ssize_t Py_hash_t struct setentry: PyObject *key Py_hash_t hash ctypedef struct PySetObject: Py_ssize_t ob_refcnt PyTypeObject *ob_type Py_ssize_t fill Py_ssize_t used Py_ssize_t mask setentry *table Py_hash_t hash Py_ssize_t finger setentry smalltable[8] PyObject *weakreflist cpdef print_set_table(set inp): cdef PySetObject* innerset = <PySetObject *>inp for idx in range(innerset.mask+1): if (innerset.table[idx].key == NULL): print(idx, '<EMPTY>') else: print(idx, innerset.table[idx].hash, <object>innerset.table[idx].key)

其中打印集合内的键值表:

Which prints the key-value table inside the set:

a = {1} print_set_table(a) for idx, item in enumerate(a): print(' idx', idx) a.discard(item) a.add(item+1) print_set_table(a)

请注意，输出将包含虚拟对象(已删除集合项的剩余部分)，并且它们有时会消失(当集合变得太满或调整大小时).

Note that the output will contain dummys (left-overs from deleted set-items) and they will sometimes disappear (when the set get's too full or resized).

相关文章