Why does union consume more memory if the argument is a set?

2022-01-17 00:00:00 python set memory-management

Question

I'm puzzled by this behaviour of memory allocation of sets:

>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).union(range(1000)).__sizeof__()       # expected, set doesn't change
32968
>>> set(range(1000)).union(list(range(1000))).__sizeof__() # expected, set doesn't change
32968
>>> set(range(1000)).union(set(range(1000))).__sizeof__()  # not expected
65736

Why does using a set as the argument double the amount of memory used by the resulting set? The result in both cases is identical to the original set:

>>> set(range(1000)) == set(range(1000)).union(range(1000)) == set(range(1000)).union(set(range(1000)))
True

Note that the same happens when using a normal iterator:

>>> set(range(1000)).union(iter(list(range(1000)))).__sizeof__()
32968

The same also happens with the update method:

>>> a = set(range(1000))    # setup assumed here; `a` was not defined in the original snippet
>>> a.update(range(1000))
>>> a.__sizeof__()
32968
>>> a.update(set(range(1000)))
>>> a.__sizeof__()
65736

At first I thought that it was because when union is called, it sees that the size of the other set is 1000 and thus decides to allocate enough memory to fit all the elements of both sets, but afterwards it uses only part of that memory, while in the iterator case it simply iterates over it and adds the elements one by one (which doesn't consume more memory, since all the elements are already in the set).

But range is also a sequence, and so is the list in the first example.

>>> len(range(1000))
1000
>>> range(1000)[100]
100

So why doesn't this happen with range and list, but only with set? Is there a design decision behind this, or is it a bug?

Tested on Python 2.7.3 and Python 3.2.3 on 64-bit Linux.


Solution

In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.

The implementation that's used when the argument is a set makes the following assumption documented in the code:

/* Do one big resize at the start, rather than
 * incrementally resizing as we insert new keys.  Expect
 * that there will be no (or few) overlapping keys.
 */

Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.
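
A small sketch (my own illustration, not part of the original answer) can make the tradeoff visible: the big resize is only wasted when the argument overlaps heavily with the receiver, while for disjoint sets the extra room is actually needed. The exact __sizeof__ numbers depend on the interpreter version and build, so treat the printed values as indicative only.

# Sketch: compare overlapping vs. disjoint set arguments, and the iterator path.
# Exact sizes vary between Python builds; only the relative pattern matters.
base = set(range(1000))
overlapping = set(range(1000))        # same keys as base
disjoint = set(range(1000, 2000))     # no keys in common with base

print(base.union(overlapping).__sizeof__())        # pre-resize mostly wasted: result still has 1000 keys
print(base.union(disjoint).__sizeof__())           # pre-resize used: result really holds 2000 keys
print(base.union(iter(overlapping)).__sizeof__())  # iterator argument: generic one-by-one path, no big resize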

I am not sure I would call this a bug though. The implementer of set chose what to me looks like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.

The upside of the tradeoff is that in many cases the pre-allocation results in better performance:

In [20]: rhs = list(range(1000))

In [21]: %timeit set().union(rhs)
10000 loops, best of 3: 30 us per loop

In [22]: rhs = set(range(1000))

In [23]: %timeit set().union(rhs)
100000 loops, best of 3: 14 us per loop

Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.
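
If you want to reproduce this comparison outside IPython, the standard timeit module gives the same kind of measurement (a sketch; absolute numbers will differ from those quoted above):

# Sketch: time set().union() with a list argument vs. a set argument,
# using the standard timeit module instead of IPython's %timeit.
import timeit

stmt = "set().union(rhs)"
for setup in ("rhs = list(range(1000))", "rhs = set(range(1000))"):
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=10000))
    print(setup, "->", best / 10000, "seconds per call")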

If the overallocation is a deal-breaker, there are a number of ways to work around it, some of which you've already discovered.
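
For example, based on the behaviour shown in the question, two workarounds that should avoid the oversized result (again a sketch, not part of the original answer; check __sizeof__ on your own build):

# Sketch of possible workarounds; verify the sizes on your own interpreter.
a = set(range(1000))
b = set(range(1000))

u1 = a.union(iter(b))   # iterator argument takes the element-by-element path, skipping the big pre-resize
u2 = set(a.union(b))    # copying the result into a fresh set sizes the table for the keys it actually holds

print(u1.__sizeof__(), u2.__sizeof__())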
