Python frozenset hashing algorithm / implementation

2022-01-17 00:00:00 python set hash python-internals

Problem description


I'm currently trying to understand the mechanism behind the hash function defined for Python's built-in frozenset data type. The implementation is shown at the bottom for reference. What I'm particularly interested in is the rationale for the choice of this scattering operation:

lambda h: (h ^ (h << 16) ^ 89869747) * 3644798167


where h is the hash of each element. Does anyone know where these came from? (That is, was there any particular reason to pick these numbers?) Or were they simply chosen arbitrarily?
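To see what the operation does, it can be applied standalone to a few adjacent hash values (a small illustrative sketch, not part of the original question; the `scatter` name is mine). Since the multiplier is odd, multiplication is a bijection modulo 2**32, so distinct inputs stay distinct after truncation:

```python
# The scattering step from the frozenset hash, applied standalone.
scatter = lambda h: (h ^ (h << 16) ^ 89869747) * 3644798167

# Truncate to 32 bits, as CPython would on a 32-bit build.
MASK = (1 << 32) - 1
outs = [scatter(h) & MASK for h in (1, 2, 3)]
print(outs)  # three distinct, thoroughly scrambled values
```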


Here is the snippet from the official CPython implementation:

static Py_hash_t
frozenset_hash(PyObject *self)
{
    PySetObject *so = (PySetObject *)self;
    Py_uhash_t h, hash = 1927868237UL;
    setentry *entry;
    Py_ssize_t pos = 0;

    if (so->hash != -1)
        return so->hash;

    hash *= (Py_uhash_t)PySet_GET_SIZE(self) + 1;
    while (set_next(so, &pos, &entry)) {
        /* Work to increase the bit dispersion for closely spaced hash
           values.  This is important because some use cases have many
           combinations of a small number of elements with nearby
           hashes so that many distinct combinations collapse to only
           a handful of distinct hash values. */
        h = entry->hash;
        hash ^= (h ^ (h << 16) ^ 89869747UL)  * 3644798167UL;
    }
    hash = hash * 69069U + 907133923UL;
    if (hash == -1)
        hash = 590923713UL;
    so->hash = hash;
    return hash;
}

And the equivalent implementation in Python:

def _hash(self):
    MAX = sys.maxsize  # sys.maxint in the original Python 2 code
    MASK = 2 * MAX + 1
    n = len(self)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in self:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747)  * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h
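Because the routine only uses `len()` and iteration, it can be lifted into a standalone function to observe its key property: the result depends only on the elements, never on visitation order, since the per-element terms are combined with XOR. This is a sketch; the name `frozenset_hash_py` is mine, not CPython's:

```python
import sys

def frozenset_hash_py(elems):
    # Standalone port of the _hash() method above; `elems` is any
    # sized iterable of hashable values.
    MAX = sys.maxsize
    MASK = 2 * MAX + 1
    n = len(elems)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in elems:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h

# XOR makes the per-element terms commute, so order never matters:
print(frozenset_hash_py([1, 2, 3]) == frozenset_hash_py([3, 1, 2]))  # True
```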


Solution


Unless Raymond Hettinger (the code's author) chimes in, we'll never know for sure ;-) But there's usually less "science" in these things than you might expect: you take some general principles, and a test suite, and fiddle the constants almost arbitrarily until the results look "good enough".


Some general principles "obviously" at work here:

  1. To get the desired quick "bit dispersion", you want to multiply by a large integer. Since CPython's hash result has to fit in 32 bits on many platforms, an integer that requires 32 bits is best for this. And, indeed, (3644798167).bit_length() == 32.

  2. To avoid systematically losing the low-order bit(s), you want to multiply by an odd integer. 3644798167 is odd.

  3. More generally, to avoid compounding patterns in the input hashes, you want to multiply by a prime. And 3644798167 is prime.

  4. And you also want a multiplier whose binary representation doesn't have obvious repeating patterns. bin(3644798167) == '0b11011001001111110011010011010111'. That's pretty messed up, which is a good thing ;-)
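All three properties of the multiplier are easy to verify directly; this sketch uses a naive trial-division primality test (my addition, fine for a one-off 32-bit check):

```python
def is_prime(n):
    # Naive trial division; plenty fast for a single 32-bit number.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

MULT = 3644798167
print(MULT.bit_length())  # 32: fills the full 32-bit hash width
print(MULT % 2 == 1)      # True: odd, so low bits aren't lost
print(is_prime(MULT))     # True, per the answer's claim
```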


The other constants look utterly arbitrary to me. The

if h == -1:
    h = 590923713


part is needed for another reason: internally, CPython takes a -1 return value from an integer-valued C function as meaning "an exception needs to be raised"; i.e., it's an error return. So you'll never see a hash code of -1 for any object in CPython. The value returned instead of -1 is wholly arbitrary - it just needs to be the same value (instead of -1) each time.
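The same convention is visible for every built-in type: CPython quietly remaps any hash of -1, which is why the integer -1 hashes to -2 and collides with -2 by design:

```python
# CPython reserves -1 as a C-level error sentinel, so hash() never
# returns it; the int -1 is remapped to -2.
print(hash(-1))              # -2
print(hash(-1) == hash(-2))  # True: -1 and -2 collide by design
```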

Playing around


I don't know what Raymond used to test this. Here's what I would have used: look at hash statistics for all subsets of a set of consecutive integers. Those are problematic because hash(i) == i for a great many integers i.

>>> all(hash(i) == i for i in range(1000000))
True


Simply xor'ing hashes together will yield massive cancellation on inputs like that.
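The cancellation is easy to exhibit. Because hash(i) == i for small ints, XOR alone collapses many distinct sets onto the same value:

```python
# With hash(i) == i for small ints, plain XOR collapses distinct sets:
a = hash(1) ^ hash(2)  # 1 ^ 2 == 3, the same as hash(3)
b = hash(3)
print(a == b)  # True: {1, 2} and {3} would collide under a pure-XOR hash
```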


So here's a little function to generate all subsets, and another to do a dirt-simple xor across all hash codes:

def hashxor(xs):
    h = 0
    for x in xs:
        h ^= hash(x)
    return h

def genpowerset(xs):
    from itertools import combinations
    for length in range(len(xs) + 1):
        for t in combinations(xs, length):
            yield t
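As a quick sanity check on the generator (repeated here so the snippet runs on its own): a set of n elements has 2**n subsets, and genpowerset should yield exactly that many tuples, starting with the empty one:

```python
from itertools import combinations

def genpowerset(xs):
    # Yield every subset of xs as a tuple, smallest subsets first.
    for length in range(len(xs) + 1):
        for t in combinations(xs, length):
            yield t

count = sum(1 for _ in genpowerset(range(10)))
print(count)  # 1024 == 2 ** 10
```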


Then a driver, and a little function to display collision statistics:

def show_stats(d):
    total = sum(d.values())
    print("total", total, "unique hashes", len(d),
          "collisions", total - len(d))

def drive(n, hasher=hashxor):
    from collections import defaultdict
    d = defaultdict(int)

    for t in genpowerset(range(n)):
        d[hasher(t)] += 1
    show_stats(d)


Using the dirt-simple hasher is disastrous:

>>> drive(20)
total 1048576 unique hashes 32 collisions 1048544


Yikes! OTOH, using the _hash() designed for frozensets does a perfect job in this case:

>>> drive(20, _hash)
total 1048576 unique hashes 1048576 collisions 0


Then you can play with that to see what does - and doesn't - make a real difference in _hash(). For example, it still does a perfect job on these inputs if

    h = h * 69069 + 907133923


is removed. And I have no idea why that line is there. Similarly, it continues to do a perfect job on these inputs if the ^ 89869747 in the inner loop is removed - don't know why that's there either. And initialization can be changed from:

    h = 1927868237 * (n + 1)

to:

    h = n


without harm here too. That all jibes with what I expected: it's the multiplicative constant in the inner loop that's crucial, for reasons already explained. For example, add 1 to it (use 3644798168) and then it's no longer prime or odd, and the stats degrade to:

total 1048576 unique hashes 851968 collisions 196608


Still quite usable, but definitely worse. Change it to a small prime, like 13, and it's worse:

total 1048576 unique hashes 483968 collisions 564608


Use a multiplier with an obvious binary pattern, like 0b01010101010101010101010101010101, and worse again:

total 1048576 unique hashes 163104 collisions 885472


Play around! These things are fun :-)
