python如何计算元组的哈希

2022-01-20 00:00:00 python tuples hash

问题描述

在 python 中,如果我有一个包含许多元素的元组,它的哈希值是根据其元素的 id 还是其元素的内容计算得出的?

In python, if I have a tuple with many elements, is its hash calculated from its elements' ids or its elements' content?

在这个例子中,

a = (1, [1,2])
hash(a)

它错误地说列表是不可散列的.所以我猜它不是由id计算的,或者可能检查元素是否可变.

It errors out saying list is unhashable. So I guess it's not computed by id, or probably there is a check on whether the element is mutable.

现在看这个例子

class A: pass
a0 = A()
ta = (1, a0)
hash(ta)  # -1122968024
a0.x = 20
hash(ta)  # -1122968024

这里发现 ta 的哈希值不会随着其元素的修改而改变,即 a0.那么可能 a0 的 id 用于哈希计算?a0 是否被认为是不可变的?python如何知道一个类型是否可变?

Here it turns out the hash of ta does not change with the modification of its element, i.e., a0. So maybe a0's id is used for the hash calculation? Is a0 somehow considered as immutable? How does python know if a type is mutable?

现在考虑这种情况

b = (1, 2)
id(b)  # 3980742764
c = (1, 2)
id(c)  # 3980732588
tb = (1, b)
tc = (1, c) 
hash(tb)  # -1383040070
hash(tc)  # -1383040070

看来bc的内容是用来做hash计算的.

It seems the content of b and c are used for the hash calculation.

我应该如何理解这些例子?

How should I understand these examples?


解决方案

如果我有一个包含许多元素的元组,它的哈希值是根据其元素的 id 还是其元素的内容计算得出的?

If I have a tuple with many elements, is its hash calculated from its elements' ids or its elements' content?

两者都没有.它是根据这些元素的哈希值计算的,而不是它们的内容".(值/属性),也不是 ID.

Neither. It is calculated on the basis of the hashes of these elements, not their "contents" (values/attributes), nor IDs.

查看python文档词汇表中的本段.

Take a look at this paragraph in python's documentation glossary.

某事物是否可散列,以及如何散列,取决于其 __hash__() 方法的实现.Python 本身不知道对象的可变性.

Whether something is hashable or not, and how it is hashed, depends on the implementation of its __hash__() method. By itself, Python has no idea about mutability of an object.

哈希在识别对象时很有用.例如,它加快了从 dict 中检索数据的速度,通过来自有限区间的单个数值(键的哈希)来识别键的任意值.

A hash is useful in identification of objects. For example, it speeds up data retrieval from a dict, identifying the arbitrary value of a key by a single numerical value from a finite interval - the key's hash.

哈希值在对象的整个生命周期内都应该保持不变.否则,一个对象可以映射到 dict 中的两个不同值,或者包含在 中>set 两次,只要它的哈希值发生变化.

A hash should remain unchanged throughout the lifetime of the object. Otherwise, one object could map to two different values in a dict, or be included into a set twice, as soon as its hash changes.

仅通过哈希值比较两个对象是不够的:归根结底,您可能仍需要执行相等性检查,因为不同对象的哈希之间可能存在冲突.这就是为什么可散列对象需要实现__eq__().

It's not enough to compare two objects by their hashes: at the end of the day, you may still need to perform equality checks, because there may be a collision between the hashes of different objects. That's why hashable objects are required to have __eq__() implemented.

这与可变性有关:如果一个可散列对象发生变异,从而改变了与散列对象的相等比较,尤其是那些具有相同散列的对象 - 它违反了合同,并可能导致相同的结果变异哈希会很奇怪.可哈希对象不应改变它们之间的比较.

This ties back to the mutability: if a hashable object mutates such that it changes equality comparisons with hashables, especially the ones with the same hash - it breaks the contract, and may result in the same weirdness a mutating hash would. Hashable objects should not mutate comparisons between themselves.

彼此相等的可散列对象应该具有相同的散列.这是一个通用的契约,它使一切变得更简单 - 很自然地假设 x== y 意味着 xy 都映射到 dict 中的相同值.

Hashable objects that are equal to each other should have the same hash. This is a general contract that makes everything else simpler - it's natural to assume x == y implies that both x and y map to the same value in a dict.

考虑您的第一个示例.tuple 根据其元素对自身进行散列,而其第二个元素 list 根本没有散列 - __hash__ 方法没有为它实现.所以 tuple.__hash__ 方法失败了.

Consider your first example. The tuple hashes itself on the basis of its elements, while its second element, the list, doesn't have a hash at all - the __hash__ method is not implemented for it. And so the tuple.__hash__ method fails.

这就是为什么内部带有 list 对象的 tuple 不可散列的原因.如您所见,因此说 tuple 哈希基于其元素的 ID 也是不正确的.

That's why a tuple with a list object inside of it is not hashable. As you can see, it is therefore also incorrect to say, that a tuple hash is based on the IDs of its elements.

注意,如果 list 在这里是可散列的,并且散列是基于其元素的,则更改它们会更改外部 tuple 的散列,从而违反合同.

Notice, that if the list was hashable here, and the hash was based on its elements, changing them would change the hash of the outer tuple, breaking the contract.

让我们看看 python 数据模型文档,以及它对该主题的看法:

Let's have a look at python data model documentation, and what it has to say on the topic:

用户定义的类默认有 __eq__()__hash__() 方法;使用它们,所有对象都比较不相等(除了它们自己)并且 x.__hash__() 返回一个适当的值,使得 x == y 暗示 x 是yhash(x) == hash(y).

User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).

简单来说,默认实现是比较对象identity,与对象attributes无关.这就是为什么您可以更改内部"值的原因.自定义类的对象而不更改其哈希值.

Put simply, the default implementation compares objects identity, which has nothing to do with object attributes. That's why you can change the values "inside" the object of your custom class without changing its hash.

这也是您不必为您的类定义 __hash__() 的原因 - 在这种情况下,python 会为您完成.

That's also why you don't have to define __hash__() for your classes - python does it for you in this case.

在这方面你是对的 - 自定义类的散列函数的默认(CPython 的)实现依赖于 id() 对象(而不是在它的内部"值上).这是一个实现细节,在 Python 版本之间有所不同.

In this regard you're right - the default (CPython's) implementation of the hashing function for custom classes relies on the id() of an object (and not on the values "inside" of it). It is an implementation detail, and it differs between Python versions.

在最新版本的 Python 中,hash()id() 之间的关系涉及随机化.这可以防止某些形式的 拒绝服务攻击,其中可能会创建任意哈希冲突显着降低 Web 应用程序的速度.请参阅 PEP-456.

In more recent versions of Python, the relation between hash() and id() involves randomization. This prevents some forms of denial of service attacks, where creating arbitrary hash collisions could significantly slow down web applications. See PEP-456.

虽然细节相当复杂,可能涉及一些高级数学,但元组对象的哈希函数的实现是用 C 编写的,可以看到 这里(参见static Py_hash_t tuplehash(PyTupleObject *v)).

While the details are quite complicated and probably involve some advanced math, the implementation of the hash function for tuple objects is written in C, and can be seen here (see static Py_hash_t tuplehash(PyTupleObject *v)).

计算涉及将一个常数与每个元组元素的哈希值进行异或运算.负责元素散列的行是这一行:

The calculation involves XORing a constant with the hashes of each of the tuple's elements. The line responsible for hashing of the elements is this one:

y = PyObject_Hash(*p++);


所以,回答您最初的问题:它使用 每个元素的散列 进行了一堆 XOR hokus-pocus.是否考虑这些元素的内容和属性取决于它们具体的哈希函数.


So, to answer your original question: it does a bunch of XOR hokus-pocus with the hashes of each of its elements. Whether or not the contents and attributes of these elements are considered depends on their specific hash functions.

相关文章