How is std::unordered_map implemented?
C++ unordered_map collision handling, resizing, and rehashing
This is a follow-up to a previous question of mine, and I have realized that I am quite confused about how unordered_map is implemented. I am sure many other people share that confusion. Based on what I know without having read the standard:
Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements
I was hoping that someone might explain the implementation, how it fits with the C++ standard definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it can be improved.
Recommended Answer
The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:
- the default max_load_factor() being 1.0 (which means the table will be resized whenever size() would otherwise exceed bucket_count()), and
- the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.
References:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container's bucket count, and z is the container's maximum load factor.

Amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.
† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.
Regarding the text you quoted:
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements
There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use: it copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.
As evidence of the awareness, from Matthew Austern's proposal here:
I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

- It's necessary to distinguish between a vacant position and an occupied one.
- It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.
- Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.
- Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.
- Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.
Specifically, for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.
A full comparison and elaboration of hash table design options and their implications is off topic for S.O., as it's way too broad to address properly here.