How is std::unordered_map implemented?

2022-01-08 00:00:00 hashmap c++ c++11 unordered-map

Collision handling, resizing, and rehashing in C++ unordered_map

This follows up on a previous question of mine. I have realized I am quite confused about how unordered_map is implemented, and I am sure many other people share that confusion. Based on what I know without having read the standard:

Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

I was hoping that someone might explain the implementation, how it fits the C++ standard's definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it could be improved.

Answer

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:

  • the default max_load_factor() being 1.0 (which means the table will be resized whenever size() would otherwise exceed bucket_count()), and
  • the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.
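How overwhelming can be quantified. Under Knuth's classical analysis of linear probing (a standard textbook result, not specific to any unordered_map implementation), the expected number of probes at load factor α is roughly:

```latex
C_{\text{hit}} \approx \tfrac{1}{2}\left(1 + \frac{1}{1-\alpha}\right),
\qquad
C_{\text{miss}} \approx \tfrac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right)
```

At α = 0.5 an unsuccessful lookup averages about 2.5 probes, but at α = 0.9 it averages about 50.5 - which is why closed-hashing tables must rehash well before the load factor reaches 1.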

Standard references:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

Amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.

† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.

Regarding the text you quoted:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of the awareness, from Matthew Austern's proposal here:

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

  • It's necessary to distinguish between a vacant position and an occupied one.

  • It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

  • Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

  • Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

  • Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.
