How is std::unordered_map implemented?

2022-01-08 00:00:00 hashmap c++ c++11 unordered-map

Collision handling, resizing, and rehashing in C++ unordered_map

This follows up on a previous question of mine. I have realized I am quite confused about how unordered_map is implemented, and I am sure many other people share that confusion. Based on what I know without having read the standard:

Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

I was hoping that someone might explain the implementation, how it fits the C++ standard's definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it could be improved.

Answer

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:

  • the default max_load_factor() being 1.0 (which means the table will be resized whenever size() would otherwise exceed bucket_count()), and
  • the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.
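How overwhelming can be quantified. Under Knuth's classical analysis of linear probing (a standard textbook result, not specific to any unordered_map implementation), the expected number of probes at load factor α is roughly:

```latex
C_{\text{hit}} \approx \tfrac{1}{2}\left(1 + \frac{1}{1-\alpha}\right),
\qquad
C_{\text{miss}} \approx \tfrac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right)
```

At α = 0.5 an unsuccessful lookup averages about 2.5 probes, but at α = 0.9 it averages about 50.5 - which is why closed-hashing tables must rehash well before the load factor reaches 1.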

Standard references:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

Amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.

† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.

Regarding the text you quoted:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of the awareness, from Matthew Austern's proposal here:

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

  • It's necessary to distinguish between a vacant position and an occupied one.

  • It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

  • Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

  • Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

  • Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.
