与 std::hash 意外冲突

我知道将无限数量的字符串散列到 32b int 必须产生冲突,但我希望散列函数有一些不错的分布.

I know hashing infinite number of string into 32b int must generate collision, but I expect from hashing function some nice distribution.

这两个字符串有相同的哈希值是不是很奇怪?

Isn't it weird that these 2 strings have the same hash?

size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1

我知道我可以使用 boost::hash<std::string> 或其他,但我想知道 std::hash 有什么问题.我用错了吗?我不应该以某种方式播种"它吗?

I know I can use boost::hash<std::string> or others, but I want to know what is wrong with std::hash. Am I using it wrong? Shouldn't I somehow "seed" it?

推荐答案

您对 std::hash 的使用没有任何问题.问题在于,与 Visual Studio 2010 捆绑的标准库实现提供的专门化 std::hash<std::string> 仅采用字符串字符的子集来确定哈希值(大概用于性能原因).巧合的是,具有 14 个字符的字符串的最后一个字符不属于该集合,这就是为什么两个字符串产生相同哈希值的原因.

There's nothing wrong with your usage of std::hash. The problem is that the specialization std::hash<std::string> provided by the standard library implementation bundled with Visual Studio 2010 only takes a subset of the string's characters to determine the hash value (presumably for performance reasons). Coincidentally the last character of a string with 14 characters is not part of this set, which is why both strings yield the same hash value.

据我所知,这种行为符合标准,要求仅对具有相同参数的哈希函数的多次调用必须始终返回相同的值.然而,散列冲突的可能性应该是最小的.VS2010 实现实现了必选部分,但没有考虑可选部分.

As far as I know this behaviour is in conformance with the standard, which demands only that multiple calls to the hash function with the same argument must always return the same value. However, the probability of a hash collision should be minimal. The VS2010 implementation fulfills the mandatory part, yet fails to account for the optional one.

有关详细信息,请参阅头文件 xfunctional(从我的副本中的第 869 行开始)和 C++ 标准的 §17.6.3.4(最新公开草案).

For details, see the implementation in the header file xfunctional (starting at line 869 in my copy) and §17.6.3.4 of the C++ standard (latest public draft).

如果你绝对需要一个更好的字符串哈希函数,你应该自己实现它.实际上没那么难.

If you absolutely need a better hash function for strings, you should implement it yourself. It's actually not that hard.

相关文章