Is the gcc std::unordered_map implementation slow? If so - why?
We are developing highly performance-critical software in C++, where we need a concurrent hash map, and we implemented one. So we wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.
But std::unordered_map seems to be incredibly slow... So this is our micro-benchmark (for the concurrent map we spawned a new thread to make sure that locking does not get optimized away; note that I never insert 0, because I also benchmark with google::dense_hash_map, which needs a null value):
boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
std::vector<uint64_t> vec(SIZE);
for (int i = 0; i < SIZE; ++i) {
    uint64_t val = 0;
    while (val == 0) {
        val = dist(rng);
    }
    vec[i] = val;
}

std::unordered_map<int, long double> map;
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "inserts: " << elapsed.count() << std::endl;

std::random_shuffle(vec.begin(), vec.end());
begin = std::chrono::high_resolution_clock::now();
long double val;
for (int i = 0; i < SIZE; ++i) {
    val = map[vec[i]];
}
end = std::chrono::high_resolution_clock::now();
elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "get: " << elapsed.count() << std::endl;
(The whole source code can be found here: http://pastebin.com/vPqf7eya)
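One caveat about the lookup loop above: val is never read after the loop, so an aggressive optimizer could drop the loaded values (operator[] itself cannot be removed, because it may insert a missing key). A common hedge is to accumulate the results, for example:

long double checksum = 0.0L;
for (int i = 0; i < SIZE; ++i) {
    checksum += map[vec[i]];   // accumulate so the reads stay observable
}
std::cout << "checksum: " << checksum << std::endl;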
The results for std::unordered_map are:
inserts: 35126
get : 2959
And for google::dense_map:
inserts: 3653
get : 816
For our hand-backed concurrent map (which does locking, although the benchmark is single-threaded - but in a separate spawned thread):
inserts: 5213
get : 2594
If I compile the benchmark program without pthread support and run everything in the main thread, I get the following results for the hand-backed concurrent map:
inserts: 4441
get : 1180
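The hand-backed concurrent map itself is not shown here; purely for illustration, a coarse-grained locked wrapper might look like the sketch below (an assumption, not our actual implementation - our real map uses its own bucket array starting at size 64, as mentioned further down). The point is only that the benchmark pays for an uncontended lock on every operation:

#include <cstdint>
#include <mutex>
#include <unordered_map>

// Illustration only: a map guarded by a single mutex. Even in a
// single-threaded benchmark, every put/get pays for an uncontended lock.
class LockedMap {
public:
    void put(uint64_t key, long double value) {
        std::lock_guard<std::mutex> guard(mutex_);
        map_[key] = value;
    }

    long double get(uint64_t key) {
        std::lock_guard<std::mutex> guard(mutex_);
        return map_[key];
    }

private:
    std::mutex mutex_;
    std::unordered_map<uint64_t, long double> map_;
};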
I compile using the following command:
g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc
So especially the inserts on std::unordered_map seem to be extremely expensive - 35 seconds vs. 3-5 seconds for the other maps. The lookup time also seems to be quite high.
My question: why is this? I read another question on Stack Overflow where someone asks why std::tr1::unordered_map is slower than his own implementation. There the highest-rated answer states that std::tr1::unordered_map has to implement a more complicated interface. But I cannot see this argument: we use a bucket approach in our concurrent_map, and std::unordered_map uses a bucket approach too (google::dense_hash_map does not, but then std::unordered_map should be at least as fast as our hand-backed concurrency-safe version?). Apart from that, I cannot see anything in the interface that forces a feature which makes the hash map perform badly...
所以我的问题是:std::unordered_map
是否真的很慢?如果不是:有什么问题?如果是:原因是什么.
So my question: is it true that std::unordered_map
seems to be very slow? If no: what is wrong? If yes: what is the reason for that.
And my main question: why is inserting a value into a std::unordered_map so terribly expensive (even if we reserve enough space at the beginning, it does not perform much better - so rehashing does not seem to be the problem)?
First of all: yes, the presented benchmark is not flawless - this is because we played around with it a lot and it is just a hack (for example, the uint64 distribution used to generate ints would not be a good idea in practice, excluding 0 in a loop is kind of stupid, etc...).
At the moment most comments explain that I could make the unordered_map faster by preallocating enough space for it. In our application this is simply not possible: we are developing a database management system and need a hash map to store some data during a transaction (for example, locking information). So this map can hold anything from 1 entry (the user just makes one insert and commits) to billions of entries (if full table scans happen). It is impossible to preallocate enough space here (and just allocating a lot at the beginning would consume too much memory).
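For reference, what the commenters suggest amounts to reserving buckets up front; a minimal sketch (reusing vec and SIZE from the benchmark above, with the key type widened to uint64_t), which only helps when the final size is roughly known:

std::unordered_map<uint64_t, long double> map;
map.reserve(SIZE);              // pre-allocates buckets for SIZE elements
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;          // inserts no longer trigger rehashes
}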
Furthermore, I apologize that I did not state my question clearly enough: I am not really interested in making unordered_map fast (using Google's dense hash map works fine for us), I just don't understand where this huge performance difference comes from. It cannot be just preallocation (even with enough preallocated memory, the dense map is an order of magnitude faster than unordered_map, and our hand-backed concurrent map starts with an array of size 64 - so a smaller one than unordered_map).
So what is the reason for this bad performance of std::unordered_map? Or, asked differently: could one write an implementation of the std::unordered_map interface that is standard-conforming and (nearly) as fast as Google's dense hash map? Or is there something in the standard that forces the implementer to choose an inefficient way to implement it?
By profiling I see that a lot of time is spent on integer divisions. std::unordered_map uses prime numbers for the array size, while the other implementations use powers of two. Why does std::unordered_map use prime numbers? To perform better if the hash is bad? For good hashes it makes, imho, no difference.
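To make the profiling observation concrete: the cost difference is in how a hash value is mapped to a bucket index. Roughly, the two strategies look like the sketch below (not the actual libstdc++ or dense_hash_map code):

#include <cstddef>
#include <cstdint>

// libstdc++ style: the bucket count is a prime, so mapping a hash to a bucket
// needs an integer modulo - a comparatively expensive instruction.
std::size_t bucket_prime(std::uint64_t hash, std::size_t prime_bucket_count) {
    return hash % prime_bucket_count;
}

// dense_hash_map style: the bucket count is a power of two, so the same
// mapping is a single bitwise AND - but a weak hash then collides on the
// low bits, which is one argument for primes.
std::size_t bucket_pow2(std::uint64_t hash, std::size_t pow2_bucket_count) {
    return hash & (pow2_bucket_count - 1);
}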
These are the numbers for std::map:
inserts: 16462
get : 16978
Sooooooo: why are inserts into a std::map faster than inserts into a std::unordered_map... I mean, WAT? std::map has worse locality (tree vs. array), needs to make more allocations (one per insert vs. one per rehash plus ~1 for each collision) and, most importantly, has a different algorithmic complexity (O(log n) vs. O(1))!
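As a side note on methodology, one way to make sure every candidate runs exactly the same loop is to factor the timing into a small template; a hypothetical sketch (the pastebin code may be structured differently):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical helper: runs the same insert loop over any map-like type with
// operator[], so std::map and std::unordered_map are measured identically
// (google::dense_hash_map would additionally need its empty key set first).
template <typename Map>
void benchmark_inserts(const std::vector<std::uint64_t>& keys, const char* name) {
    Map map;
    auto begin = std::chrono::high_resolution_clock::now();
    for (std::uint64_t key : keys) {
        map[key] = 0.0L;
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << name << " inserts: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()
              << std::endl;
}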
Recommended answer
I found the reason: it is a problem of gcc-4.7!!
With gcc-4.7:
inserts: 37728
get : 2985
With gcc-4.6:
inserts: 2531
get : 1565
So std::unordered_map in gcc-4.7 is broken (or my installation, which is an installation of gcc-4.7.0 on Ubuntu - and another installation, gcc 4.7.1 on Debian testing).
I will submit a bug report. Until then: DO NOT use std::unordered_map with gcc 4.7!
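If you want your build to fail loudly on the affected compiler, one option is a compile-time check against gcc's predefined version macros (a sketch; adapt to your build system):

// Fail the build when compiling with the affected gcc 4.7 series.
// __GNUC__ and __GNUC_MINOR__ are gcc's predefined version macros.
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ == 4) && (__GNUC_MINOR__ == 7)
#  error "std::unordered_map is pathologically slow with gcc 4.7 - use gcc 4.6 or a fixed release"
#endif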