boost::flat_map and its performance compared to map and unordered_map
It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map, which is a vector-based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map, so I haven't been able to find any performance comparisons. How does it compare, and what are the best use cases for it?
Thanks!
Solution
I have run a benchmark on different data structures very recently at my company, so I feel I need to drop a word. It is very complicated to benchmark something correctly.
Benchmarking
On the web, we rarely find (if ever) a well-engineered benchmark. Until today, I had only found benchmarks done the journalist way (pretty quickly, sweeping dozens of variables under the carpet).
1) You need to consider cache warming
Most people running benchmarks are afraid of timer discrepancy, so they run their stuff thousands of times and take the total time; they are just careful to take the same number of thousands for every operation, and then consider the results comparable.
The truth is that in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things calling them once only. Intel has published a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).
2) RDTSC accuracy measurement
I also recommend doing this:
#include <algorithm>  // std::min, std::max
#include <climits>    // LLONG_MAX
#include <cstdio>     // printf

typedef unsigned long long u64;

u64 g_correctionFactor;  // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;
static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#include <intrin.h>  // __cpuid, __rdtsc
#pragma intrinsic(__rdtsc)

inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // flush the OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warm up cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);

    // measure the measurer overhead with the measurer (crazy, heh..)
    u64 minDiff = LLONG_MAX;
    u64 maxDiff = 0;  // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);  // make many takes, keep the smallest ever seen.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;
    printf("Correction factor %llu clocks\n", g_correctionFactor);
    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif
This is a discrepancy measurer, and it will take the minimum of all measured values, to avoid getting a -10**18 (the first negative values on 64 bits) from time to time.
Notice the use of intrinsics rather than inline assembly. First, inline assembly is rarely supported by compilers nowadays; but much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside. This is a problem when benchmarking real-world stuff, especially when calling things just once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.
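As a usage sketch building on the block above (my_operation is a hypothetical placeholder for whatever single cold call you want to time), a one-shot measurement subtracts the measurer overhead, and the result is meaningful within plus or minus g_accuracy clocks:

void my_operation();  // hypothetical: the single cold call under test

u64 TimeOnce()
{
    u64 tick1 = GetRDTSC();
    my_operation();
    u64 tick2 = GetRDTSC();
    return tick2 - tick1 - g_correctionFactor;  // remove the measurer's own overhead
}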
3) Parameters
The last problem is that people usually test too few variations of the scenario. A container's performance is affected by:
- (1) the allocator
- (2) the size of the contained type
- (3) the cost of the copy, assignment, move, and construction operations of the contained type
- (4) the number of elements in the container (the size of the problem)
- (5) whether the type has trivial implementations of the point (3) operations
- (6) whether the type is POD
Point 1 is important because containers allocate from time to time, and it matters a lot whether they allocate using the CRT "new" or some user-defined operation, like pool allocation, a freelist, or other...
(For people interested in pt 1, join the mystery thread on gamedev about the system allocator performance impact.)
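For readers who want to see point 1 by themselves, here is a minimal sketch (all names are made up for the example) of a counting allocator that makes visible how often a container actually hits the allocator; a node-based std::map lands in allocate once per insertion, whereas a flat container only does so on growth:

#include <cstddef>  // std::size_t
#include <cstdlib>  // std::malloc, std::free
#include <map>

static std::size_t g_allocCalls = 0;  // global counter of allocations

template <typename T>
struct CountingAlloc
{
    using value_type = T;
    CountingAlloc() = default;
    template <typename U> CountingAlloc(const CountingAlloc<U>&) {}

    T* allocate(std::size_t n)
    {
        ++g_allocCalls;  // a node-based container lands here once per insert
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <typename T, typename U> bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <typename T, typename U> bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

// usage: std::map<int, int, std::less<int>, CountingAlloc<std::pair<const int, int>>> m;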
Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.
Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.
Point 4 is a question of big-O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of elements (like vector vs. map: the vector's cache locality is good, while map fragments the memory). And then, at some crossing point, they lose, because the contained overall size starts to "leak" into main memory and cause cache misses, plus the fact that the asymptotic complexity starts to be felt.
Point 5 is about compilers being able to elide operations that are empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated, so each type will have its own performance profile.
Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specialization or SFINAE to select algorithms according to the traits of T.
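A minimal sketch of that kind of selection (illustrative only, not any library's actual code), using tag dispatch on std::is_trivially_copyable to pick a memcpy fast path:

#include <cstddef>      // std::size_t
#include <cstring>      // std::memcpy
#include <type_traits>  // std::is_trivially_copyable

template <typename T>
void copy_range_impl(T* dst, const T* src, std::size_t n, std::true_type)
{
    std::memcpy(dst, src, n * sizeof(T));  // POD fast path: one flat copy
}

template <typename T>
void copy_range_impl(T* dst, const T* src, std::size_t n, std::false_type)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];  // generic path: per-element assignment
}

template <typename T>
void copy_range(T* dst, const T* src, std::size_t n)
{
    // the trait derives from std::true_type or std::false_type, selecting the overload
    copy_range_impl(dst, src, n, std::is_trivially_copyable<T>{});
}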
About the flat map
Apparently, the flat map is a sorted vector wrapper, like Loki's AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insertion and deletion of single elements.
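A minimal sketch of that idea (a toy for illustration, not boost's actual implementation): keep a vector of pairs sorted by key, binary-search with lower_bound, and shift the tail on insertion:

#include <algorithm>  // std::lower_bound
#include <utility>    // std::pair, std::make_pair
#include <vector>

struct FlatIntMap  // hypothetical toy, int -> int only
{
    std::vector<std::pair<int, int>> data;  // invariant: kept sorted by key

    static bool keyLess(const std::pair<int, int>& p, int k) { return p.first < k; }

    void insert(int key, int value)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key, keyLess);
        if (it == data.end() || it->first != key)
            data.insert(it, std::make_pair(key, value));  // O(N): shifts the tail, but contiguous and cache-friendly
    }

    int* find(int key)  // returns nullptr when absent
    {
        auto it = std::lower_bound(data.begin(), data.end(), key, keyLess);
        return (it != data.end() && it->first == key) ? &it->second : nullptr;
    }
};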
This is still an ordered container. Most people usually don't need the ordering part; hence the existence of unordered_map.
Have you considered that maybe you need a flat_unorderedmap? That would be something like google::sparse_map or something like that: an open-addressing hash map.
The problem with open-addressing hash maps is that at rehash time they have to copy everything to the new, extended flat storage, whereas a standard unordered map just has to recreate the hash index while the allocated data stays where it is. The disadvantage, of course, is that the memory is fragmented like hell.
A rehash in an open-addressing hash map is triggered when the number of stored elements exceeds the size of the bucket vector multiplied by the load factor. A typical load factor is 0.8; therefore, you need to care about that: if you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon. This guarantees never having to spuriously rehash and recopy everything during filling, as the sketch below shows.
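A sketch of that rule with std::unordered_map (a closed-addressing map, but the pre-sizing logic is the same; note that reserve() already divides by max_load_factor internally, so it implements the intended_filling * (1/load_factor) formula for you):

#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    int const intended_filling = 100000;  // assumption: element count known up front
    m.max_load_factor(0.8f);
    m.reserve(intended_filling);  // sizes the bucket table for intended_filling elements at load factor 0.8

    for (int i = 0; i < intended_filling; ++i)
        m.emplace(i, i);  // no rehash occurs during this filling
}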
The advantage of closed-addressing maps (std::unordered_map and friends) is that you don't have to care about those parameters.
But boost::flat_map is an ordered vector; therefore, it will always have O(log N) asymptotic complexity, which is worse than the open-addressing hash map's amortized constant time. You should consider that as well.
Benchmark results
This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.
Tested types information:
typeid=__int64 . sizeof=8 . ispod=yes
typeid=struct MediumTypePod . sizeof=184 . ispod=yes
Insertion
EDIT:
My previous results included a bug: they actually tested ordered insertion, which exhibited very fast behavior for the flat maps.
I left those results further down this page because they are interesting.
This is the correct test:
I have checked the implementation: there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, so this benchmark exhibits the asymptotic tendencies:
map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)
Warning: hereafter, the 2 tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs. random insertion for the other containers; yes, it's confusing, sorry):
We can see that ordered insertion results in back-pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimum for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; here we observe 4.8M for the ordered insertion into the flat_map (therefore 160% of the optimum).
Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.
Random search of 3 elements (clocks renormalized to 1)
[chart: in size = 100]
[chart: in size = 10000]
Iteration
[chart: over size 100 (MediumPod type only)]
[chart: over size 10000 (MediumPod type only)]
Final grain of salt
In the end, I wanted to come back to "Benchmarking §3 Pt 1" (the system allocator). In a recent experiment around the performance of an open-addressing hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.