Cache lines, false sharing and alignment

2021-12-30 00:00:00 multithreading parallel-processing caching c++

I wrote the following short C++ program to reproduce the false sharing effect as described by Herb Sutter:

Say, we want to perform a total amount of WORKLOAD integer operations and we want them to be equally distributed to a number (PARALLEL) of threads. For the purpose of this test, each thread will increment its own dedicated variable from an array of integers, so the process may be ideally parallelizable.

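#include <thread>
using std::thread;

// WORKLOAD, PARALLEL and PADDING are #define'd constants.

// Each thread increments only the int it was handed.
void thread_func(int* ptr) {
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i) {
        (*ptr)++;
    }
}

int main() {
    int arr[PARALLEL * PADDING] = {};  // zero-initialized so the increments are well defined
    thread threads[PARALLEL];
    for (unsigned i = 0; i < PARALLEL; ++i) {
        // Thread i works on the element PADDING ints after thread i-1's element.
        threads[i] = thread(thread_func, &(arr[i * PADDING]));
    }
    for (auto& th : threads) {
        th.join();
    }
    return 0;
}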

I think the idea is easy to grasp. If you set

#define PADDING 16

every thread will work on a separate cache line (assuming a cache line is 64 bytes long and sizeof(int) is 4, so that 16 ints fill exactly one line). So the speedup should increase linearly until PARALLEL exceeds the number of cores. If, on the other hand, PADDING is set to any value below 16, one should encounter severe contention, because at least two threads are now likely to operate on the same cache line, which is in effect protected by a built-in hardware mutex. We would expect our speedup not only to be sublinear in this case, but even to be always < 1, because of the invisible lock convoy.
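The same padding idea can also be expressed in the type itself rather than through array indices. The following is only a minimal sketch, assuming 64-byte cache lines; `PaddedCounter`, `kThreads` and `run_padded` are hypothetical names that do not appear in my program above.

```cpp
#include <cassert>
#include <thread>

// Assumption: cache lines are 64 bytes long, as in the question.
struct alignas(64) PaddedCounter {
    // volatile keeps every increment a real memory access, even with optimization on.
    volatile long value = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "each counter should fill one cache line");

constexpr unsigned kThreads = 4;  // hypothetical stand-in for PARALLEL

// Starts kThreads threads, each incrementing only its own counter
// per_thread times, and returns the sum of all counters.
long run_padded(unsigned long per_thread) {
    PaddedCounter counters[kThreads];
    std::thread threads[kThreads];
    for (unsigned i = 0; i < kThreads; ++i) {
        PaddedCounter* c = &counters[i];
        threads[i] = std::thread([c, per_thread] {
            for (unsigned long n = 0; n < per_thread; ++n) {
                c->value = c->value + 1;
            }
        });
    }
    long total = 0;
    for (unsigned i = 0; i < kThreads; ++i) {
        threads[i].join();
        assert(counters[i].value == static_cast<long>(per_thread));
        total += counters[i].value;
    }
    return total;
}
```

Calling `run_padded(1000000)` and timing it against a variant without `alignas(64)` would be one way to look for the effect; the actual numbers depend on the machine and compiler flags.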

Now, my first attempts nearly satisfied these expectations, yet the minimum value of PADDING needed to avoid false sharing was around 8 and not 16. I was quite puzzled for about half an hour until I came to the obvious conclusion that there is no guarantee that my array is aligned exactly to the beginning of a cache line in main memory. The actual alignment may vary depending on many conditions, including the size of the array.
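To see where an array actually starts relative to a cache line, one can inspect the address directly. This is a small sketch assuming 64-byte lines; `kLine`, `offset_in_line` and `ints_to_next_line` are hypothetical helper names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;  // assumed cache-line size in bytes

// Byte offset of p from the start of the cache line that contains it.
std::size_t offset_in_line(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kLine;
}

// Number of ints to skip from p to reach the next cache-line boundary
// (0 if p already sits on a boundary).
std::size_t ints_to_next_line(const int* p) {
    std::size_t off = offset_in_line(p);
    assert(off % sizeof(int) == 0);  // ints are expected to be int-aligned
    return off == 0 ? 0 : (kLine - off) / sizeof(int);
}
```

Printing `offset_in_line(arr)` for the array in `main` shows whether it happens to begin on a line boundary in a given build.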

In this example, there is of course no need for us to have the array aligned in a special way, because we can just leave PADDING at 16 and everything works out fine. But one could imagine cases where it does make a difference whether a certain structure is aligned to a cache line or not. Hence, I added some lines of code to find the first cache-line-aligned element of my array and start from there.

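#include <cstdint>
#include <thread>
using std::thread;

int main() {
    int arr[PARALLEL * 16] = {};
    thread threads[PARALLEL];
    // Advance to the first element whose address is a multiple of 64.
    // uintptr_t is used instead of int so the cast also compiles on 64-bit targets.
    int offset = 0;
    while (reinterpret_cast<std::uintptr_t>(&arr[offset]) % 64) ++offset;
    for (unsigned i = 0; i < PARALLEL; ++i) {
        threads[i] = thread(thread_func, &(arr[i * 16 + offset]));
    }
    for (auto& th : threads) {
        th.join();
    }
    return 0;
}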

Although this solution worked out fine for me in this case, I'm not sure whether it would be a good approach in general. So here is my question:

Is there any common way to have objects in memory aligned to cache lines other than what I did in the above example?

(using g++ MinGW Win32 x86 v.4.8.1 posix dwarf rev3)

Recommended answer

You should be able to request the required alignment from the compiler:

alignas(64) int arr[PARALLEL * PADDING]; // align the array to a 64 byte line
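To check that the request has the intended effect, the address of each thread's slot can be tested. The sketch below assumes 64-byte lines and 4-byte ints; `kParallel`, `kPadding` and `slots_are_line_aligned` are hypothetical names standing in for the question's macros.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kParallel = 4;   // hypothetical stand-in for PARALLEL
constexpr unsigned kPadding = 16;   // 16 ints * 4 bytes = 64 bytes per slot
static_assert(sizeof(int) == 4, "this sketch assumes 4-byte ints");

// Returns true if every slot handed to a thread (arr[i * kPadding])
// starts exactly on a 64-byte boundary.
bool slots_are_line_aligned(const int* arr) {
    for (unsigned i = 0; i < kParallel; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(arr + i * kPadding);
        if (addr % 64 != 0) {
            return false;
        }
    }
    return true;
}
```

With `alignas(64) int arr[kParallel * kPadding];` the check should pass without any manual offset search, whereas a pointer shifted by one int (`arr + 1`) fails it.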
