How and when to align to cache line size?

2021-12-20 00:00:00 c caching c++

In Dmitry Vyukov's excellent bounded MPMC queue written in C++ (see http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue), he adds some padding variables. I presume this is to align the fields to cache lines for performance.

I have some questions.

  1. Why is it done this way?
  2. Is it a portable method that will always work?
  3. In what cases would it be best to use __attribute__((aligned(64))) instead?
  4. Why would padding before a buffer pointer help with performance? Isn't only the pointer loaded into the cache, so that it is really only the size of a pointer?

static size_t const     cacheline_size = 64;
typedef char            cacheline_pad_t [cacheline_size];

cacheline_pad_t         pad0_;
cell_t* const           buffer_;
size_t const            buffer_mask_;
cacheline_pad_t         pad1_;
std::atomic<size_t>     enqueue_pos_;
cacheline_pad_t         pad2_;
std::atomic<size_t>     dequeue_pos_;
cacheline_pad_t         pad3_;

Would this concept work under gcc for C code?

Recommended answer

It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores try to modify different data that happens to live on the same cache line, and thus waste time moving that whole line back and forth, that's known as false sharing.
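To make the false-sharing condition concrete, here is a minimal sketch that only inspects addresses (it does not measure performance). It assumes a 64-byte line, as in the question; the names Packed, Separated and same_cache_line are made up for illustration, and standard alignas is used here in place of any compiler-specific attribute.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size, as in the question

// Two counters packed next to each other. The struct itself starts on a
// line boundary, so both counters sit in the same 64-byte line.
struct alignas(kCacheLine) Packed {
    std::atomic<std::size_t> a{0};
    std::atomic<std::size_t> b{0};
};

// Each counter is forced to start its own 64-byte line.
struct Separated {
    alignas(kCacheLine) std::atomic<std::size_t> a{0};
    alignas(kCacheLine) std::atomic<std::size_t> b{0};
};

// True when both addresses fall inside the same 64-byte line.
inline bool same_cache_line(const void* p, const void* q) {
    auto line = [](const void* x) {
        return reinterpret_cast<std::uintptr_t>(x) / kCacheLine;
    };
    return line(p) == line(q);
}
```

If one thread wrote only Packed::a and another only Packed::b, they would still contend for one line; with Separated they would not, at the cost of a 128-byte object.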

In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_) without either core stalling on a cache line owned by the other.

The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.

I'm unsure whether the technique is entirely portable. The assumption is that each cacheline_pad_t will itself be aligned to a 64 byte (its size) cache line boundary, and hence whatever follows it will be on the next cache line. So far as I know, the C and C++ language standards only require this of whole structures, so that they can live in arrays nicely, without violating alignment requirements of any of their members. (see comments)
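One way to probe that assumption is to look at the offsets directly. The sketch below mirrors the member layout from the question in a hypothetical struct QueueLayout (const qualifiers dropped, cell_t only forward-declared) and assumes the object gets nothing more than its natural alignment. Under that assumption the pads still keep the two counters at least 64 bytes apart, but buffer_ and buffer_mask_ can straddle a line boundary for some start addresses.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;      // assumed, as in the question
typedef char cacheline_pad_t[kCacheLine];

struct cell_t;  // hypothetical stand-in; only used through a pointer here

// Mirror of the queue's member layout (const qualifiers dropped for brevity).
struct QueueLayout {
    cacheline_pad_t          pad0_;
    cell_t*                  buffer_;
    std::size_t              buffer_mask_;
    cacheline_pad_t          pad1_;
    std::atomic<std::size_t> enqueue_pos_;
    cacheline_pad_t          pad2_;
    std::atomic<std::size_t> dequeue_pos_;
    cacheline_pad_t          pad3_;
};

// Cache-line index of a member at `offset` when the object starts at `base`.
constexpr std::size_t line_of(std::size_t base, std::size_t offset) {
    return (base + offset) / kCacheLine;
}

// Do the two counters land on different lines for every start address
// that the struct's natural alignment allows?
inline bool counters_always_separate() {
    for (std::size_t base = 0; base < kCacheLine; base += alignof(QueueLayout)) {
        if (line_of(base, offsetof(QueueLayout, enqueue_pos_)) ==
            line_of(base, offsetof(QueueLayout, dequeue_pos_)))
            return false;
    }
    return true;
}

// Is there some allowed start address where buffer_ and buffer_mask_
// end up on two different lines?
inline bool read_only_pair_can_split() {
    for (std::size_t base = 0; base < kCacheLine; base += alignof(QueueLayout)) {
        if (line_of(base, offsetof(QueueLayout, buffer_)) !=
            line_of(base, offsetof(QueueLayout, buffer_mask_)))
            return true;
    }
    return false;
}
```

So the padding alone appears enough to prevent false sharing between the counters, while keeping the read-only pair on one line additionally depends on where the object itself starts.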

The attribute approach would be more compiler specific, but might cut the size of this structure in half, since the padding would be limited to rounding up each element to a full cache line. That could be quite beneficial if one had a lot of these.
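As a rough comparison of the two layouts, the sketch below puts the padded members from the question next to an alignment-based variant. It uses standard C++11 alignas rather than __attribute__((aligned(64))), which should have the same effect on gcc; the struct names PaddedQueue and AlignedQueue and the forward-declared cell_t are placeholders, and the 64-byte line size is again an assumption.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // assumed, as in the question
typedef char cacheline_pad_t[kCacheLine];

struct cell_t;  // hypothetical stand-in; only used through a pointer here

// Explicit padding, following the member order in the question.
struct PaddedQueue {
    cacheline_pad_t          pad0_;
    cell_t*                  buffer_;
    std::size_t              buffer_mask_;
    cacheline_pad_t          pad1_;
    std::atomic<std::size_t> enqueue_pos_;
    cacheline_pad_t          pad2_;
    std::atomic<std::size_t> dequeue_pos_;
    cacheline_pad_t          pad3_;
};

// Alignment-based variant: each group of fields starts on its own line,
// and the compiler inserts only the padding needed to get there.
struct AlignedQueue {
    alignas(kCacheLine) cell_t*                  buffer_;
    std::size_t                                  buffer_mask_;
    alignas(kCacheLine) std::atomic<std::size_t> enqueue_pos_;
    alignas(kCacheLine) std::atomic<std::size_t> dequeue_pos_;
};
```

With 8-byte pointers and size_t this should come to three lines (192 bytes) for AlignedQueue against 288 bytes for PaddedQueue. One caveat: objects of over-aligned types created with plain new are only guaranteed to respect the alignment from C++17 onwards.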

The same concept applies in C as well as C++.
