Why does a std::atomic store with sequential consistency use XCHG?
Why is std::atomic's store:
std::atomic<int> my_atomic;
my_atomic.store(1, std::memory_order_seq_cst);
doing an xchg when a store with sequential consistency is requested?
Shouldn't, technically, a normal store with a read/write memory barrier be enough? Equivalent to:
_ReadWriteBarrier(); // Or `asm volatile("" ::: "memory");` for gcc/clang
my_atomic.store(1, std::memory_order_acquire);
I'm explicitly talking about x86 & x86_64. Where a store has an implicit acquire fence.
Recommended Answer
mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86.
(x86's memory-ordering rules essentially make that full-barrier effect the only option for any atomic RMW: it's both a load and a store at the same time, stuck together in the global order. Atomicity requires that the load and store aren't separated by just queuing the store into the store buffer so it has to be drained, and load-load ordering of the load side requires that it not reorder.)
Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store that can't reorder with later ldar sequential-acquire loads. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker; relaxed not release.)
See Jeff Preshing's article on acquire / release semantics, and note that regular release stores (like mov or any non-locked x86 memory-destination instruction other than xchg) allow reordering with later operations, including acquire loads (like mov or any x86 memory-source operand). e.g. If the release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section.
There are performance differences between mfence and xchg on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases. And/or for throughput of many operations back-to-back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.
See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) / (%rsp) as a full barrier (when you don't already have a store to do).
On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results at the bottom of this SO answer.) Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.
I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079, SKL079 MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt-instrument approach that significantly increases the impact on surrounding code.
I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.
AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which older gcc uses, but Intel's compiler also uses xchg here.
When I tested, I got better throughput on Skylake for xchg than for mov+mfence in a single-threaded loop on the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.
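As a rough sketch of that kind of test (my own loop, not the exact harness used for the numbers above), something like the following can be timed with either code-gen strategy by switching compilers or flags:

#include <atomic>

std::atomic<int> my_atomic{0};

// Repeated seq-cst stores to the same location, back to back in one thread.
// Compare xchg vs. mov + mfence code-gen by timing this loop.
void store_throughput_loop() {
    for (int i = 0; i < 100000000; ++i) {
        my_atomic.store(i, std::memory_order_seq_cst);
    }
}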
See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available. (Use -m32 -mno-sse2 to get gcc to use xchg too.) The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.
The Linux kernel uses xchg for __smp_store_mb().
Update: recent GCC (like GCC10) changed to using xchg for seq-cst stores like other compilers do, even when SSE2 for mfence is available.
Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)
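As a sketch of what such an alternative looks like in source (assuming GCC/Clang extended inline asm on x86-64; the helper name and the -8(%rsp) offset, matching the benchmark linked above, are my own choices, not compiler output):

// Full barrier via a dummy locked RMW on the stack instead of mfence.
// Adding 0 reads and rewrites the same value, so memory is unchanged;
// the lock prefix makes it a full memory barrier.
static inline void full_barrier_lock_add() {
    __asm__ __volatile__("lock addl $0, -8(%%rsp)" ::: "memory", "cc");
}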
All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.
But multiple sources recommend using lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available.
See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence instead of locked instructions that was a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)
In Linux 4.14, smp_mb() uses mb(). That uses mfence if available, otherwise lock addl $0, 0(%esp).
__smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).
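In portable C++ terms, a helper with the same effect could look like this minimal sketch (the template and its name are mine; the kernel's actual implementation is an inline-asm macro): do the store as an exchange so the implicit lock prefix supplies the full barrier.

#include <atomic>

// Hypothetical "store + full memory barrier" helper in the spirit of
// __smp_store_mb(): on x86 the exchange compiles to xchg.
template <typename T>
void store_mb(std::atomic<T>& obj, T value) {
    obj.exchange(value, std::memory_order_seq_cst);
}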
In Linux 4.15, smp_mb() uses lock; addl $0,-4(%esp) or %rsp, instead of using mb(). (The kernel doesn't use a red zone even in 64-bit, so the -4 may help avoid extra latency for local vars.)
mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (it affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.
In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE), which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
x86 & x86_64. Where a store has an implicit acquire fence
You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.
Or asm volatile("" ::: "memory");
No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)
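To illustrate the difference, a minimal sketch (assuming GCC/Clang syntax; the names are mine) contrasting a compiler-only barrier with a real seq-cst fence:

#include <atomic>

std::atomic<int> flag{0}, other{0};

int compiler_barrier_only() {
    flag.store(1, std::memory_order_relaxed);
    asm volatile("" ::: "memory");                 // no instruction emitted: compile-time ordering only
    return other.load(std::memory_order_relaxed);  // CPU can still satisfy this load while the store
                                                   // waits in the store buffer
}

int runtime_full_barrier() {
    flag.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence (or a locked op): the store must
                                                          // become globally visible first
    return other.load(std::memory_order_relaxed);
}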
Anyway, another way to express what you want here is:
my_atomic.store(1, std::memory_order_release); // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence
Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). A release-acquire fence would do the trick, though, keeping later loads from happening early and not itself being able to reorder with the release store.
Related: Jeff Preshing's article on how fences are different from release operations.
But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all stores have a single total order which all cores agree on. See also Globally Invisible load instructions: loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)
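For example, the IRIW litmus test (a sketch with my own names) shows what that single total order buys at the C++ level, beyond what release/acquire alone guarantees:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer_x() { x.store(1, std::memory_order_seq_cst); }
void writer_y() { y.store(1, std::memory_order_seq_cst); }

void reader_1() {
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

void reader_2() {
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

// With seq_cst everywhere, r1==1 && r2==0 && r3==1 && r4==0 is forbidden: both
// readers must see the two independent stores in the same order.  With acquire
// loads and release stores, the C++ memory model allows the readers to disagree,
// even though x86 hardware itself won't produce that outcome for ordinary loads
// and stores.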