如何使用 c++11 CAS 实现 ABA 计数器?

2021-12-06 00:00:00 multithreading c++ lock-free lockless aba

我正在实现一个基于此算法的无锁队列,它使用一个计数器来解决 ABA 问题.但我不知道如何用 c++11 CAS 实现这个计数器.例如从算法:

I am implementing a lock-free queue based on this algorithm, which uses a counter to solve the ABA problem. But I don't know how to implement this counter with c++11 CAS. For example, from the algorithm:

E9:    if CAS(&tail.ptr->next, next, <node, next.count+1>)

这是一个原子操作,意思是如果tail.ptr->next等于next,让tail.ptr->nextcode> 指向 node 并同时(原子地)使 next.count+1.但是,使用C++11 CAS,我只能实现:

It is an atomic operation, meaning if tail.ptr->next is equal to next, let tail.ptr->next point to node and simultaneously (atomically) make next.count+1. However, using C++11 CAS, I can only implement:

std::atomic_compare_exchange_weak(&tail.ptr->next, next, node);

不能使 next.count+1 同时发生.

推荐答案

要通过一次原子操作同时原子地修改两个东西,您需要将它们放在相邻的内存中,例如在两个成员的结构中.然后你可以使用 std::atomic 让 gcc 发出 在 x86-64 上锁定 cmpxchg16b,例如.

To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct> to get gcc to emit lock cmpxchg16b on x86-64, for example.

为此您不需要内联汇编,为了避免它,需要一点 C++ 语法痛苦.https://gcc.gnu.org/wiki/DontUseInlineAsm.

You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm.

不幸的是,对于当前的编译器,您需要使用 union 来获得高效的代码以仅读取一对中的一个.显而易见"对结构进行原子加载然后只使用一个成员的方法仍然会导致 lock cmpxchg16b 读取整个结构,即使我们只需要一个成员.(慢得多,并且弄脏了缓存线,因此读者与其他读者竞争).我相信指针的正常 64b 负载仍然可以正确实现 x86 上的获取内存排序语义(以及原子性),但是当前的编译器即使对于 std::memory_order_relaxed<也不会进行这种优化/code>,所以我们用联合来欺骗他们.

Unfortunately, with current compilers, you need to use a union to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b to read the whole struct even though we only need one member. (Much slower, and dirties the cache line so readers contend with other readers). I'm confident that a normal 64b load of the pointer would still correctly implement the acquire memory-ordering semantics on x86 (as well as atomicity), but current compilers don't do that optimization even for std::memory_order_relaxed, so we trick them into it with a union.

(提交了 GCC 错误 80835 关于此.TODO:同样适用于如果这是一个有用的想法,请叮一下.)

(submitted GCC bug 80835 about this. TODO: same for clang if this is a useful idea.)

清单:

  • 确保您的编译器生成有效代码,以便在只读情况下仅加载一个成员,而不是一对 lock cmpxchg16b.例如使用联合.

确保您的编译器保证在编写不同的联合成员后访问联合的一个成员在该实现中具有明确定义的行为.联合类型双关在 C99 中是合法的(所以这应该适用于 C11 stdatomic),但它在 ISO C++11 中是 UB.但是,它在 C++ 的 GNU 方言中是合法的(受 gcc、clang 和 ICC 等支持).

Make sure your compiler guarantees that accessing one member of a union after writing a different union member has well-defined behaviour in that implementation. Union type-punning is legal in in C99 (so this should work well with C11 stdatomic), but it's UB in ISO C++11. However, it's legal in the GNU dialect of C++ (supported by gcc, clang, and ICC, among others).

确保您的对象是 16B 对齐的,或 32 位指针的 8B 对齐.更一般地说, alignas(2*sizeof(void*)) 应该可以工作.未对齐的 locked 指令在 x86 上可能非常,特别是当它们跨越缓存线边界时.如果对象未对齐,clang3.8 甚至会将其编译为库调用.

Make sure your object is 16B-aligned, or 8B-aligned for 32-bit pointers. More generally, alignas(2*sizeof(void*)) should work. Misaligned locked instructions can be very slow on x86, especially if they cross a cache-line boundary. clang3.8 even compiles it to a library call if the object isn't aligned.

使用 -mcx16 编译 x86-64 版本.cmpxchg16b 不被最早的 x86-64 CPU (AMD K8) 支持,但之后的一切都应该支持.如果没有 -mcx16,您会得到一个库函数调用(可能使用全局锁).32 位等效项 cmpxchg8b 已经足够老,现代编译器假定支持它.(并且可以使用 SSE、MMX 甚至 x87 来执行 64b 原子加载/存储,因此在读取一个成员时使用联合对于获得良好性能来说不太重要).

Compile with -mcx16 for x86-64 builds. cmpxchg16b was not supported by the earliest x86-64 CPUs (AMD K8), but should be on everything after that. Without -mcx16, you get a library function call (which probably uses a global lock). The 32-bit equivalent, cmpxchg8b, is old enough that modern compilers assume its support. (And can use SSE, MMX, or even x87 to do 64b atomic loads/stores, so using a union is somewhat less important for good performance when reading one member).

确保指针+uintptr_t 原子对象是无锁的.这对于 x32 和 32 位 ABI(8B 对象)几乎是有保证的,但对于 16B 对象则不然.例如MSVC 使用 x86-64 的锁.

Make sure that a pointer+uintptr_t atomic object is lock-free. This is pretty much guaranteed for x32 and 32-bit ABIs (8B object), but not for 16B objects. e.g. MSVC uses a lock for x86-64.

gcc7 及更高版本将调用 libatomic 而不是内联 lock cmpxchg16b,并且将从 atomic_is_lock_free ( 的原因包括它太慢了,这不是用户期望 is_lock_free 的意思),但是至少现在,libatomic 实现仍然在该指令可用的目标上使用 lock cmpxchg16b.(它甚至可以为只读原子对象出现段错误,所以它真的不理想.更重要的是,读取器与其他读取器争夺对缓存行的独占访问.这就是为什么我们要如此麻烦地避免 lockcmpxchg16b 当我们只需要一个 8 字节的一半时,这里的读取端.)

gcc7 and later will call libatomic instead of inlining lock cmpxchg16b, and will return false from atomic_is_lock_free (for reasons including that it's so slow it's not what users expect is_lock_free to mean), but at least for now the libatomic implementation still uses lock cmpxchg16b on targets where that instruction is available. (It can even segfault for read-only atomic objects, so it's really not ideal. More importantly, readers contend with other readers for exclusive access to the cache line. That's why we're going to so much trouble to avoid lock cmpxchg16b for the read side here when we only want one 8-byte half.)

这是一个带有 CAS 重试循环的代码示例,它编译为看起来正确的 asm,并且我认为对于允许联合类型双关语的实现,没有 UB 或其他不安全的 C++.它是用 C 风格编写的(非成员函数等),但如果你编写成员函数也一样.

Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++ for implementations that allow union type punning. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.

请参阅与ASM输出的来自 Godbolt 编译器浏览器上的 gcc6.3.对于 -m32,它使用 cmpxchg8b 的方式与 64 位代码使用 cmpxchg16b 的方式相同.使用 -mx32(长模式下的 32 位指针),它可以简单地使用 64 位 cmpxchg 和普通的 64 位整数加载来获取一个原子中的两个成员加载.

See the code with asm output from gcc6.3 on the Godbolt compiler explorer. With -m32, it uses cmpxchg8b the same way 64-bit code uses cmpxchg16b. With -mx32 (32-bit pointers in long mode) it can simply use a 64-bit cmpxchg, and normal 64-bit integer loads to grab both members in one atomic load.

这是可移植的 C++11(联合类型双关除外),没有任何特定于 x86 的内容.但是,它仅在可以 CAS 两个指针大小的对象的目标上高效.它编译为对 ARM/ARM64 和 MIPS64 的 __atomic_compare_exchange_16 库函数的调用,正如您在 Godbolt 上看到的那样.

This is portable C++11 (except the union type-punning), with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16 library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.

它不会在 MSVC 上编译,其中 atomic 大于 counted_ptr_separate,所以 static_assert 会捕获它.据推测,MSVC 在原子对象中包含一个锁成员.

It doesn't compile on MSVC, where atomic<counted_ptr> is larger than counted_ptr_separate, so the static_assert catches it. Presumably MSVC includes a lock member in the atomic object.

#include <atomic>
#include <stdint.h>

using namespace std;

struct node {
  // This alignas is essential for clang to use cmpxchg16b instead of a function call
  // Apparently just having it on the union member isn't enough.
  struct alignas(2*sizeof(node*)) counted_ptr {
    node *    ptr;
    uintptr_t count;  // use pointer-sized integers to avoid padding
  };
  
  // hack to allow reading just the pointer without lock-cmpxchg16b,
  // but still without any C++ data race
  struct counted_ptr_separate {
    atomic<node *>    ptr;
    atomic<uintptr_t> count_separate;  // var name emphasizes that accessing this way isn't atomic with ptr
  };

  static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
  //static_assert(std::atomic<counted_ptr>{}.is_lock_free());
  
  union {  // anonymous union: the members are directly part of struct node
    alignas(2*sizeof(node*)) atomic<counted_ptr> next_and_count;
    counted_ptr_separate  next;
  };
  // TODO: write member functions to read next.ptr or read/write next_and_count

  int data[4];
};


// make sure read-only access is efficient.
node *follow(node *p) {   // good asm, just a mov load
  return p->next.ptr.load(memory_order_acquire);
}
node *follow_nounion(node *p) {  // really bad asm, using cmpxchg16b to load the whole thing
  return p->next_and_count.load(memory_order_acquire).ptr;
}


void update_next(node &target, node *desired)
{
  // read the old value efficiently to avoid overhead for the no-contention case
  // tearing (or stale data from a relaxed load) will just lead to a retry
  node::counted_ptr expected = {
      target.next.ptr.load(memory_order_relaxed),
      target.next.count_separate.load(memory_order_relaxed) };

  bool success;
  do {
    node::counted_ptr newval = { desired, expected.count + 1 };
    // x86-64: compiles to cmpxchg16b
    success = target.next_and_count.compare_exchange_weak(
                           expected, newval, memory_order_acq_rel);
    // updates exected on failure
  } while( !success );
}

clang 4.0 -O3 -mcx16 的 asm 输出是:

The asm output from clang 4.0 -O3 -mcx16 is:

update_next(node&, node*):
    push    rbx             # cmpxchg16b uses rbx implicitly so it has to be saved/restored
    mov     rbx, rsi
    mov     rax, qword ptr [rdi]     # load the pointer
    mov     rdx, qword ptr [rdi + 8] # load the counter
.LBB2_1:                        # =>This Inner Loop Header: Depth=1
    lea     rcx, [rdx + 1]
    lock
    cmpxchg16b      xmmword ptr [rdi]
    jne     .LBB2_1
    pop     rbx
    ret

gcc 做了一些笨重的存储/重新加载,但基本上是相同的逻辑.

gcc does some clunky store/reloads, but is basically the same logic.

follow(node*) 编译为 mov rax, [rdi]/ret,因此对指针的只读访问为由于工会黑客,它应该很便宜.

follow(node*) compiles to mov rax, [rdi] / ret, so read-only access to the pointer is as cheap as it should be, thanks to the union hack.

它确实依赖于通过一个成员编写联合并通过另一个成员读取它,以便在不使用 lock cmpxchg16b 的情况下高效读取指针.这保证在 GNU C++(和 ISO C99/C11)中有效,但在 ISO C++ 中无效.许多其他 C++ 编译器确实保证联合类型双关有效,但即使没有它它也可能仍然有效:我们总是使用 std::atomic 加载,它必须假设值是异步修改的.所以我们应该避免类似别名的问题,即在通过另一个指针(或联合成员)写入值后,寄存器中的值仍然被认为是有效的.不过,编译器认为是独立的东西的编译时重新排序可能是一个问题.

It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b. This is guaranteed to work in GNU C++ (and ISO C99/C11), but not in ISO C++. Many other C++ compilers do guarantee that union type-punning works, but even without that it would probably still work: We're always using std::atomic loads which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member). Compile-time reordering of things the compiler thinks are independent could be a problem, though.

在指针 + 计数器的原子 cmpxchg 之后原子地读取指针仍然应该为您提供 x86 上的获取/释放语义,但我认为 ISO C++ 对此没有任何说明.猜测广泛的发布存储(作为 compare_exchange_weak 的一部分将与大多数架构上来自同一地址的更窄的负载同步(就像在 x86 上一样),但是 AFAIK C++ std::atomic 不保证任何类型双关.

Atomically reading just the pointer after an atomic cmpxchg of the pointer+counter should still give you acquire/release semantics on x86, but I don't think ISO C++ says anything about it. I would guess that a wide release-store (as part of the compare_exchange_weak will synchronize with a narrower load from the same address on most architectures (like it does on x86), but AFAIK C++ std::atomic doesn't guarantee anything about type-punning.

与指针 + ABA 计数器无关,但可以用于其他使用联合以允许访问更大原子对象的子集的应用程序:不要使用联合来允许原子存储仅指向指针或者只是柜台.至少如果您关心与该对的获取负载同步,则不会.即使是强序 x86 也可以 重新排序一个狭窄的商店,用更宽的负载完全包含它.一切仍然是原子的,但就内存排序而言,你进入了奇怪的领域.

Not relevant for pointer + ABA-counter, but could come up for other applications of using a union to allow access to subsets of larger atomic object: Don't use the union to allow atomic stores to just the pointer or just the counter. At least not if you care about synchronization with an acquire-load of the pair. Even strongly-ordered x86 can reorder a narrow store with a wider load that fully contains it. Everything is still atomic, but you get into weird territory as far as memory-ordering goes.

在 x86-64 上,原子 16B 加载需要 lock cmpxchg16b(这是一个完整的内存屏障,防止前面的窄存储在它之后变得全局可见).但是,如果您将其与 32 位指针(或 32 位数组索引)一起使用,则很容易遇到问题,因为可以使用常规 64b 加载来加载两半.如果您需要与其他线程同步而不仅仅是原子性,我不知道您会在其他架构上看到什么问题.

On x86-64, an atomic 16B load requires a lock cmpxchg16b (which is a full memory barrier, preventing a preceding narrow store from becoming globally visible after it). But you could easily have a problem if you were using this with 32-bit pointers (or 32-bit array indices), since both halves could be loaded with a regular 64b load. And I have no idea what problems you could see on other architectures, if you need synchronization with other threads and not just atomicity.

要了解有关 std::memory_order 获取和释放的更多信息,请参阅 Jeff Preshing 的优秀文章.

相关文章