Difference between rdtscp, rdtsc with a "memory" clobber, and cpuid/rdtsc?

2021-12-18 00:00:00 performance c assembly c++ rdtsc

Assume we're trying to use the TSC for performance monitoring and we want to prevent instruction reordering.

These are our options:

1: rdtscp is a serializing call. It prevents reordering around the call to rdtscp.

__asm__ __volatile__("rdtscp; "         // serializing read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc variable
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered

However, rdtscp is only available on newer CPUs, so on older ones we have to use rdtsc. But rdtsc is non-serializing, so using it alone will not prevent the CPU from reordering it.
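Whether rdtscp is available can be checked at run time. The following is a minimal sketch, assuming GCC or Clang on x86 and their <cpuid.h> helper header; the function name cpu_has_rdtscp is made up for illustration. CPUID reports RDTSCP support in extended leaf 0x80000001, EDX bit 27.

```c
#include <cpuid.h>    /* GCC/Clang helper header, x86 targets only */
#include <stdbool.h>

/* Hypothetical helper: returns true if CPUID reports RDTSCP support
   (extended leaf 0x80000001, EDX bit 27). */
bool cpu_has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;

    return ((edx >> 27) & 1u) != 0;
}
```

A caller could use the result once at startup to choose between an rdtscp-based and an rdtsc-based timestamp routine.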

So we can use either of these two options to prevent reordering:

2: This is a call to cpuid followed by rdtsc. cpuid is a serializing call.

volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp);                   // cpuid is a serialising call
dont_remove = tmp;                                // prevent optimizing out cpuid

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered

3: This is a call to rdtsc with "memory" in the clobber list, which prevents reordering.

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx", "memory"); // rcx and rdx are clobbered
                                                  // memory to prevent reordering

My understanding of the 3rd option is as follows:

Making the call __volatile__ prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However, it could still move it with respect to unrelated operations. So __volatile__ is not enough.

Tell the compiler that memory is being clobbered by adding "memory" to the clobber list. The "memory" clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder around it.
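To make the compiler-level effect of the "memory" clobber concrete, here is a small sketch, assuming GCC or Clang. The macro name COMPILER_BARRIER, the variable shared_flag and the function store_twice are made up for illustration. The empty asm emits no machine instruction at all; it only constrains what the compiler may do with memory accesses around it.

```c
/* Compiler-only barrier: generates no instruction, but the compiler must
   assume any memory may have been read or written by the asm. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int shared_flag;  /* hypothetical global, for illustration only */

int store_twice(void)
{
    shared_flag = 1;     /* may not be merged with the store below ...   */
    COMPILER_BARRIER();  /* ... because the asm might observe the value  */
    shared_flag = 2;
    COMPILER_BARRIER();
    return shared_flag;  /* must be reloaded from memory after the asm   */
}
```

Nothing here constrains the CPU: since no instruction is emitted for the barrier, the hardware sees only the ordinary stores and load.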

So my questions are:

  • 1: Is my understanding of __volatile__ and "memory" correct?
  • 2: Do the second two calls do the same thing?
  • 3: Using "memory" looks much simpler than using another serializing instruction. Why would anyone use the 3rd option over the 2nd option?

Recommended answer

As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. volatile and "memory" in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.

Processor barriers are special instructions that must be given explicitly, e.g. rdtscp, cpuid, memory fence instructions (mfence, lfence, ...) etc.

As an aside, while using cpuid as a barrier before rdtsc is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use one of the memory fence instructions.

The Linux kernel uses mfence;rdtsc on AMD platforms and lfence;rdtsc on Intel. If you don't want to bother distinguishing between them, mfence;rdtsc works on both, although it's slightly slower because mfence is a stronger barrier than lfence.

Edit 2019-11-25: As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See the commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f
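The lfence;rdtsc combination can be sketched as a single asm statement, so that the fence and the timestamp read cannot be separated by the compiler. This is only an illustration, assuming GCC or Clang on x86; the function name rdtsc_fenced is made up and this is not the kernel's actual code. It uses separate "=a" and "=d" outputs instead of the shl/or sequence, which lets the compiler combine the two halves and avoids hand-written register clobbers.

```c
#include <stdint.h>

/* Hypothetical helper: lfence acts as the processor barrier, while
   volatile and "memory" act as the compiler barrier. */
static inline uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("lfence\n\t"  /* wait for earlier instructions */
                         "rdtsc"       /* EDX:EAX = time-stamp counter  */
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");

    return ((uint64_t)hi << 32) | lo;
}
```

A typical use would be to call it before and after the code being measured and subtract the two values.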
