VS: unexpected optimization behavior with the _BitScanReverse64 intrinsic

The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no bit is set. Citing MSDN, the return value is "Nonzero if Index was set, or 0 if no set bits were found."

If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.

#include <iostream>
#include <cassert>
#include <intrin.h>   // declares _BitScanReverse64 (MSVC, x64 targets)

using namespace std;

int main()
{
  unsigned long index = 0;
  _BitScanReverse64(&index, 0x0ull);

  cout << index << endl;

  assert(index == 0);

  return 0;
}

Is this the intended behaviour? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left cout in so that the variable index is not removed during optimization.) Also, is there an efficient workaround, or should I just not use this compiler intrinsic directly?

Recommended answer

AFAICT, the intrinsic leaves garbage in index when the input is zero, weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.

unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)
Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting the ZF in that case. But AMD documents it as leaving the destination unchanged:

AMD's BSF entry in the AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions

... If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ...

On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you'd get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.

IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.

At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.

Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr/bsf. (Why does breaking the "output dependency" of LZCNT matter?)

Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately on Intel hardware, the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).

You need to check the intrinsic's return value to make sure index is valid, unless you know the input was non-zero. e.g.

if(_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1;     // might make sense, one lower than the result for bsr(1)
}
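
If you'd rather not repeat that check at every call site, you could fold it into a small helper. This is only a sketch: the names highest_set_bit and highest_set_bit_nonzero are made up for illustration, and the non-MSVC branch assumes GCC or Clang, where __builtin_clzll stands in for the intrinsic so the same code builds elsewhere.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>   // _BitScanReverse64 (x64 targets only)
#endif

// Hypothetical helper: index of the highest set bit, or -1 when x == 0.
inline int highest_set_bit(std::uint64_t x)
{
#if defined(_MSC_VER)
    unsigned long idx;                 // output-only, so no initializer is needed
    if (_BitScanReverse64(&idx, x))    // non-zero return: idx was set
        return static_cast<int>(idx);
    return -1;                         // zero input: idx holds garbage, ignore it
#else
    // Assumption: GCC/Clang. __builtin_clzll(0) is undefined, so test x first.
    return x ? 63 - __builtin_clzll(x) : -1;
#endif
}

// Hypothetical variant for callers that already know x != 0: no zero check.
inline int highest_set_bit_nonzero(std::uint64_t x)
{
    assert(x != 0);                    // precondition, checked in debug builds only
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanReverse64(&idx, x);        // return value is known to be non-zero here
    return static_cast<int>(idx);
#else
    return 63 - __builtin_clzll(x);
#endif
}
```

With this, highest_set_bit(0) gives -1 and highest_set_bit(1) gives 0, so the sentinel sits one below the smallest valid index. The value never comes from whatever the intrinsic left in idx for a zero input.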

If you want to avoid this extra check branch, you can use the lzcnt instruction via different intrinsics if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer IIRC). It "works" even when the input is all-zero, and actually counts leading zeros instead of returning the index of the highest set bit.
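
As a rough sketch of that approach (the helper names leading_zeros and highest_set_bit_via_lzcnt are hypothetical): on MSVC the lzcnt instruction is reachable through __lzcnt64 from <intrin.h>. The non-MSVC branch assumes GCC or Clang and emulates the same results with __builtin_clzll so the example builds elsewhere. Note that on CPUs without LZCNT support the same encoding executes as bsr and gives different results, so this assumes you have checked the target hardware.

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>   // __lzcnt64 (x64 targets only)
#endif

// Hypothetical helper: number of leading zero bits; 64 when x == 0.
inline unsigned leading_zeros(std::uint64_t x)
{
#if defined(_MSC_VER)
    // Assumption: the CPU supports LZCNT; otherwise this runs as bsr.
    return static_cast<unsigned>(__lzcnt64(x));
#else
    // Assumption: GCC/Clang. Stand-in that matches lzcnt's results,
    // including 64 for a zero input (__builtin_clzll(0) is undefined).
    return x ? static_cast<unsigned>(__builtin_clzll(x)) : 64u;
#endif
}

// Converts the leading-zero count to a bit index. No separate validity flag
// is needed: a zero input yields 63 - 64 = -1.
inline int highest_set_bit_via_lzcnt(std::uint64_t x)
{
    return 63 - static_cast<int>(leading_zeros(x));
}
```

The subtraction is the only extra work compared with bsr, and a zero input naturally maps to -1, the same sentinel used in the checked version above.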
