如何在 gcc 和 VS 中强制使用 cmov

我有这个简单的二分查找成员函数,其中 lastIndexnIterxi 是类成员:

I have this simple binary search member function, where lastIndex, nIter and xi are class members:

uint32 scalar(float z) const
{
    uint32 lo = 0;
    uint32 hi = lastIndex;
    uint32 n = nIter;
    while (n--) {
        int mid = (hi + lo) >> 1;
        // defining this if-else assignment as below cause VS2015
        // to generate two cmov instructions instead of a branch
        if( z < xi[mid] ) 
            hi = mid;
        if ( !(z < xi[mid]) )
            lo = mid;
    }
    return lo;
}

gcc 和 VS 2015 都用代码流分支来翻译内循环:

Both gcc and VS 2015 translate the inner loop with a code flow branch:

000000013F0AA778  movss       xmm0,dword ptr [r9+rax*4]  
000000013F0AA77E  comiss      xmm0,xmm1  
000000013F0AA781  jbe         Tester::run+28h (013F0AA788h) 
000000013F0AA783  mov         r8d,ecx  
000000013F0AA786  jmp         Tester::run+2Ah (013F0AA78Ah)  
000000013F0AA788  mov         edx,ecx  
000000013F0AA78A  mov         ecx,r8d

有没有办法在不编写内联汇编程序的情况下说服他们使用 1 个 comiss 指令和 2 个 cmov 指令?

Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss instruction and 2 cmov instructions?

如果没有,有人可以建议如何为此编写 gcc 汇编程序模板吗?

If not, can anybody suggest how to write a gcc assembler template for this?

请注意,我知道存在二进制搜索算法的变体,编译器可以轻松生成无分支代码,但这不是问题.

Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.

谢谢

推荐答案

正如评论中所说,没有简单的方法可以强制您提出要求,尽管最近 (>4.4) 版本的 gcc 似乎已经对其进行了优化,例如你说.编辑:有趣的是,gcc 6 系列似乎使用了一个分支,不同于 gcc 5 和 gcc 7 系列,使用两个cmov.

As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov.

通常的 __builtin_expect 可能无法推动 gcc 使用 cmov,因为 cmov 在难以预测时通常很方便比较的结果,而 __builtin_expect 告诉编译器可能的结果是什么――所以你只会把它推向错误的方向.

The usual __builtin_expect probably cannot do much into pushing gcc to use cmov, given that cmov is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.

不过,如果你发现这个优化非常重要,你的编译器版本通常会出错,并且出于某种原因你无法使用 PGO 帮助它,相关的 gcc 程序集模板应该是这样的:

Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:

    __asm__ (
        "comiss %[xi_mid],%[z]
"
        "cmovb %[mid],%[hi]
"
        "cmovae %[mid],%[lo]
"
        : [hi] "+r"(hi), [lo] "+r"(lo)
        : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
        : "cc"
    );

使用的约束是:

  • hilo 进入写入"变量列表,使用 +r 约束作为 cmov 可以只使用寄存器作为目标操作数,我们有条件覆盖其中一个(我们不能使用 =,因为它意味着该值总是被覆盖,因此编译器可以自由地给我们一个与当前不同的目标寄存器,并在我们的 asm 块之后使用它来引用那个变量);
  • mid 在读取"列表中,rm as cmov 可以将寄存器或内存操作数作为输入值;
  • xi[mid]z 在读取"列表中;
    • z 有特殊的 x 约束,意思是任何 SSE 寄存器"(ucomiss 第一个操作数需要);
    • xi[mid]xm,因为第二个 ucomiss 操作数允许内存操作;考虑到 zxi[mid] 之间的选择,我选择最后一个作为直接从内存中提取的更好的候选者,因为 z> 已经在寄存器中(由于 System V 调用约定 - 无论如何都会在迭代之间缓存)并且 xi[mid] 仅用于此比较;
    • hi and lo are into the "write" variables list, with +r constraint as cmov can only work with registers as target operands, and we are conditionally overwriting just one of them (we cannot use =, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after our asm block);
    • mid is in the "read" list, rm as cmov can take either a register or a memory operand as input value;
    • xi[mid] and z are in the "read" list;
      • z has the special x constraint that means "any SSE register" (required for ucomiss first operand);
      • xi[mid] has xm, as the second ucomiss operand allows a memory operator; given the choice between z and xi[mid], I chose the last one as a better candidate for being taken directly from memory, given that z is already in a register (due to the System V calling convention - and is going to be cached between iterations anyway) and xi[mid] is used just in this comparison;

相关文章