如何在 gcc 和 VS 中强制使用 cmov
我有这个简单的二分查找成员函数,其中 lastIndex
、nIter
和 xi
是类成员:
I have this simple binary search member function, where lastIndex
, nIter
and xi
are class members:
uint32 scalar(float z) const
{
uint32 lo = 0;
uint32 hi = lastIndex;
uint32 n = nIter;
while (n--) {
int mid = (hi + lo) >> 1;
// defining this if-else assignment as below cause VS2015
// to generate two cmov instructions instead of a branch
if( z < xi[mid] )
hi = mid;
if ( !(z < xi[mid]) )
lo = mid;
}
return lo;
}
gcc 和 VS 2015 都用代码流分支来翻译内循环:
Both gcc and VS 2015 translate the inner loop with a code flow branch:
000000013F0AA778 movss xmm0,dword ptr [r9+rax*4]
000000013F0AA77E comiss xmm0,xmm1
000000013F0AA781 jbe Tester::run+28h (013F0AA788h)
000000013F0AA783 mov r8d,ecx
000000013F0AA786 jmp Tester::run+2Ah (013F0AA78Ah)
000000013F0AA788 mov edx,ecx
000000013F0AA78A mov ecx,r8d
有没有办法在不编写内联汇编程序的情况下说服他们使用 1 个 comiss
指令和 2 个 cmov
指令?
Is there a way, without writing assembler inline, to convince them to use exactly 1 comiss
instruction and 2 cmov
instructions?
如果没有,有人可以建议如何为此编写 gcc 汇编程序模板吗?
If not, can anybody suggest how to write a gcc assembler template for this?
请注意,我知道存在二进制搜索算法的变体,编译器可以轻松生成无分支代码,但这不是问题.
Please note that I am aware that there are variations of the binary search algorithm where it is easy for the compiler to generate branch free code, but this is beside the question.
谢谢
推荐答案
正如评论中所说,没有简单的方法可以强制您提出要求,尽管最近 (>4.4) 版本的 gcc 似乎已经对其进行了优化,例如你说.编辑:有趣的是,gcc 6 系列似乎使用了一个分支,不同于 gcc 5 和 gcc 7 系列,使用两个cmov
.
As said in the comments, there's no easy way to force what you are asking, although it seems that recent (>4.4) versions of gcc already optimize it like you said. Edit: interestingly, the gcc 6 series seems to use a branch, unlike both the gcc 5 and gcc 7 series, which use two cmov
.
通常的 __builtin_expect
可能无法推动 gcc 使用 cmov
,因为 cmov
在难以预测时通常很方便比较的结果,而 __builtin_expect
告诉编译器可能的结果是什么――所以你只会把它推向错误的方向.
The usual __builtin_expect
probably cannot do much into pushing gcc to use cmov
, given that cmov
is generally convenient when it's difficult to predict the result of a comparison, while __builtin_expect
tells the compiler what is the likely outcome - so you would be just pushing it in the wrong direction.
不过,如果你发现这个优化非常重要,你的编译器版本通常会出错,并且出于某种原因你无法使用 PGO 帮助它,相关的 gcc 程序集模板应该是这样的:
Still, if you find that this optimization is extremely important, your compiler version typically gets it wrong and for some reason you cannot help it with PGO, the relevant gcc assembly template should be something like:
__asm__ (
"comiss %[xi_mid],%[z]
"
"cmovb %[mid],%[hi]
"
"cmovae %[mid],%[lo]
"
: [hi] "+r"(hi), [lo] "+r"(lo)
: [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
: "cc"
);
使用的约束是:
hi
和lo
进入写入"变量列表,使用+r
约束作为cmov
可以只使用寄存器作为目标操作数,我们有条件覆盖其中一个(我们不能使用=
,因为它意味着该值总是被覆盖,因此编译器可以自由地给我们一个与当前不同的目标寄存器,并在我们的asm
块之后使用它来引用那个变量);mid
在读取"列表中,rm
ascmov
可以将寄存器或内存操作数作为输入值;xi[mid]
和z
在读取"列表中;z
有特殊的x
约束,意思是任何 SSE 寄存器"(ucomiss
第一个操作数需要);xi[mid]
有xm
,因为第二个ucomiss
操作数允许内存操作;考虑到z
和xi[mid]
之间的选择,我选择最后一个作为直接从内存中提取的更好的候选者,因为z
> 已经在寄存器中(由于 System V 调用约定 - 无论如何都会在迭代之间缓存)并且xi[mid]
仅用于此比较;
hi
andlo
are into the "write" variables list, with+r
constraint ascmov
can only work with registers as target operands, and we are conditionally overwriting just one of them (we cannot use=
, as it implies that the value is always overwritten, so the compiler would be free to give us a different target register than the current one, and use it to refer to that variable after ourasm
block);mid
is in the "read" list,rm
ascmov
can take either a register or a memory operand as input value;xi[mid]
andz
are in the "read" list;z
has the specialx
constraint that means "any SSE register" (required forucomiss
first operand);xi[mid]
hasxm
, as the seconducomiss
operand allows a memory operator; given the choice betweenz
andxi[mid]
, I chose the last one as a better candidate for being taken directly from memory, given thatz
is already in a register (due to the System V calling convention - and is going to be cached between iterations anyway) andxi[mid]
is used just in this comparison;
相关文章