An expensive jump with GCC 5.4.0
I had a function which looked like this (showing only the important part):
double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY) {
...
    for (std::size_t i = std::max(0, -shift); i < max; i++) {
        if ((curr[i] < 479) && (l[i + shift] < 479)) {
            nontopOverlap++;
        }
...
    }
...
}
Written like this, the function took ~34ms on my machine. After changing the condition to bool multiplication (making the code look like this):
double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY) {
...
    for (std::size_t i = std::max(0, -shift); i < max; i++) {
        if ((curr[i] < 479) * (l[i + shift] < 479)) {
            nontopOverlap++;
        }
...
    }
...
}
the execution time decreased to ~19ms.
The compiler used was GCC 5.4.0 with -O3, and after checking the generated asm code on godbolt.org I found out that the first example generates a jump, while the second one does not. I decided to try GCC 6.2.0, which also generates a jump instruction when using the first example, but GCC 7 seems to not generate one anymore.
Finding out this way to speed up the code was rather gruesome and took quite some time. Why does the compiler behave this way? Is it intended, and is it something programmers should look out for? Are there any more things similar to this?
Accepted Answer
The logical AND operator (&&) uses short-circuit evaluation, which means that the second test is only done if the first comparison evaluates to true. This is often exactly the semantics that you require. For example, consider the following code:
if ((p != nullptr) && (p->first > 0))
You must ensure that the pointer is non-null before you dereference it. If this weren't short-circuit evaluation, you'd have undefined behavior because you'd be dereferencing a null pointer.
Short-circuit evaluation can also yield a performance gain in cases where evaluating the conditions is an expensive process. For example:
if (DoLengthyCheck1(p) && DoLengthyCheck2(p))
If DoLengthyCheck1 fails, there is no point in calling DoLengthyCheck2.
However, in the resulting binary, a short-circuit operation often results in two branches, since this is the easiest way for the compiler to preserve these semantics. (Which is why, on the other side of the coin, short-circuit evaluation can sometimes inhibit optimization potential.) You can see this by looking at the relevant portion of object code generated for your if statement by GCC 5.4:
movzx r13d, WORD PTR [rbp+rcx*2]
movzx eax, WORD PTR [rbx+rcx*2]
cmp r13w, 478 ; (curr[i] < 479)
ja .L5
cmp ax, 478 ; (l[i + shift] < 479)
ja .L5
add r8d, 1 ; nontopOverlap++
You see here the two comparisons (cmp instructions), each followed by a separate conditional jump/branch (ja, or jump if above).
It is a general rule of thumb that branches are slow and are therefore to be avoided in tight loops. This has been true on virtually all x86 processors, from the humble 8088 (whose slow fetch times and extremely small prefetch queue [comparable to an instruction cache], combined with utter lack of branch prediction, meant that taken branches required the cache to be dumped) to modern implementations (whose long pipelines make mispredicted branches similarly expensive). Note the little caveat that I slipped in there. Modern processors since the Pentium Pro have advanced branch prediction engines that are designed to minimize the cost of branches. If the direction of the branch can be properly predicted, the cost is minimal. Most of the time, this works well, but if you get into pathological cases where the branch predictor is not on your side, your code can get extremely slow. This is presumably where you are here, since you say that your array is unsorted.
You say that benchmarks confirmed that replacing the && with a * makes the code noticeably faster. The reason for this is evident when we compare the relevant portion of the object code:
movzx r13d, WORD PTR [rbp+rcx*2]
movzx eax, WORD PTR [rbx+rcx*2]
xor r15d, r15d ; (curr[i] < 479)
cmp r13w, 478
setbe r15b
xor r14d, r14d ; (l[i + shift] < 479)
cmp ax, 478
setbe r14b
imul r14d, r15d ; meld results of the two comparisons
cmp r14d, 1 ; nontopOverlap++
sbb r8d, -1
It is a bit counter-intuitive that this could be faster, since there are more instructions here, but that is how optimization works sometimes. You see the same comparisons (cmp) being done here, but now, each is preceded by an xor and followed by a setbe. The XOR is just a standard trick for clearing a register. The setbe is an x86 instruction that sets a byte based on the value of a flag, and is often used to implement branchless code. Here, setbe is the inverse of ja. It sets its destination register to 1 if the comparison was below-or-equal (since the register was pre-zeroed, it will be 0 otherwise), whereas ja branched if the comparison was above. Once these two values have been obtained in the r15b and r14b registers, they are multiplied together using imul. Multiplication was traditionally a relatively slow operation, but it is darn fast on modern processors, and this will be especially fast, because it's only multiplying two byte-sized values.
You could just as easily have replaced the multiplication with the bitwise AND operator (&), which does not do short-circuit evaluation. This makes the code much clearer, and is a pattern that compilers generally recognize. But when you do this with your code and compile it with GCC 5.4, it continues to emit the first branch:
movzx r13d, WORD PTR [rbp+rcx*2]
movzx eax, WORD PTR [rbx+rcx*2]
cmp r13w, 478 ; (curr[i] < 479)
ja .L4
cmp ax, 478 ; (l[i + shift] < 479)
setbe r14b
cmp r14d, 1 ; nontopOverlap++
sbb r8d, -1
There is no technical reason it had to emit the code this way, but for some reason, its internal heuristics are telling it that this is faster. It would probably be faster if the branch predictor were on your side, but it will likely be slower if branch prediction fails more often than it succeeds.
Newer generations of the compiler (and other compilers, like Clang) know this rule, and will sometimes use it to generate the same code that you would have sought by hand-optimizing. I regularly see Clang translate && expressions to the same code that would have been emitted if I'd used &. The following is the relevant output from GCC 6.2 with your code using the normal && operator:
movzx r13d, WORD PTR [rbp+rcx*2]
movzx eax, WORD PTR [rbx+rcx*2]
cmp r13d, 478 ; (curr[i] < 479)
jg .L7
xor r14d, r14d ; (l[i + shift] < 479)
cmp eax, 478
setle r14b
add esi, r14d ; nontopOverlap++
Note how clever this is! It is using signed conditions (jg and setle) as opposed to unsigned conditions (ja and setbe), but this isn't important. You can see that it still does the compare-and-branch for the first condition like the older version, and uses the same setCC instruction to generate branchless code for the second condition, but it has gotten a lot more efficient in how it does the increment. Instead of doing a second, redundant comparison to set the flags for an sbb operation, it uses the knowledge that r14d will be either 1 or 0 to simply unconditionally add this value to nontopOverlap. If r14d is 0, then the addition is a no-op; otherwise, it adds 1, exactly like it is supposed to do.
GCC 6.2 actually produces more efficient code when you use the short-circuiting && operator than the bitwise & operator:
movzx r13d, WORD PTR [rbp+rcx*2]
movzx eax, WORD PTR [rbx+rcx*2]
cmp r13d, 478 ; (curr[i] < 479)
jg .L6
cmp eax, 478 ; (l[i + shift] < 479)
setle r14b
cmp r14b, 1 ; nontopOverlap++
sbb esi, -1
The branch and the conditional set are still there, but now it reverts to the less-clever way of incrementing nontopOverlap. This is an important lesson in why you should be careful when trying to out-clever your compiler!
But if you can prove with benchmarks that the branching code is actually slower, then it may pay to try and out-clever your compiler. You just have to do so with careful inspection of the disassembly, and be prepared to re-evaluate your decisions when you upgrade to a later version of the compiler. For example, the code you have could be rewritten as:
nontopOverlap += ((curr[i] < 479) & (l[i + shift] < 479));
There is no if statement here at all, and the vast majority of compilers will never think about emitting branching code for this. GCC is no exception; all versions generate something akin to the following:
movzx r14d, WORD PTR [rbp+rcx*2]
movzx eax, WORD PTR [rbx+rcx*2]
cmp r14d, 478 ; (curr[i] < 479)
setle r15b
xor r13d, r13d ; (l[i + shift] < 479)
cmp eax, 478
setle r13b
and r13d, r15d ; meld results of the two comparisons
add esi, r13d ; nontopOverlap++
If you've been following along with the previous examples, this should look very familiar to you. Both comparisons are done in a branchless way, the intermediate results are ANDed together, and then this result (which will be either 0 or 1) is added to nontopOverlap. If you want branchless code, this will virtually ensure that you get it.
GCC 7 has gotten even smarter. It now generates virtually identical code (excepting some slight rearrangement of instructions) for the above trick as the original code. So, the answer to your question, "Why does the compiler behave this way?", is probably because they're not perfect! They try to use heuristics to generate the most optimal code possible, but they don't always make the best decisions. But at least they can get smarter over time!
One way of looking at this situation is that the branching code has the better best-case performance. If branch prediction is successful, skipping unnecessary operations will result in a slightly faster running time. However, branchless code has the better worst-case performance. If branch prediction fails, executing a few additional instructions as necessary to avoid a branch will definitely be faster than a mispredicted branch. Even the smartest and most clever of compilers will have a hard time making this choice.
And for your question of whether this is something programmers need to watch out for, the answer is almost certainly no, except in certain hot loops that you are trying to speed up via micro-optimizations. Then, you sit down with the disassembly and find ways to tweak it. And, as I said before, be prepared to revisit those decisions when you update to a newer version of the compiler, because it may either do something stupid with your tricky code, or it may have changed its optimization heuristics enough that you can go back to using your original code. Comment thoroughly!