gcc optimization flag -O3 makes code slower than -O2
I found this topic: Why is it faster to process a sorted array than an unsorted array? and tried to run the code from it. I noticed some strange behavior: if I compile the code with the -O3 optimization flag, it takes 2.98605 sec to run, but if I compile with -O2, it takes 1.98093 sec. I ran the code several times (5 or 6) on the same machine in the same environment, with all other software closed (Chrome, Skype, etc.).
gcc --version
gcc (Ubuntu 4.9.2-0ubuntu1~14.04) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
So please can you explain why this happens? I read the gcc manual and saw that -O3 includes -O2. Thanks for your help.
P.S. Here is the code:
#include <algorithm>
#include <cstdlib>   // for std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
Solution
gcc -O3
uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov
(which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables. See also the x86 tag wiki). This is one of the cases where cmov
sucks.
If the data was even moderately unpredictable, cmov
would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)
I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).
.L82: # the inner loop from gcc -O3
movsx rcx, DWORD PTR [rdx] # sign-extending load of data[c]
mov rsi, rcx
add rcx, rbx # rcx = sum+data[c]
cmp esi, 127
cmovg rbx, rcx # sum = data[c]>127 ? rcx : sum
add rdx, 4 # pointer-increment
cmp r12, rdx
jne .L82
gcc could have saved the MOV by using LEA instead of ADD.
The loop bottlenecks on the latency of ADD->CMOV (3 cycles), since one iteration of the loop writes rbx with CMOV, and the next iteration reads rbx with ADD.
The loop only contains 8 fused-domain uops, so it can issue at one per 2 cycles. Execution-port pressure is also not as bad a bottleneck as the latency of the sum
dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).
BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0);
to take the cmov
out of the loop-carried dep chain is potentially useful. Still lots of instructions, but the cmov
in each iteration is independent. This compiles as expected in gcc6.3 -O2
and earlier, but gcc7 de-optimizes into a cmov
on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if()
way of writing it.)
Clang takes the cmov off the critical path even with the original source.
gcc -O2
uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add
(1 cycle latency).
The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.
.L83: # The inner loop from gcc -O2
movsx rcx, DWORD PTR [rdx] # load with sign-extension from int32 to int64
cmp ecx, 127
jle .L82 # conditional-jump over the next instruction
add rbp, rcx # sum+=data[c]
.L82:
add rdx, 4
cmp rbx, rdx
jne .L83
There are two loop-carried dependency chains: sum
and the loop-counter. sum
is 0 or 1 cycle long, and the loop-counter is always 1 cycle long. However, the loop is 5 fused-domain uops on Sandybridge, so it can't execute at 1c per iteration anyway, so latency isn't a bottleneck.
It probably runs at about one iteration per 2 cycles (bottlenecked on branch instruction throughput), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port).
This pipeline-analysis prediction matches pretty much exactly with your timings of ~3 sec for -O3 vs. ~2 sec for -O2.
Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly less since a 5 uop loop doesn't quite issue at 4 uops every cycle).
(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)
g++6.3.1 uses cmov
even at -O2
, but g++5.4 still behaves like 4.9.2.
With both g++6.3.1 and g++5.4, using -fprofile-generate
/ -fprofile-use
produces the branchy version even at -O3
(with -fno-tree-vectorize
).
The CMOV version of the loop from newer gcc uses add ecx,-128
/ cmovge rbx,rdx
instead of CMP/CMOV. That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.
Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version
in 0.74s, so about twice as fast as scalar. Or -O3 -march=native
in 0.35s, using AVX2 256b vectors).
The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd
twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.
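What the auto-vectorized loop does can be sketched with SSE2 intrinsics (a hand-written illustration of the same idea, not gcc's actual output; it uses the zero-extension trick described above, which is valid because the masked elements are all non-negative):

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

long long sum_ge128_sse2(const int* data, std::size_t n) {
    __m128i vsum   = _mm_setzero_si128();       // two 64-bit accumulators
    const __m128i thresh = _mm_set1_epi32(127);
    const __m128i zero   = _mm_setzero_si128();
    std::size_t c = 0;
    for (; c + 4 <= n; c += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i*)(data + c));
        __m128i mask = _mm_cmpgt_epi32(v, thresh);   // v > 127  <=>  v >= 128
        __m128i kept = _mm_and_si128(v, mask);       // zero out elements < 128
        // Widen 4x int32 -> 2x int64 + 2x int64.  The kept elements are all
        // >= 128 or zero, i.e. non-negative, so interleaving with zero
        // (zero-extension) is enough; no sign-extension needed.
        vsum = _mm_add_epi64(vsum, _mm_unpacklo_epi32(kept, zero));
        vsum = _mm_add_epi64(vsum, _mm_unpackhi_epi32(kept, zero));
    }
    long long lanes[2];
    _mm_storeu_si128((__m128i*)lanes, vsum);
    long long sum = lanes[0] + lanes[1];
    for (; c < n; ++c)                // scalar tail for n not a multiple of 4
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}
```

gcc's version is longer because, as noted, it sign-extends (and compares) twice instead of exploiting that the condition already zeroed the negative lanes.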