如果我针对大小而不是速度进行优化,为什么 GCC 生成的代码速度会快 15-20%?

我在 2009 年第一次注意到 GCC(至少在我的项目和我的机器上)如果我针对大小(-Os) 而不是速度(-O2-O3),我一直想知道为什么.

I first noticed in 2009 that GCC (at least on my projects and on my machines) have the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

我已经设法创建(相当愚蠢的)代码来显示这种令人惊讶的行为,并且代码足够小,可以在此处发布.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

如果我用 -Os 编译它,执行这个程序需要 0.38 秒,如果用 -O2 编译需要 0.44 秒代码>-O3.这些时间是一致获得的,几乎没有噪音(gcc 4.7.2、x86_64 GNU/Linux、英特尔酷睿 i5-3320M).

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).

(更新:我已将所有汇编代码移至 GitHub:他们制作了由于 fno-align-* 标志具有相同的效果,因此帖子臃肿并且显然为问题增加的价值很小.)

(Update: I have moved all assembly code to GitHub: They made the post bloated and apparently add very little value to the questions as the fno-align-* flags have the same effect.)

这是使用 -Os-O2.

不幸的是,我对程序集的理解非常有限,所以我不知道我接下来所做的是否正确:我获取了 -O2 的程序集并将其所有差异合并到 -O2 的程序集中code>-Os 除了 .p2align 行,结果 此处.此代码仍在 0.38 秒内运行,唯一的区别是 .p2align 东西.

Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines, result here. This code still runs in 0.38s and the only difference is the .p2align stuff.

如果我猜对了,这些是用于堆栈对齐的填充.根据 为什么 GCC pad 与 NOP 一起工作? 这样做是希望代码运行得更快,但显然这就我而言,优化适得其反.

If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

在这种情况下,填充是罪魁祸首吗?为什么以及如何?

它产生的噪音几乎使时序微优化变得不可能.

The noise it makes pretty much makes timing micro-optimizations impossible.

当我对 C 或 C++ 源代码进行微优化(与堆栈对齐无关)时,如何确保这种偶然的幸运/不幸对齐不会干扰?

更新:

根据 Pascal Cuoq 的回答,我对对齐进行了一些修改.通过将 -O2 -fno-align-functions -fno-align-loops 传递给 gcc,所有 .p2align 从程序集中消失,生成的可执行文件在 0.38 秒内运行.根据 gcc 文档:

Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:

-Os 启用所有 -O2 优化 [但] -Os 禁用以下优化标志:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops
  -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
  -fprefetch-loop-arrays

所以,这似乎是一个(错误)对齐问题.

我仍然对 Marat Dukhan 的回答中建议的 -march=native 持怀疑态度.我不相信它不仅仅是在干扰这个(错误)对齐问题;它对我的机器完全没有影响.(尽管如此,我赞成他的回答.)

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)

更新 2:

我们可以把图片中的-Os去掉.下面的时间是用

We can take -Os out of the picture. The following times are obtained by compiling with

  • -O2 -fno-omit-frame-pointer 0.37s

-O2 -fno-align-functions -fno-align-loops 0.37s

-S -O2 然后在 work() 0.37s

-S -O2 then manually moving the assembly of add() after work() 0.37s

-O2 0.44s

在我看来,add() 与调用站点的距离很重要.我已经尝试过 perf,但是 perf statperf report 的输出对我来说意义不大.但是,我只能从中得到一个一致的结果:

It looks like to me the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:

-O2:

 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
 [...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
       |       return x + y;
100.00 |     lea    (%rdi,%rsi,1),%eax
       |   }
       |   ? retq
[...]
       |            int z = add(x, y);
  1.93 |    ? callq  add(int const&, int const&) [clone .isra.0]
       |            sum += z;
 79.79 |      add    %eax,%ebx

对于fno-align-*:

 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
 [...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
       |       return x + y;
 51.59 |     lea    (%rdi,%rsi,1),%eax
       |   }
[...]
       |    __attribute__((noinline))
       |    static int work(int xval, int yval) {
       |        int sum(0);
       |        for (int i=0; i<LOOP_BOUND; ++i) {
       |            int x(xval+sum);
  8.20 |      lea    0x0(%r13,%rbx,1),%edi
       |            int y(yval+sum);
       |            int z = add(x, y);
 35.34 |    ? callq  add(int const&, int const&) [clone .isra.0]
       |            sum += z;
 39.48 |      add    %eax,%ebx
       |    }

对于-fno-omit-frame-pointer:

 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
 [...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]                                                                                     |
 24.46%  a.out  a.out              [.] work(int, int)
 [...]
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
 18.67 |     push   %rbp
       |       return x + y;
 18.49 |     lea    (%rdi,%rsi,1),%eax
       |   const int LOOP_BOUND = 200000000;
       |
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
       |     mov    %rsp,%rbp
       |       return x + y;
       |   }
 12.71 |     pop    %rbp
       |   ? retq
 [...]
       |            int z = add(x, y);
       |    ? callq  add(int const&, int const&) [clone .isra.0]
       |            sum += z;
 29.83 |      add    %eax,%ebx

在缓慢的情况下,我们似乎在对 add() 的调用上拖延.

It looks like we are stalling on the call to add() in the slow case.

我已经检查了perf -e 可以在我的机器上吐出的一切;不仅仅是上面给出的统计数据.

I have examined everything that perf -e can spit out on my machine; not just the stats that are given above.

对于同一个可执行文件,stalled-cycles-frontend 与执行时间呈线性相关;我没有注意到任何其他相关性如此明显的东西.(比较不同的可执行文件的 stalled-cycles-frontend 对我来说没有意义.)

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)

我在第一条评论中包含了缓存未命中.我检查了可以通过 perf 在我的机器上测量的所有缓存未命中,而不仅仅是上面给出的那些.缓存未命中非常嘈杂,并且与执行时间几乎没有相关性.

I included the cache misses as it came up as the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very very noisy and show little to no correlation with the execution times.

推荐答案

我的同事帮我找到了一个合理的答案.他注意到 256 字节边界的重要性.他没有在这里注册,并鼓励我自己发布答案(并带走所有的名气).

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).

简答:

在这种情况下,填充是罪魁祸首吗?为什么以及如何?

Is it the padding that is the culprit in this case? Why and how?

这一切都归结为对齐.对齐会对性能产生重大影响,这就是我们首先使用 -falign-* 标志的原因.

It all boils down to alignment. Alignments can have a significant impact on the performance, that is why we have the -falign-* flags in the first place.

我已向 gcc 开发人员提交了一份(伪造的?)错误报告.事实证明,默认行为是我们默认将循环对齐到 8 字节,但如果我们不需要填充超过 10 个字节,则尝试将其对齐到 16 字节." 显然,这个默认值在这种特殊情况下和我的机器上,这不是最佳选择.带有 -O3 的 Clang 3.4 (trunk) 进行了适当的对齐,生成的代码没有显示这种奇怪的行为.

I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.

当然,如果进行了不适当的对齐,事情就会变得更糟.不必要的/错误的对齐只会无缘无故地消耗字节,并可能增加缓存未命中等.

Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.

它产生的噪音几乎使时间微优化不可能.

The noise it makes pretty much makes timing micro-optimizations impossible.

我怎样才能确保这种偶然的幸运/不幸对齐当我进行微优化时不干扰(与堆栈无关对齐)在 C 或 C++ 源代码上?

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source codes?

简单地告诉 gcc 做正确的对齐:

g++ -O2 -falign-functions=16 -falign-loops=16

长答案:

如果出现以下情况,代码会运行得更慢:

The code will run slower if:

  • 一个XX字节边界在中间切割add()(XX依赖于机器).

  • an XX byte boundary cuts add() in the middle (XX being machine dependent).

如果对 add() 的调用必须跳过 XX 字节边界并且目标未对齐.

if the call to add() has to jump over an XX byte boundary and the target is not aligned.

如果 add() 没有对齐.

如果循环未对齐.

前 2 个在 Marat Dukhan 善意发布的代码和结果中清晰可见.在这种情况下,gcc-4.8.1 -Os(在 0.994 秒内执行):

The first 2 are beautifully visible on the codes and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3   

一个 256 字节的边界在中间切割 add() 并且 add() 和循环都没有对齐.惊喜,惊喜,这是最慢的情况!

a 256 byte boundary cuts add() right in the middle and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!

如果 gcc-4.7.3 -Os(在 0.822 秒内执行),256 字节边界只会切入冷段(但既不是循环,也不是 add() 被剪切):

In case gcc-4.7.3 -Os (executes in 0.822 secs), the 256 byte boundary only cuts into a cold section (but neither the loop, nor add() is cut):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

没有对齐,对 add() 的调用必须跳过 256 字节边界.这段代码是第二慢的.

Nothing is aligned, and the call to add() has to jump over the 256 byte boundary. This code is the second slowest.

如果 gcc-4.6.4 -Os(在 0.709 秒内执行),虽然没有对齐,但对 add() 的调用不必跳转超过 256 字节边界,目标正好在 32 字节之外:

In case gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256 byte boundary and the target is exactly 32 byte away:

  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

这是所有三个中最快的.为什么 256 字节的边界在他的机器上很特别,我会让他自己去弄清楚.我没有这样的处理器.

This is the fastest of all three. Why the 256 byte boundary is speacial on his machine, I will leave it up to him to figure it out. I don't have such a processor.

现在,在我的机器上我没有得到这个 256 字节的边界效果.只有函数和循环对齐在我的机器上起作用.如果我通过 g++ -O2 -falign-functions=16 -falign-loops=16 那么一切都会恢复正常:我总是得到最快的情况并且时间对 不敏感-fno-omit-frame-pointer 标志了.我可以传递 g++ -O2 -falign-functions=32 -falign-loops=32 或任何 16 的倍数,代码对此也不敏感.

Now, on my machine I don't get this 256 byte boundary effect. Only the function and the loop alignment kicks in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiples of 16, the code is not sensitive to that either.

我在 2009 年第一次注意到 gcc(至少在我的项目和我的机器)有生成明显更快的代码的趋势,如果我优化大小(-Os)而不是速度(-O2 或 -O3),我已经一直想知道为什么.

I first noticed in 2009 that gcc (at least on my projects and on my machines) have the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3) and I have been wondering ever since why.

一个可能的解释是我有热点对对齐很敏感,就像这个例子中的那个.通过混淆标志(传递 -Os 而不是 -O2),这些热点偶然以幸运的方式对齐,代码变得更快.这与优化大小无关:热点对齐得更好完全是偶然的.从现在开始,我将检查对齐对我的项目的影响.

A likely explanation is that I had hotspots which were sensitive to the alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: These were by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.

哦,还有一件事.这样的热点是如何产生的,就像示例中所示的那样?像 add() 这样的小函数的内联怎么会失败?

Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add() fail?

考虑一下:

// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}

并在一个单独的文件中:

and in a separate file:

// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

并编译为:g++ -O2 add.cpp main.cpp.

     gcc 不会内联 add()

      gcc won't inline add()!

仅此而已,很容易在无意中创建像 OP 中的热点那样的热点.当然部分是我的错:gcc 是一个优秀的编译器. 如果将上述编译为:g++ -O2 -flto add.cpp main.cpp,即 如果我执行链接时间优化,代码运行时间为 0.19s!

That's all, it's that easy to unintendedly create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If compile the above as: g++ -O2 -flto add.cpp main.cpp, that is, if I perform link time optimization, the code runs in 0.19s!

(内联在 OP 中被人为禁用,因此,OP 中的代码慢了 2 倍).

(Inlining is artificially disabled in the OP, hence, the code in the OP was 2x slower).

相关文章