std::array with aggregate initialization on g++ generates huge code

On g++ 4.9.2 and 5.3.1, this code takes several seconds to compile and produces a 52,776 byte executable:

#include <array>
#include <iostream>

int main()
{
    constexpr std::size_t size = 4096;

    struct S
    {
        float f;
        S() : f(0.0f) {}
    };

    std::array<S, size> a = {};  // <-- note aggregate initialization

    for (auto& e : a)
        std::cerr << e.f;

    return 0;
}

Increasing size seems to increase compilation time and executable size linearly. I cannot reproduce this behaviour with either clang 3.5 or Visual C++ 2015. Using -Os makes no difference.

$ time g++ -O2 -std=c++11 test.cpp
real    0m4.178s
user    0m4.060s
sys     0m0.068s

Inspecting the assembly code reveals that the initialization of a is unrolled, generating 4096 movl instructions:

main:
.LFB1313:
    .cfi_startproc
    pushq   %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    subq    $16384, %rsp
    .cfi_def_cfa_offset 16400
    movl    $0x00000000, (%rsp)
    movl    $0x00000000, 4(%rsp)
    movq    %rsp, %rbx
    movl    $0x00000000, 8(%rsp)
    movl    $0x00000000, 12(%rsp)
    movl    $0x00000000, 16(%rsp)
       [...skipping 4000 lines...]
    movl    $0x00000000, 16376(%rsp)
    movl    $0x00000000, 16380(%rsp)

This only happens when the element type (S here) has a non-trivial constructor and the array is initialized using {}. If I do any of the following, g++ generates a simple loop:

  1. Remove S::S();
  2. Remove S::S() and initialize S::f in-class;
  3. Remove the aggregate initialization (= {});
  4. Compile without -O2.
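
Variants 2 and 3 from the list above can be put side by side in a small sketch. The names S_in_class, S_ctor, sum, sum_in_class_init, sum_default_init and check_variants are hypothetical, introduced only for illustration. The sketch shows only that both variants still produce zero-initialized elements; whether g++ emits a loop or unrolled stores for each one has to be checked in the generated assembly (for example with g++ -O2 -S).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t size = 4096;

// Variant 2: no user-provided constructor; f is initialized in-class.
struct S_in_class {
    float f = 0.0f;
};

// Variant 3 keeps the original constructor; "= {}" is dropped at the use site.
struct S_ctor {
    float f;
    S_ctor() : f(0.0f) {}
};

// Sums the f member of every element.
template <typename Array>
float sum(const Array& a) {
    float total = 0.0f;
    for (const auto& e : a)
        total += e.f;
    return total;
}

float sum_in_class_init() {
    std::array<S_in_class, size> a = {};  // aggregate initialization kept
    return sum(a);
}

float sum_default_init() {
    std::array<S_ctor, size> a;  // no "= {}": S_ctor() still runs per element
    return sum(a);
}

// Self-check: both variants leave every element at 0.
void check_variants() {
    assert(sum_in_class_init() == 0.0f);
    assert(sum_default_init() == 0.0f);
}
```

Calling check_variants() from main confirms that the two variants behave the same at run time, so any difference is confined to compile time and code size.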

I'm all for loop unrolling as an optimization, but I don't think this is a very good one. Before I report this as a bug, can someone confirm whether this is the expected behaviour?

Recommended answer

There appears to be a related bug report, Bug 59659 - large zero-initialized std::array compile time excessive. It was considered "fixed" for 4.9.0, so I consider this test case either a regression or an edge case not covered by the patch. For what it's worth, two of the bug report's test cases exhibit symptoms for me on both GCC 4.9.0 and 5.3.1.

There are two more related bug reports:

Bug 68203 - Infinite compilation time on struct with nested array with -std=c++11

Andrew Pinski 2015-11-04 07:56:57 UTC

This is most likely a memory hog which is generating lots of default constructors rather than a loop over them.

That one claims to be a duplicate of this one:

Bug 56671 - Gcc uses large amounts of memory and processor power with large C++11 bitsets

Jonathan Wakely 2016-01-26 15:12:27 UTC

Generating the array initialization for this constexpr constructor is the problem:

  constexpr _Base_bitset(unsigned long long __val) noexcept
  : _M_w{ _WordT(__val)
   } { }

Indeed if we change it to S a[4096] {}; we don't get the problem.
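
A minimal sketch of that change, reusing S from the question with a built-in array in place of std::array. The function names count_zero_elements and check_raw_array are hypothetical and only wrap the declaration so that the result can be checked; the sketch does not by itself show what code g++ generates.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t size = 4096;

// Same type as in the question: a user-provided, non-trivial constructor.
struct S {
    float f;
    S() : f(0.0f) {}
};

// Builds the array and counts how many elements were initialized to zero.
std::size_t count_zero_elements() {
    S a[size] {};  // built-in array instead of std::array<S, size>
    std::size_t zeros = 0;
    for (const auto& e : a)
        if (e.f == 0.0f)
            ++zeros;
    return zeros;
}

// Self-check: every element was constructed with f == 0.
void check_raw_array() {
    assert(count_zero_elements() == size);
}
```

The observable behaviour matches the std::array version, which makes the two easy to swap when comparing compile time and executable size.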

Using perf we can see where GCC is spending most of its time. First:

perf record g++ -std=c++11 -O2 test.cpp

Then perf report:

  10.33%  cc1plus   cc1plus                 [.] get_ref_base_and_extent
   6.36%  cc1plus   cc1plus                 [.] memrefs_conflict_p
   6.25%  cc1plus   cc1plus                 [.] vn_reference_lookup_2
   6.16%  cc1plus   cc1plus                 [.] exp_equiv_p
   5.99%  cc1plus   cc1plus                 [.] walk_non_aliased_vuses
   5.02%  cc1plus   cc1plus                 [.] find_base_term
   4.98%  cc1plus   cc1plus                 [.] invalidate
   4.73%  cc1plus   cc1plus                 [.] write_dependence_p
   4.68%  cc1plus   cc1plus                 [.] estimate_calls_size_and_time
   4.11%  cc1plus   cc1plus                 [.] ix86_find_base_term
   3.41%  cc1plus   cc1plus                 [.] rtx_equal_p
   2.87%  cc1plus   cc1plus                 [.] cse_insn
   2.77%  cc1plus   cc1plus                 [.] record_store
   2.66%  cc1plus   cc1plus                 [.] vn_reference_eq
   2.48%  cc1plus   cc1plus                 [.] operand_equal_p
   1.21%  cc1plus   cc1plus                 [.] integer_zerop
   1.00%  cc1plus   cc1plus                 [.] base_alias_check

This won't mean much to anyone but GCC developers but it's still interesting to see what's taking up so much compilation time.

Clang 3.7.0 does a much better job at this than GCC. At -O2 it takes less than a second to compile, produces a much smaller executable (8960 bytes) and this assembly:

0000000000400810 <main>:
  400810:   53                      push   rbx
  400811:   48 81 ec 00 40 00 00    sub    rsp,0x4000
  400818:   48 8d 3c 24             lea    rdi,[rsp]
  40081c:   31 db                   xor    ebx,ebx
  40081e:   31 f6                   xor    esi,esi
  400820:   ba 00 40 00 00          mov    edx,0x4000
  400825:   e8 56 fe ff ff          call   400680 <memset@plt>
  40082a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
  400830:   f3 0f 10 04 1c          movss  xmm0,DWORD PTR [rsp+rbx*1]
  400835:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
  400839:   bf 60 10 60 00          mov    edi,0x601060
  40083e:   e8 9d fe ff ff          call   4006e0 <_ZNSo9_M_insertIdEERSoT_@plt>
  400843:   48 83 c3 04             add    rbx,0x4
  400847:   48 81 fb 00 40 00 00    cmp    rbx,0x4000
  40084e:   75 e0                   jne    400830 <main+0x20>
  400850:   31 c0                   xor    eax,eax
  400852:   48 81 c4 00 40 00 00    add    rsp,0x4000
  400859:   5b                      pop    rbx
  40085a:   c3                      ret    
  40085b:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

On the other hand, GCC 5.3.1 with no optimizations compiles very quickly but still produces a 95328-byte executable. Compiling with -O2 reduces the executable size to 53912 bytes, but compilation takes 4 seconds. I would definitely report this to their bugzilla.
