std::array with aggregate initialization on g++ generates huge code

On g++ 4.9.2 and 5.3.1, this code takes several seconds to compile and produces a 52,776 byte executable:

#include <array>
#include <iostream>

int main()
{
    constexpr std::size_t size = 4096;

    struct S
    {
        float f;
        S() : f(0.0f) {}
    };

    std::array<S, size> a = {};  // <-- note aggregate initialization

    for (auto& e : a)
        std::cerr << e.f;

    return 0;
}

Increasing size seems to increase compilation time and executable size linearly. I cannot reproduce this behaviour with either clang 3.5 or Visual C++ 2015. Using -Os makes no difference.

$ time g++ -O2 -std=c++11 test.cpp
real    0m4.178s
user    0m4.060s
sys     0m0.068s

Inspecting the assembly code reveals that the initialization of a is unrolled, generating 4096 movl instructions:

main:
.LFB1313:
    .cfi_startproc
    pushq   %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    subq    $16384, %rsp
    .cfi_def_cfa_offset 16400
    movl    $0x00000000, (%rsp)
    movl    $0x00000000, 4(%rsp)
    movq    %rsp, %rbx
    movl    $0x00000000, 8(%rsp)
    movl    $0x00000000, 12(%rsp)
    movl    $0x00000000, 16(%rsp)
       [...skipping 4000 lines...]
    movl    $0x00000000, 16376(%rsp)
    movl    $0x00000000, 16380(%rsp)

This only happens when the element type T (S in the code above) has a non-trivial constructor and the array is initialized using {}. If I do any of the following, g++ generates a simple loop:

  1. Remove S::S();
  2. Remove S::S() and initialize S::f in-class;
  3. Remove the aggregate initialization (= {});
  4. Compile without -O2.
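
As a minimal sketch of variants 2 and 3 from the list above: the names WithCtor, InClassInit, kSize and the two sum_* helpers are made up for illustration and do not appear in the original program. The static_assert only checks the "non-trivial constructor" precondition. It says nothing about what code a given compiler emits.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

constexpr std::size_t kSize = 4096;  // same element count as in the question

// The type from the question: it has a user-provided default constructor.
struct WithCtor
{
    float f;
    WithCtor() : f(0.0f) {}
};

// Variant 2: no user-provided constructor, member initialized in-class.
struct InClassInit
{
    float f = 0.0f;
};

// A user-provided default constructor is never trivial.
// (This trait needs a reasonably recent standard library, e.g. GCC 5 or later.)
static_assert(!std::is_trivially_default_constructible<WithCtor>::value,
              "WithCtor has a non-trivial default constructor");

// Variant 2: keep "= {}" but rely on the in-class initializer.
inline float sum_in_class_init()
{
    std::array<InClassInit, kSize> a = {};
    float total = 0.0f;
    for (const auto& e : a)
        total += e.f;
    return total;
}

// Variant 3: keep the constructor but drop "= {}". Each element is
// default-initialized, which still runs WithCtor() and sets f to 0.
inline float sum_default_init()
{
    std::array<WithCtor, kSize> a;
    float total = 0.0f;
    for (const auto& e : a)
        total += e.f;
    return total;
}
```

Both helpers leave every element at 0.0f, so either can stand in for the original declaration. Compiling a file that calls them with g++ -O2 -std=c++11 -S and reading the assembly is one way to check whether the unrolled stores still appear on a particular GCC version.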

I'm all for loop unrolling as an optimization, but I don't think this is a very good one. Before I report this as a bug, can someone confirm whether this is the expected behaviour?

Recommended answer

There appears to be a related bug report, Bug 59659 - large zero-initialized std::array compile time excessive. It was considered "fixed" for 4.9.0, so I consider this test case either a regression or an edge case not covered by the patch. For what it's worth, two of the bug report's test cases exhibit symptoms for me on both GCC 4.9.0 and 5.3.1.

There are two more related bug reports:

Bug 68203 - Infinite compilation time on struct with nested array of pairs with -std=c++11

Andrew Pinski 2015-11-04 07:56:57 UTC

This is most likely a memory hog which is generating lots of default constructors rather than a loop over them.

That one claims to be a duplicate of this one:

Bug 56671 - Gcc uses large amounts of memory and processor power with large C++11 bitsets

Jonathan Wakely 2016-01-26 15:12:27 UTC

Generating the array initialization for this constexpr constructor is the problem:

  constexpr _Base_bitset(unsigned long long __val) noexcept
  : _M_w{ _WordT(__val)
   } { }

Indeed if we change it to S a[4096] {}; we don't get the problem.
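
A minimal sketch of that built-in array variant, keeping the struct S from the question. The helper name sum_builtin_array is made up for illustration. The empty braces still value-initialize every element through S(), so the observable result matches the std::array version.

```cpp
#include <cstddef>

// Same element type as in the question: user-provided default constructor.
struct S
{
    float f;
    S() : f(0.0f) {}
};

// Built-in array instead of std::array. The empty braces value-initialize
// each element, which calls S() and sets f to 0.
inline float sum_builtin_array()
{
    S a[4096] {};
    float total = 0.0f;
    for (const auto& e : a)
        total += e.f;
    return total;
}
```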

Using perf we can see where GCC is spending most of its time. First:

perf record g++ -std=c++11 -O2 test.cpp

Then perf report:

  10.33%  cc1plus   cc1plus                 [.] get_ref_base_and_extent
   6.36%  cc1plus   cc1plus                 [.] memrefs_conflict_p
   6.25%  cc1plus   cc1plus                 [.] vn_reference_lookup_2
   6.16%  cc1plus   cc1plus                 [.] exp_equiv_p
   5.99%  cc1plus   cc1plus                 [.] walk_non_aliased_vuses
   5.02%  cc1plus   cc1plus                 [.] find_base_term
   4.98%  cc1plus   cc1plus                 [.] invalidate
   4.73%  cc1plus   cc1plus                 [.] write_dependence_p
   4.68%  cc1plus   cc1plus                 [.] estimate_calls_size_and_time
   4.11%  cc1plus   cc1plus                 [.] ix86_find_base_term
   3.41%  cc1plus   cc1plus                 [.] rtx_equal_p
   2.87%  cc1plus   cc1plus                 [.] cse_insn
   2.77%  cc1plus   cc1plus                 [.] record_store
   2.66%  cc1plus   cc1plus                 [.] vn_reference_eq
   2.48%  cc1plus   cc1plus                 [.] operand_equal_p
   1.21%  cc1plus   cc1plus                 [.] integer_zerop
   1.00%  cc1plus   cc1plus                 [.] base_alias_check

This won't mean much to anyone but GCC developers but it's still interesting to see what's taking up so much compilation time.

Clang 3.7.0 does a much better job at this than GCC. At -O2 it takes less than a second to compile, produces a much smaller executable (8960 bytes) and this assembly:

0000000000400810 <main>:
  400810:   53                      push   rbx
  400811:   48 81 ec 00 40 00 00    sub    rsp,0x4000
  400818:   48 8d 3c 24             lea    rdi,[rsp]
  40081c:   31 db                   xor    ebx,ebx
  40081e:   31 f6                   xor    esi,esi
  400820:   ba 00 40 00 00          mov    edx,0x4000
  400825:   e8 56 fe ff ff          call   400680 <memset@plt>
  40082a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
  400830:   f3 0f 10 04 1c          movss  xmm0,DWORD PTR [rsp+rbx*1]
  400835:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
  400839:   bf 60 10 60 00          mov    edi,0x601060
  40083e:   e8 9d fe ff ff          call   4006e0 <_ZNSo9_M_insertIdEERSoT_@plt>
  400843:   48 83 c3 04             add    rbx,0x4
  400847:   48 81 fb 00 40 00 00    cmp    rbx,0x4000
  40084e:   75 e0                   jne    400830 <main+0x20>
  400850:   31 c0                   xor    eax,eax
  400852:   48 81 c4 00 40 00 00    add    rsp,0x4000
  400859:   5b                      pop    rbx
  40085a:   c3                      ret    
  40085b:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

On the other hand, with GCC 5.3.1 and no optimizations, it compiles very quickly but still produces a 95,328-byte executable. Compiling with -O2 reduces the executable size to 53,912 bytes, but compilation takes 4 seconds. I would definitely report this to their Bugzilla.
