std::array with aggregate initialization on g++ generates huge code

2022-01-23 00:00:00 optimization g++ c++ loop-unrolling stdarray

On g++ 4.9.2 and 5.3.1, this code takes several seconds to compile and produces a 52,776 byte executable:

    #include <array>
    #include <iostream>

    int main()
    {
        constexpr std::size_t size = 4096;

        struct S
        {
            float f;
            S() : f(0.0f) {}
        };

        std::array<S, size> a = {};  // <-- note aggregate initialization

        for (auto& e : a)
            std::cerr << e.f;

        return 0;
    }

Increasing size seems to increase compilation time and executable size linearly. I cannot reproduce this behaviour with either clang 3.5 or Visual C++ 2015. Using -Os makes no difference.

    $ time g++ -O2 -std=c++11 test.cpp
    real    0m4.178s
    user    0m4.060s
    sys     0m0.068s

Inspecting the assembly code reveals that the initialization of a is unrolled, generating 4096 movl instructions:

    main:
    .LFB1313:
        .cfi_startproc
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        subq    $16384, %rsp
        .cfi_def_cfa_offset 16400
        movl    $0x00000000, (%rsp)
        movl    $0x00000000, 4(%rsp)
        movq    %rsp, %rbx
        movl    $0x00000000, 8(%rsp)
        movl    $0x00000000, 12(%rsp)
        movl    $0x00000000, 16(%rsp)
        [...skipping 4000 lines...]
        movl    $0x00000000, 16376(%rsp)
        movl    $0x00000000, 16380(%rsp)

This only happens when the element type (S here) has a non-trivial constructor and the array is initialized using {}. If I do any of the following, g++ generates a simple loop:

Remove S::S();

Remove S::S() and initialize S::f in-class;

Remove the aggregate initialization (= {});

Compile without -O2.
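As a rough sketch of the second and third variants listed above: the names `kSize`, `InClass`, `WithCtor`, `make_in_class`, `make_default`, `all_zero` and `check_variants` are hypothetical and not part of the original question. Both variants keep every element at 0.0f. The question reports that g++ 4.9.2 and 5.3.1 emit a simple loop for them; other compiler versions may behave differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSize = 4096;  // hypothetical name; the question calls it `size`

// Variant "remove S::S() and initialize S::f in-class".
struct InClass {
    float f = 0.0f;
};

// Variant "remove the aggregate initialization": the constructor stays.
struct WithCtor {
    float f;
    WithCtor() : f(0.0f) {}
};

std::array<InClass, kSize> make_in_class() {
    std::array<InClass, kSize> a = {};  // `= {}` is kept here
    return a;
}

std::array<WithCtor, kSize> make_default() {
    std::array<WithCtor, kSize> a;  // no `= {}`; WithCtor() runs for each element
    return a;
}

// Returns true when every element's f compares equal to 0.0f.
template <typename Array>
bool all_zero(const Array& a) {
    for (const auto& e : a)
        if (e.f != 0.0f)
            return false;
    return true;
}

// Self-check: both variants still produce fully zeroed arrays.
void check_variants() {
    assert(all_zero(make_in_class()));
    assert(all_zero(make_default()));
}
```

Comparing the output of `g++ -O2 -std=c++11 -S` for these functions against the original program is one way to see whether the unrolled stores are still generated.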

我完全赞成将循环展开作为一种优化，但我认为这不是一个很好的优化.在我将此报告为错误之前，有人可以确认这是否是预期的行为吗?

I'm all for loop unrolling as an optimization, but I don't think this is a very good one. Before I report this as a bug, can someone confirm whether this is the expected behaviour?

Recommended answer

There appears to be a related bug report, Bug 59659 - large zero-initialized std::array compile time excessive. It was considered "fixed" for 4.9.0, so I consider this test case either a regression or an edge case not covered by the patch. For what it's worth, two of the bug report's test cases exhibit symptoms for me on both GCC 4.9.0 and 5.3.1.

There are two more related bug reports:

Bug 68203 - Infinite compilation time on struct with nested array with -std=c++11

Andrew Pinski 2015-11-04 07:56:57 UTC

This is most likely a memory hog which is generating lots of default constructors rather than a loop over them.

That one claims to be a duplicate of this one:

Bug 56671 - Gcc uses large amounts of memory and processor power with large C++11 bitsets

Jonathan Wakely 2016-01-26 15:12:27 UTC

Generating the array initialization for this constexpr constructor is the problem:

constexpr _Base_bitset(unsigned long long __val) noexcept : _M_w{ _WordT(__val) } { }

Indeed if we change it to S a[4096] {}; we don't get the problem.
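A minimal sketch of that change, assuming the same S as in the question. The helpers `sum_builtin` and `check_builtin` are hypothetical and exist only so the result can be checked.

```cpp
#include <cassert>
#include <cstddef>

struct S {
    float f;
    S() : f(0.0f) {}
};

// Same element type and `{}` initializer as the question, but with a
// built-in array in place of std::array<S, 4096>.
float sum_builtin() {
    S a[4096] {};
    float total = 0.0f;
    for (const auto& e : a)
        total += e.f;  // every element was set to 0.0f by S::S()
    return total;
}

// Self-check: all 4096 elements are zero, so the sum is exactly zero.
void check_builtin() {
    assert(sum_builtin() == 0.0f);
}
```

The program's behaviour is unchanged. Only the container type differs, which is what makes the difference in generated code notable.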

Using perf we can see where GCC is spending most of its time. First:

    perf record g++ -std=c++11 -O2 test.cpp

Then perf report:

    10.33%  cc1plus  cc1plus  [.] get_ref_base_and_extent
     6.36%  cc1plus  cc1plus  [.] memrefs_conflict_p
     6.25%  cc1plus  cc1plus  [.] vn_reference_lookup_2
     6.16%  cc1plus  cc1plus  [.] exp_equiv_p
     5.99%  cc1plus  cc1plus  [.] walk_non_aliased_vuses
     5.02%  cc1plus  cc1plus  [.] find_base_term
     4.98%  cc1plus  cc1plus  [.] invalidate
     4.73%  cc1plus  cc1plus  [.] write_dependence_p
     4.68%  cc1plus  cc1plus  [.] estimate_calls_size_and_time
     4.11%  cc1plus  cc1plus  [.] ix86_find_base_term
     3.41%  cc1plus  cc1plus  [.] rtx_equal_p
     2.87%  cc1plus  cc1plus  [.] cse_insn
     2.77%  cc1plus  cc1plus  [.] record_store
     2.66%  cc1plus  cc1plus  [.] vn_reference_eq
     2.48%  cc1plus  cc1plus  [.] operand_equal_p
     1.21%  cc1plus  cc1plus  [.] integer_zerop
     1.00%  cc1plus  cc1plus  [.] base_alias_check

This won't mean much to anyone but GCC developers but it's still interesting to see what's taking up so much compilation time.

Clang 3.7.0 does a much better job at this than GCC. At -O2 it takes less than a second to compile, produces a much smaller executable (8960 bytes) and this assembly:

    0000000000400810 <main>:
      400810: 53                      push   rbx
      400811: 48 81 ec 00 40 00 00    sub    rsp,0x4000
      400818: 48 8d 3c 24             lea    rdi,[rsp]
      40081c: 31 db                   xor    ebx,ebx
      40081e: 31 f6                   xor    esi,esi
      400820: ba 00 40 00 00          mov    edx,0x4000
      400825: e8 56 fe ff ff          call   400680 <memset@plt>
      40082a: 66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
      400830: f3 0f 10 04 1c          movss  xmm0,DWORD PTR [rsp+rbx*1]
      400835: f3 0f 5a c0             cvtss2sd xmm0,xmm0
      400839: bf 60 10 60 00          mov    edi,0x601060
      40083e: e8 9d fe ff ff          call   4006e0 <_ZNSo9_M_insertIdEERSoT_@plt>
      400843: 48 83 c3 04             add    rbx,0x4
      400847: 48 81 fb 00 40 00 00    cmp    rbx,0x4000
      40084e: 75 e0                   jne    400830 <main+0x20>
      400850: 31 c0                   xor    eax,eax
      400852: 48 81 c4 00 40 00 00    add    rsp,0x4000
      400859: 5b                      pop    rbx
      40085a: c3                      ret
      40085b: 0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

On the other hand, GCC 5.3.1 with no optimizations compiles very quickly but still produces a 95,328-byte executable. Compiling with -O2 reduces the executable size to 53,912 bytes, but compilation takes 4 seconds. I would definitely report this to their bugzilla.
