std::array with aggregate initialization on g++ generates huge code

2021-12-20 00:00:00 optimization g++ c++ loop-unrolling stdarray

On g++ 4.9.2 and 5.3.1, this code takes several seconds to compile and produces a 52,776 byte executable:

    #include <array>
    #include <iostream>

    int main()
    {
        constexpr std::size_t size = 4096;

        struct S
        {
            float f;
            S() : f(0.0f) {}
        };

        std::array<S, size> a = {};  // <-- note aggregate initialization

        for (auto& e : a)
            std::cerr << e.f;

        return 0;
    }

Increasing size seems to increase compilation time and executable size linearly. I cannot reproduce this behaviour with either clang 3.5 or Visual C++ 2015. Using -Os makes no difference.

    $ time g++ -O2 -std=c++11 test.cpp
    real    0m4.178s
    user    0m4.060s
    sys     0m0.068s

Inspecting the assembly code reveals that the initialization of a is unrolled, generating 4096 movl instructions:

    main:
    .LFB1313:
        .cfi_startproc
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        subq    $16384, %rsp
        .cfi_def_cfa_offset 16400
        movl    $0x00000000, (%rsp)
        movl    $0x00000000, 4(%rsp)
        movq    %rsp, %rbx
        movl    $0x00000000, 8(%rsp)
        movl    $0x00000000, 12(%rsp)
        movl    $0x00000000, 16(%rsp)
        [...skipping 4000 lines...]
        movl    $0x00000000, 16376(%rsp)
        movl    $0x00000000, 16380(%rsp)

This only happens when T has a non-trivial constructor and the array is initialized using {}. If I do any of the following, g++ generates a simple loop:

Remove S::S();

Remove S::S() and initialize S::f in-class;

Remove the aggregate initialization (= {});

Compile without -O2.

I'm all for loop unrolling as an optimization, but I don't think this is a very good one. Before I report this as a bug, can someone confirm whether this is the expected behaviour?

Recommended answer

There appears to be a related bug report, Bug 59659 - large zero-initialized std::array compile time excessive. It was considered "fixed" for 4.9.0, so I consider this test case either a regression or an edge case not covered by the patch. For what it's worth, two of the bug report's test cases exhibit symptoms for me on both GCC 4.9.0 and 5.3.1.

There are two more related bug reports:

Bug 68203 - About infinite compilation time on struct with nested array of pairs with -std=c++11

Andrew Pinski 2015-11-04 07:56:57 UTC

This is most likely a memory hog which is generating lots of default constructors rather than a loop over them.

That one claims to be a duplicate of this one:

Bug 56671 - Gcc uses large amounts of memory and processor power with large C++11 bitsets

Jonathan Wakely 2016-01-26 15:12:27 UTC

Generating the array initialization for this constexpr constructor is the problem:

constexpr _Base_bitset(unsigned long long __val) noexcept : _M_w{ _WordT(__val) } { }

Indeed if we change it to S a[4096] {}; we don't get the problem.

Using perf we can see where GCC is spending most of its time. First:

    perf record g++ -std=c++11 -O2 test.cpp

Then perf report:

    10.33%  cc1plus  cc1plus  [.] get_ref_base_and_extent
     6.36%  cc1plus  cc1plus  [.] memrefs_conflict_p
     6.25%  cc1plus  cc1plus  [.] vn_reference_lookup_2
     6.16%  cc1plus  cc1plus  [.] exp_equiv_p
     5.99%  cc1plus  cc1plus  [.] walk_non_aliased_vuses
     5.02%  cc1plus  cc1plus  [.] find_base_term
     4.98%  cc1plus  cc1plus  [.] invalidate
     4.73%  cc1plus  cc1plus  [.] write_dependence_p
     4.68%  cc1plus  cc1plus  [.] estimate_calls_size_and_time
     4.11%  cc1plus  cc1plus  [.] ix86_find_base_term
     3.41%  cc1plus  cc1plus  [.] rtx_equal_p
     2.87%  cc1plus  cc1plus  [.] cse_insn
     2.77%  cc1plus  cc1plus  [.] record_store
     2.66%  cc1plus  cc1plus  [.] vn_reference_eq
     2.48%  cc1plus  cc1plus  [.] operand_equal_p
     1.21%  cc1plus  cc1plus  [.] integer_zerop
     1.00%  cc1plus  cc1plus  [.] base_alias_check

This won't mean much to anyone but GCC developers but it's still interesting to see what's taking up so much compilation time.

Clang 3.7.0 does a much better job at this than GCC. At -O2 it takes less than a second to compile, produces a much smaller executable (8960 bytes) and this assembly:

    0000000000400810 <main>:
      400810: 53                      push   rbx
      400811: 48 81 ec 00 40 00 00    sub    rsp,0x4000
      400818: 48 8d 3c 24             lea    rdi,[rsp]
      40081c: 31 db                   xor    ebx,ebx
      40081e: 31 f6                   xor    esi,esi
      400820: ba 00 40 00 00          mov    edx,0x4000
      400825: e8 56 fe ff ff          call   400680 <memset@plt>
      40082a: 66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]
      400830: f3 0f 10 04 1c          movss  xmm0,DWORD PTR [rsp+rbx*1]
      400835: f3 0f 5a c0             cvtss2sd xmm0,xmm0
      400839: bf 60 10 60 00          mov    edi,0x601060
      40083e: e8 9d fe ff ff          call   4006e0 <_ZNSo9_M_insertIdEERSoT_@plt>
      400843: 48 83 c3 04             add    rbx,0x4
      400847: 48 81 fb 00 40 00 00    cmp    rbx,0x4000
      40084e: 75 e0                   jne    400830 <main+0x20>
      400850: 31 c0                   xor    eax,eax
      400852: 48 81 c4 00 40 00 00    add    rsp,0x4000
      400859: 5b                      pop    rbx
      40085a: c3                      ret
      40085b: 0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

On the other hand, GCC 5.3.1 with no optimizations compiles very quickly but still produces a 95,328-byte executable. Compiling with -O2 reduces the executable size to 53,912 bytes, but compilation takes 4 seconds. I would definitely report this to their bugzilla.
