Why isn't this C++ wrapper class being inlined?

2022-01-04 00:00:00 performance compilation assembly c++ c++11

EDIT - Something's up with my build system. I'm still figuring out exactly what, but gcc was producing weird results (even though the source is a .cpp file); once I switched to g++, it worked as expected.

This is a very reduced test-case for something I've been having trouble with, where using a numerical wrapper class (which I thought would be inlined away) made my program 10x slower.

This is independent of optimisation level (tried with -O0 and -O3).

Am I missing some detail in my wrapper class?

I have the following program, in which I define a class which wraps a double and provides the + operator:

    #include <cstdio>
    #include <cstdlib>

    #define INLINE __attribute__((always_inline)) inline

    struct alignas(8) WrappedDouble {
        double value;

        INLINE friend const WrappedDouble operator+(const WrappedDouble& left, const WrappedDouble& right) {
            return {left.value + right.value};
        };
    };

    #define doubleType WrappedDouble // either "double" or "WrappedDouble"

    int main() {
        int N = 100000000;
        doubleType* arr = (doubleType*)malloc(sizeof(doubleType)*N);
        for (int i = 1; i < N; i++) {
            arr[i] = arr[i - 1] + arr[i];
        }
        free(arr);
        printf("done\n");
        return 0;
    }

I thought that this would compile to the same thing as the plain double version: it's doing the same calculations, and everything is inlined.
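One way to sanity-check that expectation is to assert at compile time that the wrapper is no bigger than a double and is a trivially copyable, standard-layout type. The following is only a minimal sketch: `check_wrapper_addition` is a hypothetical helper that is not part of the original program, and passing these checks does not by itself guarantee identical code generation.

```cpp
#include <cassert>
#include <type_traits>

// Same wrapper as in the question, minus the always_inline attribute.
struct alignas(8) WrappedDouble {
    double value;

    friend const WrappedDouble operator+(const WrappedDouble& left,
                                         const WrappedDouble& right) {
        return {left.value + right.value};
    }
};

// Compile-time checks: the wrapper should add no storage or copy overhead.
static_assert(sizeof(WrappedDouble) == sizeof(double),
              "wrapper must be the same size as double");
static_assert(std::is_trivially_copyable<WrappedDouble>::value,
              "wrapper must be trivially copyable");
static_assert(std::is_standard_layout<WrappedDouble>::value,
              "wrapper must be standard layout");

// Hypothetical runtime check using exactly representable values.
inline void check_wrapper_addition() {
    const WrappedDouble sum = WrappedDouble{1.5} + WrappedDouble{2.25};
    assert(sum.value == 3.75);
}
```

If any of the `static_assert` lines fails to compile, the wrapper differs from a bare double in a way that could plausibly affect the generated code.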

However, it doesn't: it produces a larger and slower result, regardless of optimisation level.

(This particular result is not significantly slower, but my actual use-case includes more arithmetic.)

EDIT - I am aware that this isn't constructing my array elements. I thought this might produce less ASM so I could understand it better, but I can change it if it's a problem.

EDIT - I am also aware that I should be using new[]/delete[]. Unfortunately gcc refused to compile that, even though it was in a .cpp file. This was a symptom of my build system being screwed up, which is probably my actual problem.
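For reference, here is a rough sketch of what the new[]/delete[] variant might look like. It assumes the program is built with the g++ driver, which links the C++ standard library by default while plain gcc does not; that difference may be related to the failure described above. The helpers `make_zeroed` and `running_sum` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct alignas(8) WrappedDouble {
    double value;

    friend const WrappedDouble operator+(const WrappedDouble& left,
                                         const WrappedDouble& right) {
        return {left.value + right.value};
    }
};

// Hypothetical helper: allocates n elements with new[]. The trailing ()
// value-initializes them, so every value starts at 0.0. The unique_ptr
// calls delete[] automatically when it goes out of scope.
std::unique_ptr<WrappedDouble[]> make_zeroed(std::size_t n) {
    return std::unique_ptr<WrappedDouble[]>(new WrappedDouble[n]());
}

// Hypothetical helper: the same running-sum loop as in the question.
void running_sum(WrappedDouble* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    for (std::size_t i = 1; i < n; i++) {
        arr[i] = arr[i - 1] + arr[i];
    }
}
```

Unlike the malloc version, every element here is initialized before the loop reads it.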

EDIT - If I use g++ instead of gcc, it produces identical output.

EDIT - I posted the wrong version of the ASM (-O0 instead of -O3), so this section isn't helpful.

I'm using Xcode's gcc on my Mac, on a 64-bit system. The result is the same, aside from the body of the for-loop.

Here's what it produces for the body of the loop if doubleType is double:

    movq    -16(%rbp), %rax
    movl    -20(%rbp), %ecx
    subl    $1, %ecx
    movslq  %ecx, %rdx
    movsd   (%rax,%rdx,8), %xmm0    ## xmm0 = mem[0],zero
    movq    -16(%rbp), %rax
    movslq  -20(%rbp), %rdx
    addsd   (%rax,%rdx,8), %xmm0
    movq    -16(%rbp), %rax
    movslq  -20(%rbp), %rdx
    movsd   %xmm0, (%rax,%rdx,8)

The WrappedDouble version is much longer:

    movq    -40(%rbp), %rax
    movl    -44(%rbp), %ecx
    subl    $1, %ecx
    movslq  %ecx, %rdx
    shlq    $3, %rdx
    addq    %rdx, %rax
    movq    -40(%rbp), %rdx
    movslq  -44(%rbp), %rsi
    shlq    $3, %rsi
    addq    %rsi, %rdx
    movq    %rax, -16(%rbp)
    movq    %rdx, -24(%rbp)
    movq    -16(%rbp), %rax
    movsd   (%rax), %xmm0           ## xmm0 = mem[0],zero
    movq    -24(%rbp), %rax
    addsd   (%rax), %xmm0
    movsd   %xmm0, -8(%rbp)
    movsd   -8(%rbp), %xmm0         ## xmm0 = mem[0],zero
    movsd   %xmm0, -56(%rbp)
    movq    -40(%rbp), %rax
    movslq  -44(%rbp), %rdx
    movq    -56(%rbp), %rsi
    movq    %rsi, (%rax,%rdx,8)

Recommended answer

Both versions result in identical assembly code with g++ and clang++ when you turn on optimizations with -O3.
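To check this locally, one option is to put the two loops into separate functions and compare the assembly the compiler emits for each, for example with `g++ -O3 -S`. The sketch below assumes that setup. The function names `prefix_sum_plain`, `prefix_sum_wrapped` and `check_same_results` are hypothetical and do not appear in the original post.

```cpp
#include <cassert>
#include <vector>

struct alignas(8) WrappedDouble {
    double value;

    friend const WrappedDouble operator+(const WrappedDouble& left,
                                         const WrappedDouble& right) {
        return {left.value + right.value};
    }
};

// The loop from the question, operating on plain doubles.
void prefix_sum_plain(double* arr, int n) {
    for (int i = 1; i < n; i++) {
        arr[i] = arr[i - 1] + arr[i];
    }
}

// The same loop, operating on the wrapper type.
void prefix_sum_wrapped(WrappedDouble* arr, int n) {
    for (int i = 1; i < n; i++) {
        arr[i] = arr[i - 1] + arr[i];
    }
}

// Hypothetical check: both loops must produce the same values.
void check_same_results(int n) {
    std::vector<double> plain(n);
    std::vector<WrappedDouble> wrapped(n);
    for (int i = 0; i < n; i++) {
        plain[i] = 0.5 * i;          // multiples of 0.5 are exact in binary
        wrapped[i].value = 0.5 * i;
    }
    prefix_sum_plain(plain.data(), n);
    prefix_sum_wrapped(wrapped.data(), n);
    for (int i = 0; i < n; i++) {
        assert(plain[i] == wrapped[i].value);
    }
}
```

Comparing the bodies of the two `prefix_sum_*` functions in the generated `.s` file at -O0 and at -O3 shows whether the wrapper costs anything once optimisation is on.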
