gcc 的 asm volatile 是否等同于递归的 gfortran 默认设置?
我只是在 C++
和 Fortran
中使用递归函数,我意识到 Fortran
中的简单递归函数几乎是与其等效的 C++
函数一样快.现在,在进入这个之前,我知道这里有类似的问题,特别是:
I was just playing around with recursive functions in C++
and Fortran
and I realised that a simple recursive function in Fortran
is almost twice as fast as its equivalent C++
function. Now, before getting into this, I know there are similar questions here, specifically:
- 为什么要添加程序集注释导致生成的代码发生如此巨大的变化?
- asm的工作volatile (" : : : 记忆")
- 等效于 gfortran 中的 asm volatile
- Why does adding assembly comments cause such radical change in generated code?
- Working of asm volatile ("" : : : "memory")
- Equivalent to asm volatile in gfortran
然而,我有点更具体和困惑,因为 Fortran 编译器似乎正在做你可以在 gcc
中使用 asm volatile
实现的功能.为了给您一些上下文,让我们考虑以下递归 Fibonacci number
实现:
However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile
in gcc
. To give you some context let's consider the following recursive Fibonacci number
implementation:
Fortran 代码:
module test
implicit none
private
public fib
contains
! Fibonacci function
integer recursive function fib(n) result(r)
integer, intent(in) :: n
if (n < 2) then
r = n
else
r = fib(n-1) + fib(n-2)
end if
end function ! end of Fibonacci function
end module
program fibonacci
use test, only: fib
implicit none
integer :: r,i
integer :: n = 1e09
real(8) :: start, finish, cum_time
cum_time=0
do i= 1,n
call cpu_time(start)
r = fib(20)
call cpu_time(finish)
cum_time = cum_time + (finish - start)
if (cum_time >0.5) exit
enddo
print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us'
end program
编译:
gfortran -O3 -march=native
C++ 代码:
#include <iostream>
#include <chrono>
using namespace std;
// Fib function
int fib(const int n)
{
int r;
if (n < 2)
r = n;
else
r = fib(n-1) + fib(n-2);
return r;
} // end of fib
template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
double counter = 1.0;
double mean_time = 0.0;
for (auto iter=0; iter<1e09; ++iter){
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
func(args...);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
mean_time += elapsed_seconds.count();
counter++;
if (mean_time > 0.5){
mean_time /= counter;
std::cout << static_cast<long int>(counter)
<< " runs, average elapsed time is "
<< mean_time/1.0e-06 << " xC2xB5s" << std::endl;
break;
}
}
return mean_time;
}
int main(){
timeit(fib,20);
return 0;
}
编译:
g++ -O3 -march=native
时间:
Fortran: 24991 runs, average elapsed time is 20.087 us
C++ : 12355 runs, average elapsed time is 40.471 μs
所以 gfortran
的速度是 gcc
的两倍.查看汇编代码,我得到了
So gfortran
is twice as fast there compared to gcc
. Looking at the assembly code, I get
程序集(Fortran):
.L28:
cmpl $1, %r13d
jle .L29
leal -8(%rbx), %eax
movl %ecx, 12(%rsp)
movl %eax, 48(%rsp)
leaq 48(%rsp), %rdi
leal -9(%rbx), %eax
movl %eax, 16(%rsp)
call __bench_MOD_fib
leaq 16(%rsp), %rdi
movl %eax, %r13d
call __bench_MOD_fib
movl 12(%rsp), %ecx
addl %eax, %r13d
程序集 (C++):
.L28:
movl 72(%rsp), %edx
cmpl $1, %edx
movl %edx, %eax
jle .L33
subl $3, %eax
movl $0, 52(%rsp)
movl %eax, %esi
movl %eax, 96(%rsp)
movl 92(%rsp), %eax
shrl %eax
movl %eax, 128(%rsp)
addl %eax, %eax
subl %eax, %esi
movl %edx, %eax
subl $1, %eax
movl %esi, 124(%rsp)
movl %eax, 76(%rsp)
两个汇编代码都由一遍又一遍重复的几乎相似的块/标签组成.如您所见,Fortran 程序集对 fib
函数进行了两次调用,而在 C++ 程序集中,gcc
可能已经展开了所有可能需要更多堆栈 push 的递归调用/pop
和尾部跳跃.
Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib
function whereas in the C++ assembly, gcc
has probably unrolled all the recursive calls which probably requires more stack push/pop
and tail jumps.
现在,如果我像这样在 C++ 代码中添加一个内联汇编注释
Now if I just put one inline assembly comment in the C++ code like so
修改后的 C++ 代码:
// Fib function
int fib(const int n)
{
int r;
if (n < 2)
r = n;
else
r = fib(n-1) + fib(n-2);
asm("");
return r;
} // end of fib
生成的汇编代码,修改为
The generated assembly code, changes to
程序集(C++ 修改):
.L7:
cmpl $1, %edx
jle .L17
leal -4(%rbx), %r13d
leal -5(%rbx), %edx
cmpl $1, %r13d
jle .L19
leal -5(%rbx), %r14d
cmpl $1, %r14d
jle .L55
leal -6(%rbx), %r13d
movl %r13d, %edi
call _Z3fibi
leal -7(%rbx), %edi
movl %eax, %r15d
call _Z3fibi
movl %r13d, %edi
addl %eax, %r15d
您现在可以看到对 fib
函数的两次调用.计时他们给了我
You can now see two calls to fib
function. Timing them gives me
时间:
Fortran: 24991 runs, average elapsed time is 20.087 us
C++ : 25757 runs, average elapsed time is 19.412 μs
我知道没有输出的 asm
和 asm volatile
的效果是抑制激进的编译器优化,但在这种情况下,gcc
认为太聪明了,但最终生成了效率较低的代码.
I know the effect of asm
with no output and asm volatile
is to suppress aggressive compiler optimisations but in this case, gcc
thought it was too smart but ended up generating a less efficient code in the first place.
所以问题是:
- 为什么
gcc
看不到这个优化",而gfortan
明明可以? - 内联流水线必须在 return 语句之前.放在别处,它不会有任何效果.为什么?
- 这种行为编译器是特定的吗?例如,您可以使用 clang/MSVC 模仿相同的行为吗?
- 在
C
或C++
中是否有更安全的方法可以使递归更快(不依赖内联汇编或迭代式编码)?也许可变参数模板?
- Why can
gcc
not see this "optimisation", whengfortan
clearly can? - The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
- Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
- Are there safer ways to make recursions faster in
C
orC++
(without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?
更新:
- 上面显示的结果都是
gcc 4.8.4
.我也试过用gcc 4.9.2
和gcc 5.2
编译它,我得到了相同的结果. - 如果我将输入参数声明为 volatile 即
(volatile int n)
而不是而不是将
,虽然这会导致我的机器上的运行时间稍微变慢.asm
(const int n) - 正如 Michael Karcher 所提到的,我们可以改为传递
-fno-optimize-sibling-calls
标志来解决这个问题.由于此标志在-O2
级别及更高级别激活,因此即使使用-O1
进行编译也可以解决此问题. - 我用
clang 3.5.1
和-O3 -march=native
运行了相同的示例,虽然情况不一样,clang
似乎也可以使用asm
生成更快的代码.
- The result shown above are all with
gcc 4.8.4
. I have also tried compiling it withgcc 4.9.2
andgcc 5.2
and I get identical results. - The problem can also be replicated (fixed?) if instead of putting
asm
I declare the input argument as volatile i.e.(volatile int n)
instead of(const int n)
, although this will result in a tad bit slower run-time on my machine. - As Michael Karcher has mentioned, we can instead pass the
-fno-optimize-sibling-calls
flag to fix this problem. Since this flag is activated at-O2
level and beyond, even compiling with-O1
fixes this problem. - I have run the same example with
clang 3.5.1
with-O3 -march=native
and though the situation is not identical,clang
also appears to generate faster code withasm
.
Clang 计时:
clang++ w/o asm : 8846 runs, average elapsed time is 56.4555 μs
clang++ with asm : 10427 runs, average elapsed time is 47.8991 μs
推荐答案
请参阅本答案末尾附近的粗体字,了解如何获得由 gcc 生成的快速程序.阅读四个问题的答案.
See the bold print near the end of this answer on how to get a fast program generated by gcc. Read the answer for replies to the four questions.
您的第一个问题假设 gfortran
能够看到 gcc
未能看到的优化可能性.事实上,情况正好相反.gcc
识别出一些它认为是优化可能性的东西,而 gfortran
错过了它.唉,gcc
是错误的,它应用的优化结果是您的系统损失了 100% 的速度(与我的系统相当).
Your first question assumes that gfortran
is able to see an optimization possibility that gcc
failed to see. In fact, the opposite is true. gcc
identified something that it considered to be an optimization possibility, while gfortran
missed it. Alas, gcc
was wrong and the optimization it applied turns out to be a 100% speed loss on your system (comparable on mine).
解决您的第二个问题:asm
语句阻止了使 gcc
看到错误优化可能性的内部转换.如果没有 asm
语句,您的代码(有效)转换为:
To address your second question: The asm
statement prevented an internal transformation that made gcc
see the false optimization possiblity. Without the asm
statement, your code got (validly) transformed to:
int fib(const int n)
{
if (n < 2)
return n;
else
return fib(n-1) + fib(n-2);
}
包含递归调用的 return 语句会触发使您的代码悲观的兄弟调用优化".包含 asm 语句可防止在其间移动返回指令.
The return statement containing the recursive call triggers the "sibling calls optimization" that pessimizes your code. Including the asm statement prevents moving the return instruction across it.
目前,我手头只有 gcc,所以我无法尝试其他编译器的行为来通过证据回答您的第三个问题,但这似乎绝对依赖于编译器.您遇到了 gcc 的一个怪癖(或错误,无论您如何称呼它),它在尝试优化它时会生成错误的代码.不同编译器的优化器差异很大,因此很可能其他一些编译器不会像 gcc
那样错误地优化您的代码.另一方面,用于优化的代码转换是一个经过充分研究的主题,大多数编译器都在实现类似的优化方法,因此另一个编译器可能会与 gcc
陷入相同的陷阱.
Currently, I only have gcc at hand, so I can't try the behaviour of other compilers to answer your third question by evidence, but this seems definitely compiler dependent. You ran into a quirk (or bug, however you call it) of gcc that it generates bad code while trying to optimize it. The optimizers of different compilers are highly different, so it is quite likely that some other compiler doesn't mis-optimize your code like gcc
does. On the other hand, code transformations for optimization is a well-researched topic, and most compilers are implementing similar approaches to optimization, so it is possible that another compiler steps into the same trap as gcc
.
回答你最后一个问题:这不是 C/C++ 与 Fortan 的问题,而是 gcc
的问题,它搞砸了这个示例程序(可能还有类似的生产程序).所以没有办法在C++
中使递归更快,但是有一种方法可以加快这个例子在gcc
,通过禁用有问题的优化:-fno-optimize-sibling-calls
,这会导致(在我的系统上,在单个测试运行中)比只需插入 asm
语句.
To address you last question: It is not a problem about C/C++ versus Fortan, but a problem about gcc
that messes up this example program (and likely similar production programs). So there is no way to make recursion faster in C++
, but there is a way to speed up this example in gcc
, by disabling the problematic optimization: -fno-optimize-sibling-calls
, which results (on my system, in a single test run) in even faster code than just inserting the asm
statement.
相关文章