gcc 的 asm volatile 是否等同于递归的 gfortran 默认设置?

2022-01-14 00:00:00 gcc fortran optimization assembly c++

我只是在 C++Fortran 中使用递归函数,我意识到 Fortran 中的简单递归函数几乎是与其等效的 C++ 函数一样快.现在,在进入这个之前,我知道这里有类似的问题,特别是:

I was just playing around with recursive functions in C++ and Fortran and I realised that a simple recursive function in Fortran is almost twice as fast as its equivalent C++ function. Now, before getting into this, I know there are similar questions here, specifically:

  1. 为什么要添加程序集注释导致生成的代码发生如此巨大的变化?
  2. asm的工作volatile (" : : : 记忆")
  3. 等效于 gfortran 中的 asm volatile
  1. Why does adding assembly comments cause such radical change in generated code?
  2. Working of asm volatile ("" : : : "memory")
  3. Equivalent to asm volatile in gfortran

然而,我有点更具体和困惑,因为 Fortran 编译器似乎正在做你可以在 gcc 中使用 asm volatile 实现的功能.为了给您一些上下文,让我们考虑以下递归 Fibonacci number 实现:

However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile in gcc. To give you some context let's consider the following recursive Fibonacci number implementation:

Fortran 代码:

module test
implicit none
private
public fib

contains

! Fibonacci function
integer recursive function fib(n) result(r)
    integer, intent(in) :: n
    if (n < 2) then
        r = n
    else
        r = fib(n-1) + fib(n-2)
    end if
end function  ! end of Fibonacci function
end module

program fibonacci
use test, only: fib
implicit none
integer :: r,i 
integer :: n = 1e09
real(8) :: start, finish, cum_time

cum_time=0
do i= 1,n 
    call cpu_time(start)
    r = fib(20)
    call cpu_time(finish) 
    cum_time = cum_time + (finish - start)
    if (cum_time >0.5) exit
enddo  

print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us' 
end program

编译:

gfortran -O3 -march=native

C++ 代码:

#include <iostream>
#include <chrono>
using namespace std;

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    return r;
} // end of fib

template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
    double counter = 1.0;
    double mean_time = 0.0;
    for (auto iter=0; iter<1e09; ++iter){
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();

        func(args...);

        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;

        mean_time += elapsed_seconds.count();
        counter++;

        if (mean_time > 0.5){
            mean_time /= counter;
            std::cout << static_cast<long int>(counter)
            << " runs, average elapsed time is "
            << mean_time/1.0e-06 << " xC2xB5s" << std::endl; 
            break;
        }
    }
    return mean_time;
}

int main(){
    timeit(fib,20);
    return 0;
}

编译:

g++ -O3 -march=native

时间:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 12355 runs, average elapsed time is 40.471 μs

所以 gfortran 的速度是 gcc 的两倍.查看汇编代码,我得到了

So gfortran is twice as fast there compared to gcc. Looking at the assembly code, I get

程序集(Fortran):

.L28:
    cmpl    $1, %r13d
    jle .L29
    leal    -8(%rbx), %eax
    movl    %ecx, 12(%rsp)
    movl    %eax, 48(%rsp)
    leaq    48(%rsp), %rdi
    leal    -9(%rbx), %eax
    movl    %eax, 16(%rsp)
    call    __bench_MOD_fib
    leaq    16(%rsp), %rdi
    movl    %eax, %r13d
    call    __bench_MOD_fib
    movl    12(%rsp), %ecx
    addl    %eax, %r13d

程序集 (C++):

.L28:
    movl    72(%rsp), %edx
    cmpl    $1, %edx
    movl    %edx, %eax
    jle .L33
    subl    $3, %eax
    movl    $0, 52(%rsp)
    movl    %eax, %esi
    movl    %eax, 96(%rsp)
    movl    92(%rsp), %eax
    shrl    %eax
    movl    %eax, 128(%rsp)
    addl    %eax, %eax
    subl    %eax, %esi
    movl    %edx, %eax
    subl    $1, %eax
    movl    %esi, 124(%rsp)
    movl    %eax, 76(%rsp)

两个汇编代码都由一遍又一遍重复的几乎相似的块/标签组成.如您所见,Fortran 程序集对 fib 函数进行了两次调用,而在 C++ 程序集中,gcc 可能已经展开了所有可能需要更多堆栈 push 的递归调用/pop 和尾部跳跃.

Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib function whereas in the C++ assembly, gcc has probably unrolled all the recursive calls which probably requires more stack push/pop and tail jumps.

现在,如果我像这样在 C++ 代码中添加一个内联汇编注释

Now if I just put one inline assembly comment in the C++ code like so

修改后的 C++ 代码:

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    asm("");
    return r;
} // end of fib

生成的汇编代码,修改为

The generated assembly code, changes to

程序集(C++ 修改):

.L7:
    cmpl    $1, %edx
    jle .L17
    leal    -4(%rbx), %r13d
    leal    -5(%rbx), %edx
    cmpl    $1, %r13d
    jle .L19
    leal    -5(%rbx), %r14d
    cmpl    $1, %r14d
    jle .L55
    leal    -6(%rbx), %r13d
    movl    %r13d, %edi
    call    _Z3fibi
    leal    -7(%rbx), %edi
    movl    %eax, %r15d
    call    _Z3fibi
    movl    %r13d, %edi
    addl    %eax, %r15d

您现在可以看到对 fib 函数的两次调用.计时他们给了我

You can now see two calls to fib function. Timing them gives me

时间:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 25757 runs, average elapsed time is 19.412 μs

我知道没有输出的 asmasm volatile 的效果是抑制激进的编译器优化,但在这种情况下,gcc 认为太聪明了,但最终生成了效率较低的代码.

I know the effect of asm with no output and asm volatile is to suppress aggressive compiler optimisations but in this case, gcc thought it was too smart but ended up generating a less efficient code in the first place.

所以问题是:

  • 为什么 gcc 看不到这个优化",而 gfortan 明明可以?
  • 内联流水线必须在 return 语句之前.放在别处,它不会有任何效果.为什么?
  • 这种行为编译器是特定的吗?例如,您可以使用 clang/MSVC 模仿相同的行为吗?
  • CC++ 中是否有更安全的方法可以使递归更快(不依赖内联汇编或迭代式编码)?也许可变参数模板?
  • Why can gcc not see this "optimisation", when gfortan clearly can?
  • The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
  • Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
  • Are there safer ways to make recursions faster in C or C++ (without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?

更新:

  • 上面显示的结果都是gcc 4.8.4.我也试过用 gcc 4.9.2gcc 5.2 编译它,我得到了相同的结果.
  • 如果我将输入参数声明为 volatile 即 (volatile int n) 而不是 而不是将 asm(const int n),虽然这会导致我的机器上的运行时间稍微变慢.
  • 正如 Michael Karcher 所提到的,我们可以改为传递 -fno-optimize-sibling-calls 标志来解决这个问题.由于此标志在 -O2 级别及更高级别激活,因此即使使用 -O1 进行编译也可以解决此问题.
  • 我用 clang 3.5.1-O3 -march=native 运行了相同的示例,虽然情况不一样,clang 似乎也可以使用 asm 生成更快的代码.
  • The result shown above are all with gcc 4.8.4. I have also tried compiling it with gcc 4.9.2 and gcc 5.2 and I get identical results.
  • The problem can also be replicated (fixed?) if instead of putting asm I declare the input argument as volatile i.e. (volatile int n) instead of (const int n), although this will result in a tad bit slower run-time on my machine.
  • As Michael Karcher has mentioned, we can instead pass the -fno-optimize-sibling-calls flag to fix this problem. Since this flag is activated at -O2 level and beyond, even compiling with -O1 fixes this problem.
  • I have run the same example with clang 3.5.1 with -O3 -march=native and though the situation is not identical, clang also appears to generate faster code with asm.

Clang 计时:

clang++ w/o asm    :  8846 runs, average elapsed time is 56.4555 μs
clang++ with asm   : 10427 runs, average elapsed time is 47.8991 μs 

推荐答案

请参阅本答案末尾附近的粗体字,了解如何获得由 gcc 生成的快速程序.阅读四个问题的答案.

See the bold print near the end of this answer on how to get a fast program generated by gcc. Read the answer for replies to the four questions.

您的第一个问题假设 gfortran 能够看到 gcc 未能看到的优化可能性.事实上,情况正好相反.gcc 识别出一些它认为是优化可能性的东西,而 gfortran 错过了它.唉,gcc 是错误的,它应用的优化结果是您的系统损失了 100% 的速度(与我的系统相当).

Your first question assumes that gfortran is able to see an optimization possibility that gcc failed to see. In fact, the opposite is true. gcc identified something that it considered to be an optimization possibility, while gfortran missed it. Alas, gcc was wrong and the optimization it applied turns out to be a 100% speed loss on your system (comparable on mine).

解决您的第二个问题:asm 语句阻止了使 gcc 看到错误优化可能性的内部转换.如果没有 asm 语句,您的代码(有效)转换为:

To address your second question: The asm statement prevented an internal transformation that made gcc see the false optimization possiblity. Without the asm statement, your code got (validly) transformed to:

int fib(const int n)
{
    if (n < 2)
        return n;
    else
        return fib(n-1) + fib(n-2);
}

包含递归调用的 return 语句会触发使您的代码悲观的兄弟调用优化".包含 asm 语句可防止在其间移动返回指令.

The return statement containing the recursive call triggers the "sibling calls optimization" that pessimizes your code. Including the asm statement prevents moving the return instruction across it.

目前,我手头只有 gcc,所以我无法尝试其他编译器的行为来通过证据回答您的第三个问题,但这似乎绝对依赖于编译器.您遇到了 gcc 的一个怪癖(或错误,无论您如何称呼它),它在尝试优化它时会生成错误的代码.不同编译器的优化器差异很大,因此很可能其他一些编译器不会像 gcc 那样错误地优化您的代码.另一方面,用于优化的代码转换是一个经过充分研究的主题,大多数编译器都在实现类似的优化方法,因此另一个编译器可能会与 gcc 陷入相同的陷阱.

Currently, I only have gcc at hand, so I can't try the behaviour of other compilers to answer your third question by evidence, but this seems definitely compiler dependent. You ran into a quirk (or bug, however you call it) of gcc that it generates bad code while trying to optimize it. The optimizers of different compilers are highly different, so it is quite likely that some other compiler doesn't mis-optimize your code like gcc does. On the other hand, code transformations for optimization is a well-researched topic, and most compilers are implementing similar approaches to optimization, so it is possible that another compiler steps into the same trap as gcc.

回答你最后一个问题:这不是 C/C++ 与 Fortan 的问题,而是 gcc 的问题,它搞砸了这个示例程序(可能还有类似的生产程序).所以没有办法在C++中使递归更快,但是有一种方法可以加快这个例子在gcc,通过禁用有问题的优化:-fno-optimize-sibling-calls,这会导致(在我的系统上,在单个测试运行中)比只需插入 asm 语句.

To address you last question: It is not a problem about C/C++ versus Fortan, but a problem about gcc that messes up this example program (and likely similar production programs). So there is no way to make recursion faster in C++, but there is a way to speed up this example in gcc, by disabling the problematic optimization: -fno-optimize-sibling-calls, which results (on my system, in a single test run) in even faster code than just inserting the asm statement.

相关文章