高度优化的矩阵乘法代码在 MSVC 和 GCC 之间的性能差异

2021-12-18 00:00:00 gcc assembly visual-c++ c++ x86

对于 Ivy Bridge 系统,我发现在 MSVC(在 Windows 上)和 GCC(在 Linux 上)中编译的代码之间的性能存在很大差异.该代码执行密集矩阵乘法.我使用 GCC 获得了 70% 的峰值失败率,而使用 MSVC 只有 50%.我想我可能已经隔离了它们如何转换以下三个内在函数的差异.

I'm seeing a big difference in performance between code compiled in MSVC (on Windows) and GCC (on Linux) for an Ivy Bridge system. The code does dense matrix multiplication. I'm getting 70% of the peak flops with GCC and only 50% with MSVC. I think I may have isolated the difference to how they both convert the following three intrinsics.

__m256 breg0 = _mm256_loadu_ps(&b[8*i])
_mm256_add_ps(_mm256_mul_ps(arge0,breg0), tmp0)

海湾合作委员会这样做

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

MSVC 做到这一点

vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm3, ymm1, ymm3

有人可以向我解释一下这两种解决方案是否以及为什么会产生如此大的性能差异?

尽管 MSVC 少使用了一条指令,但它会将负载与多条指令联系起来,也许这使它更加依赖(也许负载不能乱序完成)?我的意思是 Ivy Bridge 可以在一个时钟周期内完成一个 AVX 加载、一个 AVX 多任务和一个 AVX 添加,但这要求每个操作都是独立的.

Despite MSVC using one less instruction it ties the load to the mult and maybe that makes it more dependent (maybe the load can't be done out of order)? I mean Ivy Bridge can do one AVX load, one AVX mult, and one AVX add in one clock cycle but this requires each operation to be independent.

也许问题出在其他地方?您可以在下面看到最内层循环的 GCC 和 MSVC 的完整汇编代码.您可以在此处查看循环的 C++ 代码 循环展开以使用 Ivy Bridge 和 Haswell 实现最大吞吐量

Maybe the problem lies elsewhere? You can see the full assembly code for GCC and MSVC for the innermost loop below. You can see the C++ code for the loop here Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

g++ -S -masm=intel matrix.cpp -O3 -mavx -fopenmp

g++ -S -masm=intel matrix.cpp -O3 -mavx -fopenmp

.L4:
    vbroadcastss    ymm0, DWORD PTR [rcx+rdx*4]
    add rdx, 1
    add rax, 256
    vmovups ymm9, YMMWORD PTR [rax-256]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm8, ymm8, ymm9
    vmovups ymm9, YMMWORD PTR [rax-224]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm7, ymm7, ymm9
    vmovups ymm9, YMMWORD PTR [rax-192]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm6, ymm6, ymm9
    vmovups ymm9, YMMWORD PTR [rax-160]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm5, ymm5, ymm9
    vmovups ymm9, YMMWORD PTR [rax-128]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm4, ymm4, ymm9
    vmovups ymm9, YMMWORD PTR [rax-96]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm3, ymm3, ymm9
    vmovups ymm9, YMMWORD PTR [rax-64]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm2, ymm2, ymm9
    vmovups ymm9, YMMWORD PTR [rax-32]
    cmp esi, edx
    vmulps  ymm0, ymm0, ymm9
    vaddps  ymm1, ymm1, ymm0
    jg  .L4

MSVC/FAc/O2/openmp/arch:AVX ...

MSVC /FAc /O2 /openmp /arch:AVX ...

vbroadcastss ymm2, DWORD PTR [r10]    
lea  rax, QWORD PTR [rax+256]
lea  r10, QWORD PTR [r10+4] 
vmulps   ymm1, ymm2, YMMWORD PTR [rax-320]
vaddps   ymm3, ymm1, ymm3    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-288]
vaddps   ymm4, ymm1, ymm4    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm5, ymm1, ymm5    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-224]
vaddps   ymm6, ymm1, ymm6    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-192]
vaddps   ymm7, ymm1, ymm7    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-160]
vaddps   ymm8, ymm1, ymm8    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-128]
vaddps   ymm9, ymm1, ymm9    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-96]
vaddps   ymm10, ymm1, ymm10    
dec  rdx
jne  SHORT $LL3@AddDot4x4_

我通过将总浮点运算计算为 2.0*n^3 来对代码进行基准测试,其中 n 是方阵的宽度并除以 omp_get_wtime().我重复循环几次.在下面的输出中,我重复了 100 次.

I benchmark the code by claculating the total floating point operations as 2.0*n^3 where n is the width of the square matrix and dividing by the time measured with omp_get_wtime(). I repeat the loop several times. In the output below I repeated it 100 times.

MSVC2012 在 Intel Xeon E5 1620 (Ivy Bridge) turbo 上的所有内核输出为 3.7 GHz

Output from MSVC2012 on an Intel Xeon E5 1620 (Ivy Bridge) turbo for all cores is 3.7 GHz

maximum GFLOPS = 236.8 = (8-wide SIMD) * (1 AVX mult + 1 AVX add) * (4 cores) * 3.7 GHz

n   64,     0.02 ms, GFLOPs   0.001, GFLOPs/s   23.88, error 0.000e+000, efficiency/core   40.34%, efficiency  10.08%, mem 0.05 MB
n  128,     0.05 ms, GFLOPs   0.004, GFLOPs/s   84.54, error 0.000e+000, efficiency/core  142.81%, efficiency  35.70%, mem 0.19 MB
n  192,     0.17 ms, GFLOPs   0.014, GFLOPs/s   85.45, error 0.000e+000, efficiency/core  144.34%, efficiency  36.09%, mem 0.42 MB
n  256,     0.29 ms, GFLOPs   0.034, GFLOPs/s  114.48, error 0.000e+000, efficiency/core  193.37%, efficiency  48.34%, mem 0.75 MB
n  320,     0.59 ms, GFLOPs   0.066, GFLOPs/s  110.50, error 0.000e+000, efficiency/core  186.66%, efficiency  46.67%, mem 1.17 MB
n  384,     1.39 ms, GFLOPs   0.113, GFLOPs/s   81.39, error 0.000e+000, efficiency/core  137.48%, efficiency  34.37%, mem 1.69 MB
n  448,     3.27 ms, GFLOPs   0.180, GFLOPs/s   55.01, error 0.000e+000, efficiency/core   92.92%, efficiency  23.23%, mem 2.30 MB
n  512,     3.60 ms, GFLOPs   0.268, GFLOPs/s   74.63, error 0.000e+000, efficiency/core  126.07%, efficiency  31.52%, mem 3.00 MB
n  576,     3.93 ms, GFLOPs   0.382, GFLOPs/s   97.24, error 0.000e+000, efficiency/core  164.26%, efficiency  41.07%, mem 3.80 MB
n  640,     5.21 ms, GFLOPs   0.524, GFLOPs/s  100.60, error 0.000e+000, efficiency/core  169.93%, efficiency  42.48%, mem 4.69 MB
n  704,     6.73 ms, GFLOPs   0.698, GFLOPs/s  103.63, error 0.000e+000, efficiency/core  175.04%, efficiency  43.76%, mem 5.67 MB
n  768,     8.55 ms, GFLOPs   0.906, GFLOPs/s  105.95, error 0.000e+000, efficiency/core  178.98%, efficiency  44.74%, mem 6.75 MB
n  832,    10.89 ms, GFLOPs   1.152, GFLOPs/s  105.76, error 0.000e+000, efficiency/core  178.65%, efficiency  44.66%, mem 7.92 MB
n  896,    13.26 ms, GFLOPs   1.439, GFLOPs/s  108.48, error 0.000e+000, efficiency/core  183.25%, efficiency  45.81%, mem 9.19 MB
n  960,    16.36 ms, GFLOPs   1.769, GFLOPs/s  108.16, error 0.000e+000, efficiency/core  182.70%, efficiency  45.67%, mem 10.55 MB
n 1024,    17.74 ms, GFLOPs   2.147, GFLOPs/s  121.05, error 0.000e+000, efficiency/core  204.47%, efficiency  51.12%, mem 12.00 MB

推荐答案

既然我们已经讨论了对齐问题,我猜是这样的:http://en.wikipedia.org/wiki/Out-of-order_execution

Since we've covered the alignment issue, I would guess it's this: http://en.wikipedia.org/wiki/Out-of-order_execution

由于 g++ 发出独立加载指令,您的处理器可以重新排序指令以预取下一个需要的数据,同时进行加法和乘法.MSVC 在 mul 处抛出一个指针会使 load 和 mul 绑定到同一条指令,因此更改指令的执行顺序没有任何帮助.

Since g++ issues a standalone load instruction, your processor can reorder the instructions to be pre-fetching the next data that will be needed while also adding and multiplying. MSVC throwing a pointer at mul makes the load and mul tied to the same instruction, so changing the execution order of the instructions doesn't help anything.

英特尔的服务器与所有文档今天不那么生气了,所以这里有更多关于为什么乱序执行是(部分)答案的研究.

Intel's server(s) with all the docs are less angry today, so here's more research on why out of order execution is (part of) the answer.

首先,您的评论似乎完全正确,即乘法指令的 MSVC 版本可以解码为可以由 CPU 的乱序引擎优化的单独 μ-op.这里有趣的部分是现代微码排序器是可编程的,因此实际行为取决于硬件和固件.生成的程序集的差异似乎来自 GCC 和 MSVC,每个都试图解决不同的潜在瓶颈.GCC 版本试图给乱序引擎留出余地(正如我们已经介绍过的).但是,MSVC 版本最终利用了一种称为微操作融合"的功能.这是因为 μ-op 退休限制.管道的末端每个滴答只能退出 3 μ-ops.在特定情况下,微操作融合需要两个微操作,必须在两个不同的执行单元(即内存读取和算术)上完成,并在大多数情况下将它们绑定到单个微操作管道.融合的 μ-op 仅在执行单元分配之前被拆分为两个真正的 μ-op.执行后,操作员再次融合,让他们作为一个退休.

First of all, it looks like your comment is completely right about it being possible for the MSVC version of the multiplication instruction to decode to separate μ-ops that can be optimized by a CPU's out of order engine. The fun part here is that modern microcode sequencers are programmable, so the actual behavior is both hardware and firmware dependent. The differences in the generated assembly seems to be from GCC and MSVC each trying to fight different potential bottlenecks. The GCC version tries to give leeway to the out of order engine (as we've already covered). However, the MSVC version ends up taking advantage of a feature called "micro-op fusion". This is because of the μ-op retirement limitations. The end of the pipeline can only retire 3 μ-ops per tick. Micro-op fusion, in specific cases, takes two μ-ops that must be done on two different execution units (i.e. memory read and arithmetic) and ties them to a single μ-op for most of the pipeline. The fused μ-op is only split into the two real μ-ops right before execution unit assignment. After the execution, the ops are fused again, allowing them to be retired as one.

乱序引擎只能看到融合的 μ-op,因此它无法将负载 op 从乘法中拉开.这会导致管道在等待下一个操作数完成其总线传输过程中挂起.

The out of order engine only sees the fused μ-op, so it can't pull the load op away from the multiplication. This causes the pipeline to hang while waiting for the next operand to finish its bus ride.

所有链接!!!:http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/optimizing_assembly.pdf

http://www.agner.org/optimize/instruction_tables.ods(注意:Excel 抱怨此电子表格部分损坏或以其他方式粗略,因此打开风险自负.不过,它似乎不是恶意的,并且根据我的其他研究,Agner Fog 很棒.在我选择之后- 在 Excel 恢复步骤中,我发现它充满了大量数据)

http://www.agner.org/optimize/instruction_tables.ods (NOTE: Excel complains that this spreadsheet is partially corrupted or otherwise sketchy, so open at your own risk. It doesn't seem to be malicious, though, and according to the rest of my research, Agner Fog is awesome. After I opted-in to the Excel recovery step, I found it full of tons of great data)

http://cs.nyu.edu/courses/fall13/CSCI-GA.3033-008/Microprocessor-Report-Sandy-Bridge-Spans-Generations-243901.pdf

http://www.syncfusion.com/Content/downloads/ebook/Assembly_Language_Succinctly.pdf

后期哇,这里的讨论有一些有趣的更新.我想我误会了微操作融合实际影响了多少管道.也许从循环条件检查的差异中获得的性能增益比我预期的要多,其中未融合的指令允许 GCC 将比较和跳转与最后一个向量加载和算术步骤交错?

MUCH LATER Wow, there has been some interesting update to the discussion here. I guess I was mistaken about how much of the pipeline is actually affected by micro op fusion. Maybe there is more perf gain than I expected from the the differences in the loop condition check, where the unfused instructions allow GCC to interleave the compare and jump with the last vector load and arithmetic steps?

vmovups ymm9, YMMWORD PTR [rax-32]
cmp esi, edx
vmulps  ymm0, ymm0, ymm9
vaddps  ymm1, ymm1, ymm0
jg  .L4

相关文章