Using AVX CPU instructions: poor performance without "/arch:AVX"

2021-12-14 00:00:00 performance c++ sse visual-studio-2010 avx

My C++ code uses SSE, and I now want to extend it to use AVX when it is available. I detect at run time whether AVX is available and, if so, call a function that uses AVX instructions. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.
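For reference, a minimal sketch of what such a run-time check might look like. The helper name cpu_supports_avx is hypothetical, and this is not necessarily how the original code does its detection. It assumes an x86 target and checks both the CPU feature flag and that the OS saves the YMM registers (CPUID leaf 1, ECX bits 27 and 28, then XCR0 bits 1 and 2):

```cpp
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv (available from VS2010 SP1)
#else
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper: true only if the CPU reports AVX *and* the OS
// has enabled saving/restoring of the YMM register state.
bool cpu_supports_avx()
{
    unsigned int ecx = 0;
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    ecx = static_cast<unsigned int>(regs[2]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return false;
    ecx = c;
#endif
    const bool osxsave = (ecx & (1u << 27)) != 0;  // OS uses XSAVE/XGETBV
    const bool avx     = (ecx & (1u << 28)) != 0;  // CPU implements AVX
    if (!osxsave || !avx)
        return false;

    // XCR0 bit 1 = XMM state, bit 2 = YMM state; both must be enabled by the OS.
#if defined(_MSC_VER)
    const unsigned long long xcr0 = _xgetbv(0);
#else
    unsigned int lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    const unsigned long long xcr0 =
        (static_cast<unsigned long long>(hi) << 32) | lo;
#endif
    return (xcr0 & 0x6) == 0x6;
}
```

Checking only the CPUID AVX bit is not enough: on an OS that does not save YMM state, executing AVX instructions would fault even though the CPU supports them.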

To use AVX, you need to include this header:

#include "immintrin.h"

You can then use AVX intrinsics such as _mm256_mul_ps, _mm256_add_ps, etc. The problem is that, by default, VS2010 produces code that runs very slowly and emits this warning:

warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX

It seems that VS2010 does not actually use AVX instructions but emulates them instead. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX instructions everywhere it can, so my code may crash on a CPU that does not support AVX!

So the question is: how can I make the VS2010 compiler produce AVX code only where I use AVX intrinsics directly? For SSE this works: I just use SSE intrinsics and the compiler produces SSE code without any options such as /arch:SSE. For AVX it does not work, for some reason.

Recommended answer

2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper(), even when compiling AVX intrinsics without /arch:AVX. VS2010 did.

The behavior you are seeing is the result of expensive state switching.

See page 102 of Agner Fog's microarchitecture manual:

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you pay an extremely high penalty of about 70 cycles.

When you compile without /arch:AVX, VS2010 generates SSE instructions for ordinary code but still emits AVX instructions wherever you use AVX intrinsics. You therefore get code that contains both SSE and AVX instructions, which incurs those state-switching penalties. (VS2010 knows this, so it emits the warning you're seeing.)

Therefore, you should use either all SSE or all AVX. Specifying /arch:AVX tells the compiler to use AVX everywhere.

It sounds like you're trying to provide multiple code paths: one for SSE and one for AVX. For this, I suggest separating your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and write a dispatcher that chooses between them based on the hardware it's running on.
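A minimal sketch of that layout, under some assumptions: the function names (add_arrays_sse, add_arrays_avx, select_add_arrays) and the file names in the comments are hypothetical, and the three parts are shown in one listing only for brevity. The AVX_TARGET macro exists solely so the single-file version also builds with GCC/Clang; in the split MSVC build it would not be needed, because the AVX file is compiled with /arch:AVX.

```cpp
#include <cstddef>
#include <immintrin.h>

// GCC/Clang need a per-function target attribute to compile AVX intrinsics
// without -mavx. MSVC does not, so the macro expands to nothing there.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// ---- would live in kernels_sse.cpp, compiled WITHOUT /arch:AVX ----
void add_arrays_sse(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i)  // scalar tail
        out[i] = a[i] + b[i];
}

// ---- would live in kernels_avx.cpp, compiled WITH /arch:AVX ----
AVX_TARGET
void add_arrays_avx(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i),
                                       _mm256_loadu_ps(b + i)));
    for (; i < n; ++i)  // scalar tail (VEX-encoded when built with /arch:AVX)
        out[i] = a[i] + b[i];
}

// ---- would live in dispatcher.cpp, compiled WITHOUT /arch:AVX ----
typedef void (*add_arrays_fn)(const float*, const float*, float*, std::size_t);

// cpu_has_avx is assumed to come from a CPUID/XGETBV check done once at startup.
add_arrays_fn select_add_arrays(bool cpu_has_avx)
{
    return cpu_has_avx ? add_arrays_avx : add_arrays_sse;
}
```

The caller would select the function pointer once and reuse it, so the AVX kernel is never executed on a CPU that lacks AVX, and no translation unit mixes compiler-generated SSE code with AVX intrinsics.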

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
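As an illustration, here is a sketch of an AVX kernel that clears the upper halves of the YMM registers before any non-AVX code runs. The function name mul_add_avx and the AVX_TARGET macro are hypothetical; the macro is there only so the listing also compiles with GCC/Clang, which insert vzeroupper on their own. With VS2010 and no /arch:AVX, the scalar tail would be compiled to legacy SSE instructions, which is why the call is placed before it.

```cpp
#include <cstddef>
#include <immintrin.h>

#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical kernel: out[i] = a[i] * b[i] + c[i]
AVX_TARGET
void mul_add_avx(const float* a, const float* b, const float* c,
                 float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        const __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(va, vb), vc));
    }

    // Zero the upper 128 bits of every YMM register so that the legacy
    // SSE code that follows (here and in the caller) does not trigger
    // the AVX/SSE state transition.
    _mm256_zeroupper();

    for (; i < n; ++i)  // scalar tail, possibly compiled to SSE instructions
        out[i] = a[i] * b[i] + c[i];
}
```

_mm256_zeroupper() is cheap compared with the transition penalty, so calling it once at the end of each AVX region is usually a reasonable default.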
