_mm256_load_ps causes a segmentation fault with google/benchmark in debug mode
- The following code runs in both release mode and debug mode.
#include <immintrin.h>
constexpr int n_batch = 10240;
constexpr int n = n_batch * 8;
#pragma pack(32)
float a[n];
float b[n];
float c[n];
#pragma pack()
int main() {
    for(int i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
    for(int i = 0; i < n; i += 4) {
        __m128 av = _mm_load_ps(a + i);
        __m128 bv = _mm_load_ps(b + i);
        __m128 cv = _mm_mul_ps(av, bv);
        _mm_store_ps(c + i, cv);
    }
    for(int i = 0; i < n; i += 8) {
        __m256 av = _mm256_load_ps(a + i);
        __m256 bv = _mm256_load_ps(b + i);
        __m256 cv = _mm256_mul_ps(av, bv);
        _mm256_store_ps(c + i, cv);
    }
}
- The following code runs only in release mode; in debug mode it crashes with a segmentation fault.
#include <immintrin.h>
#include "benchmark/benchmark.h"
constexpr int n_batch = 10240;
constexpr int n = n_batch * 8;
#pragma pack(32)
float a[n];
float b[n];
float c[n];
#pragma pack()
static void BM_Scalar(benchmark::State &state) {
    for(auto _: state)
        for(int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
}
BENCHMARK(BM_Scalar);
static void BM_Packet_4(benchmark::State &state) {
    for(auto _: state) {
        for(int i = 0; i < n; i += 4) {
            __m128 av = _mm_load_ps(a + i);
            __m128 bv = _mm_load_ps(b + i);
            __m128 cv = _mm_mul_ps(av, bv);
            _mm_store_ps(c + i, cv);
        }
    }
}
BENCHMARK(BM_Packet_4);
static void BM_Packet_8(benchmark::State &state) {
    for(auto _: state) {
        for(int i = 0; i < n; i += 8) {
            __m256 av = _mm256_load_ps(a + i); // Signal: SIGSEGV (signal SIGSEGV: invalid address (fault address: 0x0))
            __m256 bv = _mm256_load_ps(b + i);
            __m256 cv = _mm256_mul_ps(av, bv);
            _mm256_store_ps(c + i, cv);
        }
    }
}
BENCHMARK(BM_Packet_8);
BENCHMARK_MAIN();
Recommended answer
Your arrays aren't aligned by 32. You can check this with a debugger.
#pragma pack(32) only controls the packing alignment of struct/union/class members, as documented by MS. C++ arrays are a different kind of object and aren't affected at all by that MSVC pragma. (I think you're actually using GCC's or clang's version of it, though, because MSVC generally uses vmovups, not vmovaps.)
For arrays in static or automatic storage (not dynamically allocated), the easiest way to align arrays in C++11 and later is alignas(32). That's fully portable, unlike GNU C __attribute__((aligned(32))) or whatever MSVC's equivalent is.
alignas(32) float a[n];
alignas(32) float b[n];
alignas(32) float c[n];
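If you would rather confirm the alignment in code than in a debugger, something like the following should work. It is only a sketch: is_aligned and assert_arrays_aligned are hypothetical helper names, not part of any library, and the array size is copied from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int n = 10240 * 8;

// alignas(32) requests that each array start on a 32-byte boundary.
alignas(32) float a[n];
alignas(32) float b[n];
alignas(32) float c[n];

// Hypothetical helper: true if the address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: call once before using _mm256_load_ps / _mm256_store_ps
// on these arrays. The checks are active only when NDEBUG is not defined,
// i.e. typically in debug builds.
inline void assert_arrays_aligned() {
    assert(is_aligned(a, 32));
    assert(is_aligned(b, 32));
    assert(is_aligned(c, 32));
}
```

Calling assert_arrays_aligned() at the top of main (or at the start of each benchmark function) turns a confusing SIGSEGV inside the loop into an assertion failure that names the misaligned array.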
The related question "AVX: data alignment: store crash, storeu, load, loadu doesn't" explains why the behavior depends on optimization level: optimized code will fold one load into a memory source operand for vmulps, which (unlike SSE) doesn't require alignment. (Presumably the first array happens to be sufficiently aligned.)
Un-optimized code will do the _mm256_load_ps separately, with a vmovaps alignment-required load.
(_mm256_loadu_ps will always avoid using alignment-required loads, so use that if you can't guarantee your data is aligned.)
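As a rough illustration of the unaligned variant, the sketch below multiplies two float ranges with _mm256_loadu_ps / _mm256_storeu_ps, so none of the pointers needs to be 32-byte aligned. The function name multiply_unaligned is made up for this example. The AVX path is compiled only when the compiler defines __AVX__ (e.g. with -mavx on GCC/clang); otherwise the scalar loop handles everything.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: out[i] = x[i] * y[i] for i in [0, count).
// No alignment is assumed for x, y, or out.
void multiply_unaligned(const float* x, const float* y, float* out,
                        std::size_t count) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration with unaligned loads and stores.
    for (; i + 8 <= count; i += 8) {
        __m256 xv = _mm256_loadu_ps(x + i);
        __m256 yv = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(xv, yv));
    }
#endif
    // Scalar tail (or the whole range when AVX is not enabled).
    for (; i < count; ++i)
        out[i] = x[i] * y[i];
}
```

On data that does happen to be aligned, the unaligned instructions generally cost little or nothing extra on recent CPUs, so this is a reasonable default when the alignment of the input is outside your control.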