SSE reduction of a float vector

2022-01-09 00:00:00 simd sum c++ reduction sse

How can I get the sum of the elements (a reduction) of a float vector using SSE intrinsics?

A simple serial version:

    // "sum" is a placeholder name
    void sum(const float *input, float &result, unsigned int NumElems)
    {
        result = 0;
        for (unsigned int i = 0; i < NumElems; ++i)
            result += input[i];
    }

Recommended answer

Typically you accumulate 4 partial sums in the loop and then sum horizontally across those 4 elements after the loop, e.g.

    #include <cassert>
    #include <cstdint>
    #include <pmmintrin.h>  // SSE3, needed for _mm_hadd_ps

    float vsum(const float *a, int n)
    {
        float sum;
        __m128 vsum = _mm_set1_ps(0.0f);    // 4 partial sums

        assert((n & 3) == 0);               // n is a multiple of 4
        assert(((uintptr_t)a & 15) == 0);   // a is 16-byte aligned

        for (int i = 0; i < n; i += 4)
        {
            __m128 v = _mm_load_ps(&a[i]);  // aligned load of 4 floats
            vsum = _mm_add_ps(vsum, v);
        }

        vsum = _mm_hadd_ps(vsum, vsum);     // two horizontal adds leave
        vsum = _mm_hadd_ps(vsum, vsum);     // the total in every lane
        _mm_store_ss(&sum, vsum);
        return sum;
    }

Note: for the above example, a must be 16-byte aligned and n must be a multiple of 4. If the alignment of a cannot be guaranteed, use _mm_loadu_ps instead of _mm_load_ps. If n is not guaranteed to be a multiple of 4, add a scalar loop at the end of the function to accumulate the remaining elements.
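
The two adjustments in the note can be sketched together as follows. This is only an illustration under a few assumptions: the name vsum_any is made up, the length is taken as std::size_t, and the horizontal sum is done with SSE1 shuffles (_mm_movehl_ps and _mm_shuffle_ps) rather than _mm_hadd_ps, so the sketch needs nothing beyond <xmmintrin.h>.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics only

// Hypothetical helper: sums n floats with no alignment or length requirement.
float vsum_any(const float *a, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();   // four running partial sums
    std::size_t i = 0;

    // Main loop: four floats per iteration, using unaligned loads.
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));

    // Horizontal sum of the four lanes (s0, s1, s2, s3) of acc.
    __m128 hi   = _mm_movehl_ps(acc, acc);   // (s2, s3, s2, s3)
    __m128 pair = _mm_add_ps(acc, hi);       // lane 0 = s0+s2, lane 1 = s1+s3
    __m128 odd  = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1)); // lane 1 in every lane
    float sum   = _mm_cvtss_f32(_mm_add_ss(pair, odd));

    // Scalar tail: the 0 to 3 elements left over when n is not a multiple of 4.
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

Because the additions happen in a different order than in the serial loop, the result can differ from the serial sum in the last few bits for general floating-point input.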
