Speeding up a short-to-float conversion?

2022-01-13 00:00:00 type-conversion c++ sse x86

I have a short-to-float cast in C++ that is bottlenecking my code.

The code translates from a hardware device buffer that is natively shorts; the buffer holds the input from a fancy photon counter.

    float factor = 1.0f / value;
    for (int i = 0; i < W*H; i++) // 25% of time is spent doing this
    {
        int value = source[i];          // ushort -> int
        destination[i] = value*factor;  // int*float -> float
    }

Some details

Values range from 0 to 2^16-1; they represent the pixel values of a highly sensitive camera.

I'm on a multicore x86 machine with an i7 processor (i7 960, which supports SSE 4.1 and 4.2).

Source is aligned to an 8 bit boundary (a requirement of the hardware device).

W*H is always divisible by 8; most of the time W and H are each divisible by 8.
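The alignment detail matters for SIMD code, because the aligned 128-bit load `_mm_load_si128` requires a 16-byte-aligned address. A minimal sketch of checking at run time what alignment a buffer actually has; `is_aligned` is a hypothetical helper, not something from the original post:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (an assumption for illustration): returns true when
// `p` lies on an `alignment`-byte boundary. `alignment` must be non-zero.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A caller could use such a check to choose between an aligned and an unaligned load path.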

This makes me sad. Is there anything I can do about it?

I am using Visual Studio 2012...

Recommended answer

Here's a basic SSE4.1 implementation:

    __m128 factor = _mm_set1_ps(1.0f / value);
    for (int i = 0; i < W*H; i += 8)
    {
        // Load 8 16-bit ushorts.
        // vi = {a,b,c,d,e,f,g,h}
        __m128i vi = _mm_load_si128((const __m128i*)(source + i));

        // Convert to 32-bit integers
        // vi0 = {a,0,b,0,c,0,d,0}
        // vi1 = {e,0,f,0,g,0,h,0}
        __m128i vi0 = _mm_cvtepu16_epi32(vi);
        __m128i vi1 = _mm_cvtepu16_epi32(_mm_unpackhi_epi64(vi,vi));

        // Convert to float
        __m128 vf0 = _mm_cvtepi32_ps(vi0);
        __m128 vf1 = _mm_cvtepi32_ps(vi1);

        // Multiply
        vf0 = _mm_mul_ps(vf0,factor);
        vf1 = _mm_mul_ps(vf1,factor);

        // Store
        _mm_store_ps(destination + i + 0,vf0);
        _mm_store_ps(destination + i + 4,vf1);
    }

Assumptions:

source and destination are both aligned to 16 bytes.

W*H is a multiple of 8.
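If either assumption cannot be guaranteed, one option is to use unaligned loads and stores and finish with a scalar tail loop. The following is a sketch under stated assumptions, not the answer's code: `convert_unaligned` is a hypothetical name, and it swaps `_mm_cvtepu16_epi32` for the SSE2 unpack-against-zero idiom so that it compiles without SSE4.1 enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical variant (an assumption for illustration): accepts any
// alignment and any size. Unaligned loads/stores handle the bulk of the
// data, and a scalar loop handles the last (size % 8) elements.
void convert_unaligned(float* destination, const unsigned short* source,
                       float value, std::size_t size) {
    const float scale = 1.0f / value;
    const __m128 factor = _mm_set1_ps(scale);
    const __m128i zero = _mm_setzero_si128();

    std::size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        // Load 8 ushorts with no alignment requirement.
        __m128i vi = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(source + i));

        // Zero-extend to 32 bits by interleaving with zero (SSE2 only).
        __m128i lo = _mm_unpacklo_epi16(vi, zero);  // elements 0..3
        __m128i hi = _mm_unpackhi_epi16(vi, zero);  // elements 4..7

        // Convert to float, scale, and store without alignment requirement.
        _mm_storeu_ps(destination + i,
                      _mm_mul_ps(_mm_cvtepi32_ps(lo), factor));
        _mm_storeu_ps(destination + i + 4,
                      _mm_mul_ps(_mm_cvtepi32_ps(hi), factor));
    }

    // Scalar tail for the remaining elements.
    for (; i < size; ++i) {
        destination[i] = source[i] * scale;
    }
}
```

Whether the unaligned path costs anything measurable depends on the CPU, so it is worth timing both variants on the target machine.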

It's possible to do better by further unrolling this loop. (see below)

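The idea here is as follows:

Load 8 shorts into a single SSE register.
Split the register into two: one with the bottom 4 shorts, the other with the top 4 shorts.
Zero-extend both registers into 32-bit integers.
Convert them both to floats.
Multiply by the factor.
Store them into destination.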
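The zero-extension step matters because the counts are unsigned 16-bit values: with sign extension (what `_mm_cvtepi16_epi32` does, as opposed to `_mm_cvtepu16_epi32`), any pixel value of 32768 or more would come out negative. A scalar sketch of the difference, with hypothetical function names chosen for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of zero extension: widen a 16-bit pattern to 32 bits
// while keeping its unsigned value.
inline std::int32_t zero_extend(std::uint16_t bits) {
    return static_cast<std::int32_t>(bits);
}

// For contrast: reinterpret the same bits as a signed short first, then
// widen (sign extension). The narrowing cast is implementation-defined
// before C++20, but wraps modulo 2^16 on mainstream compilers.
inline std::int32_t sign_extend(std::uint16_t bits) {
    return static_cast<std::int32_t>(static_cast<std::int16_t>(bits));
}
```

For values below 32768 the two agree, which is why a buffer typed as `short` can appear to work until a bright pixel shows up.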


It's been a while since I've done this type of optimization, so I went ahead and unrolled the loops.

Core i7 920 @ 3.5 GHz
Visual Studio 2012 - Release x64:

    Original Loop      : 4.374 seconds
    Vectorize no unroll: 1.665
    Vectorize unroll 2 : 1.416

Further unrolling resulted in diminishing returns.
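To reproduce comparisons like this on other hardware, a small timing helper keeps the measurement code in one place. This is only a sketch: `time_seconds` is a hypothetical helper, and it uses `std::chrono::steady_clock` (a monotonic clock from the standard library) in place of `clock()`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (an assumption for illustration): calls `fn`
// `iterations` times and returns the elapsed wall-clock time in seconds.
template <typename Fn>
double time_seconds(Fn&& fn, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        fn();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

It could be called with a lambda that wraps one of the conversion loops, for example `time_seconds([&]{ /* run one conversion pass */ }, 1000000)`.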

Here's the test code:

    #include <smmintrin.h>
    #include <time.h>
    #include <iostream>
    #include <malloc.h>
    #include <cstdlib>   // system()
    using namespace std;

    void default_loop(float *destination,const short* source,float value,int size){
        float factor = 1.0f / value;
        for (int i = 0; i < size; i++)
        {
            int value = source[i];
            destination[i] = value*factor;
        }
    }

    void vectorize8_unroll1(float *destination,const short* source,float value,int size){
        __m128 factor = _mm_set1_ps(1.0f / value);
        for (int i = 0; i < size; i += 8)
        {
            // Load 8 16-bit ushorts.
            __m128i vi = _mm_load_si128((const __m128i*)(source + i));

            // Convert to 32-bit integers
            __m128i vi0 = _mm_cvtepu16_epi32(vi);
            __m128i vi1 = _mm_cvtepu16_epi32(_mm_unpackhi_epi64(vi,vi));

            // Convert to float
            __m128 vf0 = _mm_cvtepi32_ps(vi0);
            __m128 vf1 = _mm_cvtepi32_ps(vi1);

            // Multiply
            vf0 = _mm_mul_ps(vf0,factor);
            vf1 = _mm_mul_ps(vf1,factor);

            // Store
            _mm_store_ps(destination + i + 0,vf0);
            _mm_store_ps(destination + i + 4,vf1);
        }
    }

    void vectorize8_unroll2(float *destination,const short* source,float value,int size){
        __m128 factor = _mm_set1_ps(1.0f / value);
        for (int i = 0; i < size; i += 16)
        {
            __m128i a0 = _mm_load_si128((const __m128i*)(source + i + 0));
            __m128i a1 = _mm_load_si128((const __m128i*)(source + i + 8));

            // Split into two registers
            __m128i b0 = _mm_unpackhi_epi64(a0,a0);
            __m128i b1 = _mm_unpackhi_epi64(a1,a1);

            // Convert to 32-bit integers
            a0 = _mm_cvtepu16_epi32(a0);
            b0 = _mm_cvtepu16_epi32(b0);
            a1 = _mm_cvtepu16_epi32(a1);
            b1 = _mm_cvtepu16_epi32(b1);

            // Convert to float
            __m128 c0 = _mm_cvtepi32_ps(a0);
            __m128 d0 = _mm_cvtepi32_ps(b0);
            __m128 c1 = _mm_cvtepi32_ps(a1);
            __m128 d1 = _mm_cvtepi32_ps(b1);

            // Multiply
            c0 = _mm_mul_ps(c0,factor);
            d0 = _mm_mul_ps(d0,factor);
            c1 = _mm_mul_ps(c1,factor);
            d1 = _mm_mul_ps(d1,factor);

            // Store
            _mm_store_ps(destination + i +  0,c0);
            _mm_store_ps(destination + i +  4,d0);
            _mm_store_ps(destination + i +  8,c1);
            _mm_store_ps(destination + i + 12,d1);
        }
    }

    void print_sum(const float *destination,int size){
        float sum = 0;
        for (int i = 0; i < size; i++){
            sum += destination[i];
        }
        cout << sum << endl;
    }

    int main(){
        int size = 8000;

        short *source      = (short*)_mm_malloc(size * sizeof(short), 16);
        float *destination = (float*)_mm_malloc(size * sizeof(float), 16);

        for (int i = 0; i < size; i++){
            source[i] = i;
        }

        float value = 1.1;

        int iterations = 1000000;
        clock_t start;

        // Default Loop
        start = clock();
        for (int it = 0; it < iterations; it++){
            default_loop(destination,source,value,size);
        }
        cout << (double)(clock() - start) / CLOCKS_PER_SEC << endl;
        print_sum(destination,size);

        // Vectorize 8, no unroll
        start = clock();
        for (int it = 0; it < iterations; it++){
            vectorize8_unroll1(destination,source,value,size);
        }
        cout << (double)(clock() - start) / CLOCKS_PER_SEC << endl;
        print_sum(destination,size);

        // Vectorize 8, unroll 2
        start = clock();
        for (int it = 0; it < iterations; it++){
            vectorize8_unroll2(destination,source,value,size);
        }
        cout << (double)(clock() - start) / CLOCKS_PER_SEC << endl;
        print_sum(destination,size);

        _mm_free(source);
        _mm_free(destination);

        system("pause");
    }
