如何使用 SSE/AVX 高效地执行 double/int64 转换?

2022-01-09 00:00:00 floating-point simd c++ sse avx

SSE2 有在单精度浮点数和 32 位整数之间转换向量的指令.

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.

  • _mm_cvtps_epi32()
  • _mm_cvtepi32_ps()

但是没有双精度和 64 位整数的等价物.换句话说,它们不见了:

But there are no equivalents for double-precision and 64-bit integers. In other words, they are missing:

  • _mm_cvtpd_epi64()
  • _mm_cvtepi64_pd()

AVX 好像也没有.

模拟这些内在函数的最有效方法是什么?

What is the most efficient way to simulate these intrinsics?

推荐答案

在 AVX512 之前没有单一指令,它添加了与 64 位整数(有符号或无符号)的转换.(还支持与 32 位无符号的转换).查看像 _mm512_cvtpd_epi64 这样的内在函数和更窄的 AVX512VL 版本,如 _mm256_cvtpd_epi64.

There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.

如果您只有 AVX2 或更少,则需要以下技巧来进行打包转换.(对于标量,x86-64 具有来自 SSE2 的标量 int64_t <-> double 或 float,但标量 uint64_t <-> FP 需要技巧,直到 AVX512 添加无符号转换.标量 32 位无符号可以通过零扩展到 64 来完成-位签名.)

If you only have AVX2 or less, you'll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)

如果你愿意偷工减料,double <->int64 转换只需两条指令即可完成:

If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:

  • 如果您不关心无穷大或 NaN.
  • 对于双<->int64_t,你只关心 [-2^51, 2^51] 范围内的值.
  • 对于双<->uint64_t,你只关心 [0, 2^52) 范围内的值.
  • If you don't care about infinity or NaN.
  • For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
  • For double <-> uint64_t, you only care about values in the range [0, 2^52).

double -> uint64_t

//  Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

double -> int64_t

//  Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}

uint64_t -> 双

//  Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
    x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}

int64_t -> 双倍

//  Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}

<小时>

舍入行为:

  • 对于 double ->uint64_t 转换,根据当前的舍入模式,舍入工作正常.(通常是四舍五入)
  • 对于 double ->int64_t 转换,除截断外的所有模式,舍入将遵循当前舍入模式.如果当前的舍入模式是截断(向零舍入),它实际上会向负无穷舍入.
  • For the double -> uint64_t conversion, rounding works correctly following the current rounding mode. (which is usually round-to-even)
  • For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.

它是如何工作的?

尽管这个技巧只有 2 条指令,但并不完全不言自明.

Despite this trick being only 2 instructions, it's not entirely self-explanatory.

关键是要认识到,对于双精度浮点,[2^52, 2^53) 范围内的值的二进制位"刚好低于尾数.换句话说,如果将指数位和符号位清零,尾数就变成了整数表示.

The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.

double 转换 x ->uint64_t,添加幻数M,它是2^52的浮点值.这会将 x 置于 [2^52, 2^53) 的标准化"范围内,并方便地舍入小数部分.

To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

现在剩下的就是删除高 12 位.这很容易通过掩盖它来完成.最快的方法是识别那些高 12 位与 M 的相同.因此,我们可以简单地通过 M 进行减法或异或,而不是引入额外的掩码常数.XOR 具有更高的吞吐量.

Now all that's left is to remove the upper 12 bits. This is easily done by masking it out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.

uint64_t 转换 ->double 只是这个过程的逆过程.您将 M 的指数位加回.然后通过在浮点数中减去 M 来取消归一化数字.

Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.

有符号整数转换稍微复杂一些,因为您需要处理 2 的补码符号扩展.我会把这些留给读者作为练习.

The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.

相关: 解释了将双精度数舍入为 32 位整数的快速方法

全范围 int64 -> 双精度:

多年后,我终于需要这个了.

After many years, I finally had a need for this.

  • uint64_t 的 5 条指令->双
  • int64_t -> 的 6 条指令双

uint64_t -> 双

__m128d uint64_to_double_full(__m128i x){
    __m128i xH = _mm_srli_epi64(x, 32);
    xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

int64_t -> 双倍

__m128d int64_to_double_full(__m128i x){
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

这些适用于整个 64 位范围,并正确舍入为当前舍入行为.

These work for the entire 64-bit range and are correctly rounded to the current rounding behavior.

这些是下面类似的 wim 的答案 - 但有更多的滥用优化.因此,解读这些内容也将留给读者作为练习.

These are similar wim's answer below - but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.

相关文章