使用 const 非整数指数优化 pow()?

2021-12-20 00:00:00 math optimization c++ avx exponent

我的代码中有热点，我在执行 pow() 的地方占用了我大约 10-20% 的执行时间.

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.

我对 pow(x,y) 的输入非常具体，所以我想知道是否有办法滚动两个 pow() 近似值(每个近似值一个指数)具有更高的性能:

My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:

我有两个常数指数:2.4 和 1/2.4.
当指数为 2.4 时，x 将在 (0.090473935, 1.0] 范围内.
当指数为 1/2.4 时，x 将在 (0.0031308, 1.0] 范围内.
我正在使用 SSE/AVX float 向量.如果可以利用平台特性，就直接使用！

I have two constant exponents: 2.4 and 1/2.4.

When the exponent is 2.4, x will be in the range (0.090473935, 1.0].

When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].

I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!

大约 0.01% 的最大错误率是理想的，尽管我也对全精度(对于 float)算法感兴趣.

A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.

我已经在使用快速 pow() 近似，但它没有考虑这些约束.有没有可能做得更好?

I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?

推荐答案

在 IEEE 754 黑客的脉络中，这里是另一种更快、更神奇"的解决方案.它在大约十几个时钟周期内实现了 0.08% 的误差容限(对于 p=2.4 的情况，在 Intel Merom CPU 上).

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).

浮点数最初是作为对数的近似值发明的，因此您可以使用整数值作为 log2 的近似值.通过将从整数转换指令应用于浮点值以获得另一个浮点值，这在某种程度上是可移植的.

Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.

要完成pow 计算，您可以乘以一个常数因子，然后使用转换为整数指令将对数转换回.在SSE上，相关指令为cvtdq2ps和cvtps2dq.

To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.

不过，事情并没有那么简单.IEEE 754 中的指数字段是有符号的，偏置值为 127 表示指数为零.在乘以对数之前必须消除这种偏差，并在取幂之前重新添加.此外，通过减法进行的偏差调整不适用于零.幸运的是，这两种调整都可以通过事先乘以一个常数因子来实现.

It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.

x^p = exp2( p * log2( x ) ) = exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 ) = cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) ) = cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) = cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) ) = cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )

exp2( 127/p - 127 ) 是常数因子.这个函数相当专业:它不适用于小分数指数，因为常数因子随着指数的倒数呈指数增长并且会溢出.它不适用于负指数.大指数导致高误差，因为尾数位通过乘法与指数位混合.

exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.

但是，它只有 4 条快速指令.预乘，从整数"(到对数)转换，乘幂，转换到整数"(从对数).这种 SSE 实现的转换速度非常快.我们也可以在第一次乘法中加入一个额外的常数系数.

But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg ", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg ", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg ", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg ", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg ", ret ); return ret; }

指数 = 2.4 的一些试验表明，这始终高估了约 5%.(例程总是保证高估.)您可以简单地乘以 0.95，但多一些指令将使我们获得大约 4 个十进制数字的准确度，这对于图形来说应该足够了.

A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.

关键是将高估与低估相匹配，并取平均值.

The key is to match the overestimate with an underestimate, and take the average.

计算 x^0.8:4 条指令，误差 ~ +3%.
计算 x^-0.4:一个 rsqrtps.(这已经足够准确了，但确实牺牲了处理零的能力.)
计算 x^0.4:一个 mulps.
计算 x^-0.2:一个 rsqrtps.
计算 x^2:一个 mulps.
计算 x^3:一个 mulps.
x^2.4 = x^2 * x^0.4:一个 mulps.这是高估了.
x^2.4 = x^3 * x^-0.4 * x^-0.2:两个 mulps.这是低估了.
平均以上:一个addps，一个mulps.

Compute x^0.8: four instructions, error ~ +3%.

Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)

Compute x^0.4: one mulps.

Compute x^-0.2: one rsqrtps.

Compute x^2: one mulps.

Compute x^3: one mulps.

x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.

x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.

Average the above: one addps, one mulps.

指令计数:14 个，包括两个延迟 = 5 的转换和两个吞吐量 = 4 的倒数平方根估计.

Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.

为了正确取平均值，我们希望根据预期误差对估计值进行加权.低估将误差提高到 0.6 对 0.4 的幂，因此我们预计它是错误的 1.5 倍.加权不添加任何说明；它可以在前置因子中完成.调用系数a:a^0.5 = 1.5 a^-0.75，a = 1.38316186.

To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.

最终误差约为 0.015%，比初始 fastpow 结果好 2 个数量级.对于带有 volatile 源和目标变量的繁忙循环，运行时间大约为十几个周期……尽管它与迭代重叠，但实际使用中也会看到指令级并行性.考虑到 SIMD，这是每 3 个周期一个标量结果的吞吐量！

The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!

int main() { __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 ); std::printf( "Input: %,vg ", x0 ); // Approx 5% accuracy from one call. Always an overestimate. __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 ); std::printf( "Direct x^2.4: %,vg ", x1 ); // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 ); std::printf( "1.38 x^0.8: %,vg ", xf ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( x0, x0 ); __m128 x3 = _mm_mul_ps( x2, x0 ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); std::printf( "x^2 * x^0.4: %,vg ", x2 ); std::printf( "x^3 / x^0.6: %,vg ", x3 ); x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) ); // Final accuracy about 0.015%, 200x better than x^0.8 calculation. std::printf( "average = %,vg ", x2 ); }

好吧……抱歉我没能早点发布这个.并将其扩展到 x^1/2.4 作为练习 ;v) .

Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .

我实现了一个小测试工具和两个 x^(⁵?₁₂) 对应于上面的案例.

I implemented a little test harness and two x^(⁵?₁₂) cases corresponding to the above.

#include <cstdio> #include <xmmintrin.h> #include <cmath> #include <cfloat> #include <algorithm> using namespace std; template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg ", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg ", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg ", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg ", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg ", ret ); return ret; } __m128 pow125_4( __m128 arg ) { // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( arg, arg ); __m128 x3 = _mm_mul_ps( x2, arg ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) ); } __m128 pow512_2( __m128 arg ) { // 5/12 is too small, so compute the sqrt of 10/12 instead. __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg ); return _mm_mul_ps( _mm_rsqrt_ps( x ), x ); } __m128 pow512_4( __m128 arg ) { // 5/12 is too small, so compute the 4th root of 20/12 instead. // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow. // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3 __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg ); __m128 xover = _mm_mul_ps( arg, xf ); __m128 xfm1 = _mm_rsqrt_ps( xf ); __m128 x2 = _mm_mul_ps( arg, arg ); __m128 xunder = _mm_mul_ps( x2, xfm1 ); // sqrt2 * over + 2 * sqrt2 * under __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ), _mm_add_ps( xover, xunder ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); return xavg; } __m128 mm_succ_ps( __m128 arg ) { return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) ); } void test_pow( double p, __m128 (*f)( __m128 ) ) { __m128 arg; for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON ); ! isfinite( _mm_cvtss_f32( f( arg ) ) ); arg = mm_succ_ps( arg ) ) ; for ( ; _mm_cvtss_f32( f( arg ) ) == 0; arg = mm_succ_ps( arg ) ) ; std::printf( "Domain from %g ", _mm_cvtss_f32( arg ) ); int n; int const bucket_size = 1 << 25; do { float max_error = 0; double total_error = 0, cum_error = 0; for ( n = 0; n != bucket_size; ++ n ) { float result = _mm_cvtss_f32( f( arg ) ); if ( ! isfinite( result ) ) break; float actual = ::powf( _mm_cvtss_f32( arg ), p ); float error = ( result - actual ) / actual; cum_error += error; error = std::abs( error ); max_error = std::max( max_error, error ); total_error += error; arg = mm_succ_ps( arg ); } std::printf( "error max = %8g " "avg = %8g " "|avg| = %8g " "to %8g ", max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) ); } while ( n == bucket_size ); } int main() { std::printf( "4 insn x^12/5: " ); test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > ); std::printf( "14 insn x^12/5: " ); test_pow( 12./5, & pow125_4 ); std::printf( "6 insn x^5/12: " ); test_pow( 5./12, & pow512_2 ); std::printf( "14 insn x^5/12: " ); test_pow( 5./12, & pow512_4 ); }

输出:

4 insn x^12/5: Domain from 1.36909e-23 error max = inf avg = inf |avg| = inf to 8.97249e-19 error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14 error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09 error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553 error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513 error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06 error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10 error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15 error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16 14 insn x^12/5: Domain from 1.42795e-19 error max = inf avg = nan |avg| = nan to 9.35823e-15 error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10 error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05 error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411 error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629 error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10 error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12 6 insn x^5/12: Domain from 9.86076e-32 error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27 error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22 error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17 error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12 error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07 error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125 error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512 error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07 error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12 error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17 error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21 error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26 error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31 error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36 error max = 0.999329 avg = nan |avg| = nan to -0 14 insn x^5/12: Domain from 2.64698e-23 error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18 error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13 error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09 error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281 error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32 error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06 error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11 error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15 error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19

我怀疑更准确的 5/12 的准确性受到 rsqrt 操作的限制.

I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.

相关文章