AVX2中log2(__m256d)的高效实现

2021-12-06 00:00:00 algorithm floating-point logarithm c++ avx2

SVML 的 __m256d _mm256_log2_pd (__m256d a) 在 Intel 之外的其他编译器上不可用,他们说它的性能在 AMD 处理器上有缺陷.AVX 中提到了互联网上的一些实现g++-4.8 中缺少日志内在函数 (_mm256_log_ps)? 和 SIMD 数学SSE 和 AVX 的库,但是它们似乎比 AVX2 更 SSE.还有 Agner Fog 的矢量库 ,但它是一个大型图书馆,拥有更多的东西,只是矢量 log2,所以从它的实现来看,很难找出向量 log2 操作的基本部分.

那么有人能解释一下如何有效地为 4 个 double 数字的向量实现 log2() 操作吗?IE.就像 __m256d _mm256_log2_pd (__m256d a) 所做的那样,但可用于其他编译器,并且对 AMD 和 Intel 处理器都相当有效.

在我当前的特定情况下,数字是 0 到 1 之间的概率,对数用于熵计算:P[i] 的所有 i 的求和的否定*日志(P[i]).P[i] 的浮点指数范围很大,因此数字可能接近 0.我不确定准确性,因此会考虑任何以 30 位尾数开头的解决方案,尤其是可调整的解决方案是首选.

这是我到目前为止的实现,基于来自 https:/的更高效系列"/en.wikipedia.org/wiki/Logarithm#Power_series .如何改进?(性能和准确性都需要改进)

命名空间{const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);const __m128i gExpNormalizer = _mm_set1_epi32(1023);//TODO:这里是一些 128 位变量还是两个 64 位变量?const __m256d gCommMul = _mm256_set1_pd(2.0/0.693147180559945309417);//2.0/ln(2)const __m256d gCoeff1 = _mm256_set1_pd(1.0/3);const __m256d gCoeff2 = _mm256_set1_pd(1.0/5);const __m256d gCoeff3 = _mm256_set1_pd(1.0/7);const __m256d gCoeff4 = _mm256_set1_pd(1.0/9);const __m256d gVect1 = _mm256_set1_pd(1.0);}__m256d __vectorcall Log2(__m256d x) {const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);const __m256d expsPD = _mm256_cvtepi32_pd(normExps);const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),_mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));//计算 t=(y-1)/(y+1) 和 t**2const __m256d tNum = _mm256_sub_pd(y, gVect1);const __m256d tDen = _mm256_add_pd(y, gVect1);const __m256d t = _mm256_div_pd(tNum, tDen);const __m256d t2 = _mm256_mul_pd(t, t);//t**2const __m256d t3 = _mm256_mul_pd(t, t2);//t**3const __m256d term01 = _mm256_fmadd_pd(gCoeff1, t3, t);const __m256d t5 = _mm256_mul_pd(t3, t2);//t**5const __m256d term012 = _mm256_fmadd_pd(gCoeff2,t5,terms01);const __m256d t7 = _mm256_mul_pd(t5, t2);//t**7const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3,t7,terms012);const __m256d t9 = _mm256_mul_pd(t7, t2);//t**9const __m256d 条款01234 = _mm256_fmadd_pd(gCoeff4,t9,条款0123);const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);返回 log2_x;}

到目前为止,我的实现提供了每秒 405 268 490 次操作,并且似乎精确到第 8 位.性能通过以下函数来衡量:

#include #include <cmath>#include #include //... Log2() 在这里实现const int64_t cnLogs = 100 * 1000 * 1000;void BenchmarkLog2Vect() {__m256d 总和 = _mm256_setzero_pd();自动启动 = std::chrono::high_resolution_clock::now();for (int64_t i = 1; i <= cnLogs; i += 4) {const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));const __m256d 日志 = Log2(x);总和 = _mm256_add_pd(总和,日志);}自动消逝 = std::chrono::high_resolution_clock::now() - 开始;double nSec = 1e-6 * std::chrono::duration_cast(elapsed).count();双总和 = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];printf("Vect Log2: %.3lf Ops/sec 计算出 %.3lf
", cnLogs/nSec, sum);}

对比

解决方案

最后是我最好的结果,它在 Ryzen 1800X @3.6GHz 上每秒给出大约 8 亿个对数(每个 4 个对数的 2 亿个向量)单线程,直到尾数的最后几位都是准确的.剧透:看看到底如何将性能提高到每秒 8.7 亿对数.

特殊情况:负数、负无穷大和带有负符号位的 NaN 被视为非常接近 0(导致一些垃圾大的负对数"值).正无穷大和带有正符号位的 NaN 产生大约 1024 的对数.如果您不喜欢特殊情况的处理方式,一种选择是添加代码来检查它们并执行适合您的操作更好的.这会使计算速度变慢.

命名空间{//限制为 19,因为我们只处理双精度的高 32 位,并且超出//那里有 20 位尾数,1 位用于四舍五入.constexpr uint8_t cnLog2TblBits = 10;//1024 个数字乘以 8 个字节 = 8KB.constexpr uint16_t cZeroExp = 1023;const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));const __m256i cAvxExp2YMask = _mm256_set1_epi64x(~((1ULL << (52-cnLog2TblBits)) - 1) );const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(1ULL<<(52 - cnLog2TblBits - 1)));const __m256d gCommMul1 = _mm256_set1_pd(2.0/0.693147180559945309417);//2.0/ln(2)const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);const __m128i gExpNorm0 = _mm_set1_epi32(1023);//加上|cnLog2TblBits|第一个最高尾数位双 gPlusLog2Table[1 <<cnLog2TblBits];}//匿名命名空间void InitLog2Table() {for(uint32_t i=0; i<(1<<cnLog2TblBits); i++) {const uint64_t iZp = (uint64_t(cZeroExp) << 52)|(uint64_t(i) << (52 - cnLog2TblBits)) |(1ULL << (52 - cnLog2TblBits - 1));const double zp = *reinterpret_cast(&iZp);const double l2zp = std::log2(zp);gPlusLog2Table[i] = l2zp;}}__m256d __vectorcall Log2TblPlus(__m256d x) {const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(_mm256_castpd_si256(x), gHigh32Permute));//这就要求x非负,因为符号位之前没有清零//计算指数.const __m128i exps32 = _mm_srai_epi32(high32, 20);const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);//计算 y 大约等于 log2(z)const __m128i 索引 = _mm_and_si128(cSseMantTblMask,_mm_srai_epi32(high32, 20 - cnLog2TblBits));const __m256d y = _mm256_i32gather_pd(gPlusLog2Table,索引,/*每项的字节数*/8);//将 A 计算为 z/exp2(y)const __m256d exp2_Y = _mm256_or_pd(cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));//计算 t=(A-1)/(A+1).分子和分母都将被 exp2_Y 除const __m256d tNum = _mm256_sub_pd(z, exp2_Y);const __m256d tDen = _mm256_add_pd(z, exp2_Y);//从 https://en.wikipedia.org/wiki/Logarithm#Power_series 的更高效系列"计算第一个多项式项const __m256d t = _mm256_div_pd(tNum, tDen);const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);//对数的前导整数部分const __m256d 前导 = _mm256_cvtepi32_pd(normExps);const __m256d log2_x = _mm256_add_pd(log2_z,领先);返回 log2_x;}

它结合了查找表方法和一阶多项式,主要在维基百科上进行了描述(链接在代码注释中).我可以在这里分配 8KB 的 L1 缓存(这是每个逻辑核心可用的 16KB L1 缓存的一半),因为对数计算对我来说确实是瓶颈,并且没有更多的东西需要 L1 缓存.

但是,如果您需要更多一级缓存来满足其他需求,您可以通过将 cnLog2TblBits 减少到例如,减少对数算法使用的缓存量5 以降低对数计算精度为代价.

或者为了保持较高的准确度,您可以通过添加以下内容来增加多项式项的数量:

命名空间{//...const __m256d gCoeff1 = _mm256_set1_pd(1.0/3);const __m256d gCoeff2 = _mm256_set1_pd(1.0/5);const __m256d gCoeff3 = _mm256_set1_pd(1.0/7);const __m256d gCoeff4 = _mm256_set1_pd(1.0/9);const __m256d gCoeff5 = _mm256_set1_pd(1.0/11);}

然后在 const __m256d t = _mm256_div_pd(tNum, tDen); 行之后改变 Log2TblPlus() 的尾部:

 const __m256d t2 = _mm256_mul_pd(t, t);//t**2const __m256d t3 = _mm256_mul_pd(t, t2);//t**3const __m256d term01 = _mm256_fmadd_pd(gCoeff1, t3, t);const __m256d t5 = _mm256_mul_pd(t3, t2);//t**5const __m256d term012 = _mm256_fmadd_pd(gCoeff2,t5,terms01);const __m256d t7 = _mm256_mul_pd(t5, t2);//t**7const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3,t7,terms012);const __m256d t9 = _mm256_mul_pd(t7, t2);//t**9const __m256d 条款01234 = _mm256_fmadd_pd(gCoeff4,t9,条款0123);const __m256d t11 = _mm256_mul_pd(t9, t2);//t**11const __m256d 条款012345 = _mm256_fmadd_pd(gCoeff5,t11,条款01234);const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);

然后注释//对数的前导整数部分,其余部分不变.

通常你不需要那么多项,即使是几位表,我只是提供系数和计算以供参考.很可能如果cnLog2TblBits==5,除了terms012,你不需要任何东西.但是我还没有做过这样的测量,你需要根据自己的需要试验一下.

显然,您计算的多项式项越少,计算速度就越快.

<小时>

编辑:这个问题在什么情况下 AVX2 收集指令会比单独加载数据更快? 表明如果

,您可能会获得性能改进

const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, 索引,/*每项的字节数*/8);

被替换为

const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],gPlusLog2Table[indexes.m128i_u32[2]],gPlusLog2Table[indexes.m128i_u32[1]],gPlusLog2Table[indexes.m128i_u32[0]]);

对于我的实现,它节省了大约 1.5 个周期,将计算 4 个对数的总周期数从 18 减少到 16.5,因此性能提高到每秒 8.7 亿对数.我将保留当前的实现,因为一旦 CPU 开始正确地执行 gather 操作(像 GPU 一样进行合并),它会更加惯用并且应该会更快.

EDIT2:在 Ryzen CPU 上(但不是在 Intel 上) 你可以通过替换获得更多的加速(大约 0.5 个周期)

const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(_mm256_castpd_si256(x), gHigh32Permute));

 const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,_MM_SHUFFLE(3, 1, 3, 1)));

SVML's __m256d _mm256_log2_pd (__m256d a) is not available on other compilers than Intel, and they say its performance is handicapped on AMD processors. There are some implementations on the internet referred in AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? and SIMD math libraries for SSE and AVX , however they seem to be more SSE than AVX2. There's also Agner Fog's vector library , however it's a large library having much more stuff that just vector log2, so from the implementation in it it's hard to figure out the essential parts for just the vector log2 operation.

So can someone just explain how to implement log2() operation for a vector of 4 double numbers efficiently? I.e. like what __m256d _mm256_log2_pd (__m256d a) does, but available for other compilers and reasonably efficient for both AMD and Intel processors.

EDIT: In my current specific case, the numbers are probabilities between 0 and 1, and logarithm is used for entropy computation: the negation of sum over all i of P[i]*log(P[i]). The range of floating-point exponents for P[i] is large, so the numbers can be close to 0. I'm not sure about accuracy, so would consider any solution starting with 30 bits of mantissa, especially a tuneable solution is preferred.

EDIT2: here is my implementation so far, based on "More efficient series" from https://en.wikipedia.org/wiki/Logarithm#Power_series . How can it be improved? (both performance and accuracy improvements are desired)

namespace {
  const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);
  const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);
  const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
  const __m128i gExpNormalizer = _mm_set1_epi32(1023);
  //TODO: some 128-bit variable or two 64-bit variables here?
  const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gVect1 = _mm256_set1_pd(1.0);
}

__m256d __vectorcall Log2(__m256d x) {
  const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
  const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
  const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
  const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
  const __m256d expsPD = _mm256_cvtepi32_pd(normExps);
  const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
    _mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));

  // Calculate t=(y-1)/(y+1) and t**2
  const __m256d tNum = _mm256_sub_pd(y, gVect1);
  const __m256d tDen = _mm256_add_pd(y, gVect1);
  const __m256d t = _mm256_div_pd(tNum, tDen);
  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);

  const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
  const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);

  return log2_x;
}

So far my implementation gives 405 268 490 operations per second, and it seems precise till the 8th digit. The performance is measured with the following function:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <immintrin.h>

// ... Log2() implementation here

const int64_t cnLogs = 100 * 1000 * 1000;

void BenchmarkLog2Vect() {
  __m256d sums = _mm256_setzero_pd();
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i += 4) {
    const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));
    const __m256d logs = Log2(x);
    sums = _mm256_add_pd(sums, logs);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  double sum = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];
  printf("Vect Log2: %.3lf Ops/sec calculated %.3lf
", cnLogs / nSec, sum);
}

Comparing to the results of Logarithm in C++ and assembly , the current vector implementation is 4 times faster than std::log2() and 2.5 times faster than std::log().

Specifically, the following approximation formula is used:

解决方案

Finally here is my best result which on Ryzen 1800X @3.6GHz gives about 0.8 billion of logarithms per second (200 million vectors of 4 logarithms in each) in a single thread, and is accurate till a few last bits in the mantissa. Spoiler: see in the end how to increase performance to 0.87 billion logarithms per second.

Special cases: Negative numbers, negative infinity and NaNs with negative sign bit are treated as if they are very close to 0 (result in some garbage large negative "logarithm" values). Positive infinity and NaNs with positive sign bit result in a logarithm around 1024. If you don't like how special cases are treated, one option is to add code that checks for them and does what suits you better. This will make the computation slower.

namespace {
  // The limit is 19 because we process only high 32 bits of doubles, and out of
  //   20 bits of mantissa there, 1 bit is used for rounding.
  constexpr uint8_t cnLog2TblBits = 10; // 1024 numbers times 8 bytes = 8KB.
  constexpr uint16_t cZeroExp = 1023;
  const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));
  const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));
  const __m256i cAvxExp2YMask = _mm256_set1_epi64x(
    ~((1ULL << (52-cnLog2TblBits)) - 1) );
  const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(
    1ULL << (52 - cnLog2TblBits - 1)));
  const __m256d gCommMul1 = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
  const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);
  const __m128i gExpNorm0 = _mm_set1_epi32(1023);
  // plus |cnLog2TblBits|th highest mantissa bit
  double gPlusLog2Table[1 << cnLog2TblBits];
} // anonymous namespace

void InitLog2Table() {
  for(uint32_t i=0; i<(1<<cnLog2TblBits); i++) {
    const uint64_t iZp = (uint64_t(cZeroExp) << 52)
      | (uint64_t(i) << (52 - cnLog2TblBits)) | (1ULL << (52 - cnLog2TblBits - 1));
    const double zp = *reinterpret_cast<const double*>(&iZp);
    const double l2zp = std::log2(zp);
    gPlusLog2Table[i] = l2zp;
  }
}

__m256d __vectorcall Log2TblPlus(__m256d x) {
  const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);
  const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);

  const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
    _mm256_castpd_si256(x), gHigh32Permute));
  // This requires that x is non-negative, because the sign bit is not cleared before
  //   computing the exponent.
  const __m128i exps32 = _mm_srai_epi32(high32, 20);
  const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);

  // Compute y as approximately equal to log2(z)
  const __m128i indexes = _mm_and_si128(cSseMantTblMask,
    _mm_srai_epi32(high32, 20 - cnLog2TblBits));
  const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
    /*number of bytes per item*/ 8);
  // Compute A as z/exp2(y)
  const __m256d exp2_Y = _mm256_or_pd(
    cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));

  // Calculate t=(A-1)/(A+1). Both numerator and denominator would be divided by exp2_Y
  const __m256d tNum = _mm256_sub_pd(z, exp2_Y);
  const __m256d tDen = _mm256_add_pd(z, exp2_Y);

  // Compute the first polynomial term from "More efficient series" of https://en.wikipedia.org/wiki/Logarithm#Power_series
  const __m256d t = _mm256_div_pd(tNum, tDen);

  const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);

  // Leading integer part for the logarithm
  const __m256d leading = _mm256_cvtepi32_pd(normExps);

  const __m256d log2_x = _mm256_add_pd(log2_z, leading);
  return log2_x;
}

It uses a combination of lookup table approach and a 1st degree polynomial, mostly described on Wikipedia (the link is in the code comments). I can afford to allocate 8KB of L1 cache here (which is a half of 16KB L1 cache available per logical core), because logarithm computation is really the bottleneck for me and there is not much more anything that needs L1 cache.

However, if you need more L1 cache for the other needs, you can decrease the amount of cache used by logarithm algorithm by reducing cnLog2TblBits to e.g. 5 at expense of decreasing the accuracy of logarithm computation.

Or to keep the accuracy high, you can increase the number of polynomial terms by adding:

namespace {
  // ...
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gCoeff5 = _mm256_set1_pd(1.0 / 11);
}

And then changing the tail of Log2TblPlus() after line const __m256d t = _mm256_div_pd(tNum, tDen);:

  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
  const __m256d t11 = _mm256_mul_pd(t9, t2); // t**11
  const __m256d terms012345 = _mm256_fmadd_pd(gCoeff5, t11, terms01234);

  const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);

Then comment // Leading integer part for the logarithm and the rest unchanged follow.

Normally you don't need that many terms, even for a few-bit table, I just provided the coefficients and computations for reference. It's likely that if cnLog2TblBits==5, you won't need anything beyond terms012. But I haven't done such measurements, you need to experiment what suits your needs.

The less polynomial terms you compute, obviously, the faster the computations are.


EDIT: this question In what situation would the AVX2 gather instructions be faster than individually loading the data? suggests that you may get a performance improvement if

const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
  /*number of bytes per item*/ 8);

is replaced by

const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],
  gPlusLog2Table[indexes.m128i_u32[2]],
  gPlusLog2Table[indexes.m128i_u32[1]],
  gPlusLog2Table[indexes.m128i_u32[0]]);

For my implementation it saves about 1.5 cycle, reducing the total cycle count to compute 4 logarithms from 18 to 16.5, thus the performance rises to 0.87 billion logarithms per second. I'm leaving the current implementation as is because it's more idiomatic and shoud be faster once the CPUs start doing gather operations right (with coalescing like GPUs do).

EDIT2: on Ryzen CPU (but not on Intel) you can get a little more speedup (about 0.5 cycle) by replacing

const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
  _mm256_castpd_si256(x), gHigh32Permute));

with

  const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
  const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));
  const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,
    _MM_SHUFFLE(3, 1, 3, 1)));

相关文章