如何加速这个 LUT 查找的直方图?

2021-12-20 00:00:00 simd histogram optimization c++ x86

首先,我有一个数组int a[1000][1000].所有这些整数都在 0 到 32767 之间,它们是已知常量:它们在程序运行期间永远不会改变.

First, I have an array int a[1000][1000]. All these integers are between 0 and 32767 ,and they are known constants: they never change during a run of the program.

其次,我有一个数组 b[32768],其中包含 0 到 32 之间的整数.我使用这个数组将 a 中的所有数组映射到 32 个 bin:

Second, I have an array b[32768], which contains integers between 0 and 32. I use this array to map all arrays in a to 32 bins:

int bins[32]{};
for (auto e : a[i])//mapping a[i] to 32 bins.
    bins[b[e]]++;

每次,数组 b 都会用一个新数组初始化,我需要将 数组 a 中的所有 1000 个数组(每个包含 1000 个整数)散列到 1000 个数组,每个数组包含 32 个整数代表有多少整数落入每个 bin 中.

Each time, array b will be initialized with a new array, and I need to hash all those 1000 arrays in array a (each contains 1000 ints) to 1000 arrays each contains 32 ints represents for how many ints fall into its each bin .

int new_array[32768] = {some new mapping};
copy(begin(new_array), end(new_array), begin(b));//reload array b;

int bins[1000][32]{};//output array to store results .
for (int i = 0; i < 1000;i++)
    for (auto e : a[i])//hashing a[i] to 32 bins.
        bins[i][b[e]]++;

我可以在 0.00237 秒内映射 1000*1000 个值.有没有其他方法可以加速我的代码?(比如SIMD?)这段代码是我程序的瓶颈.

I can map 1000*1000 values in 0.00237 seconds. Is there any other way that I can speed up my code? (Like SIMD?) This piece of code is the bottleneck of my program.

推荐答案

这本质上是一个直方图问题.您使用 32k 条目的查找表将 16 位值映射到 5 位值,但之后它只是对 LUT 结果进行直方图绘制.如++counts[ b[a[j]] ];,其中countsbins[i].有关直方图的更多信息,请参见下文.

This is essentially a histogram problem. You're mapping values 16-bit values to 5-bit values with a 32k-entry lookup table, but after that it's just histogramming the LUT results. Like ++counts[ b[a[j]] ];, where counts is bins[i]. See below for more about histograms.

首先,您可以使用尽可能小的数据类型来增加 LUT(和原始数据)的密度.在 x86 上,将 8 位或 16 位数据加载到寄存器中的零或符号扩展加载与常规 32 位 int 加载几乎完全相同(假设两者都命中缓存), 8 位或 16 位存储也和 32 位存储一样便宜.

First of all, you can use the smallest possible data-types to increase the density of your LUT (and of the original data). On x86, a zero or sign-extending load of 8-bit or 16-bit data into a register is almost exactly the same cost as a regular 32-bit int load (assuming both hit in cache), and an 8-bit or 16-bit store is also just as cheap as a 32-bit store.

由于您的数据大小超过 L1 d-cache 大小(所有最近的 Intel 设计为 32kiB),并且您以分散的模式访问它,您可以从缩小缓存占用空间中获益很多.(有关 x86 性能的更多信息,请参阅 x86 标签wiki,尤其是 Agner Fog 的东西).

Since your data size exceeds L1 d-cache size (32kiB for all recent Intel designs), and you access it in a scattered pattern, you have a lot to gain from shrinking your cache footprint. (For more x86 perf info, see the x86 tag wiki, especially Agner Fog's stuff).

由于 a 在每个平面中的条目少于 65536 个,因此您的 bin 计数永远不会溢出 16 位计数器,因此 bins 可以是 uint16_t以及.

Since a has less than 65536 entries in each plane, your bin counts will never overflow a 16-bit counter, so bins can be uint16_t as well.

你的 copy() 毫无意义.为什么要复制到 b[32768] 而不是让内部循环使用指向当前 LUT 的指针?您以只读方式使用它.如果您无法更改生成不同 LUT 的代码以生成 int8_t<,您仍然想要复制的唯一原因是从 int 复制到 uin8_t/code> 或 uint8_t 放在首位.

Your copy() makes no sense. Why are you copying into b[32768] instead of having your inner loop use a pointer to the current LUT? You use it read-only. The only reason you'd still want to copy is to copy from int to uin8_t if you can't change the code that produces different LUTs to produce int8_t or uint8_t in the first place.

这个版本利用了这些想法和一些直方图技巧,并编译成看起来不错的 asm (Godbolt 编译器 explorer: gcc6.2 -O3 -march=haswell (AVX2)):

This version takes advantage of those ideas and a few histogram tricks, and compiles to asm that looks good (Godbolt compiler explorer: gcc6.2 -O3 -march=haswell (AVX2)):

// untested
//#include <algorithm>
#include <stdint.h>

const int PLANES = 1000;
void use_bins(uint16_t bins[PLANES][32]);   // pass the result to an extern function so it doesn't optimize away

                       // 65536 or higher triggers the static_assert
alignas(64) static uint16_t a[PLANES][1000];  // static/global, I guess?

void lut_and_histogram(uint8_t __restrict__ lut[32768])
{

    alignas(16) uint16_t bins[PLANES][32];  // don't zero the whole thing up front: that would evict more data from cache than necessary
                                            // Better would be zeroing the relevant plane of each bin right before using.
                                            // you pay the rep stosq startup overhead more times, though.

    for (int i = 0; i < PLANES;i++) {
        alignas(16) uint16_t tmpbins[4][32] = {0};
        constexpr int a_elems = sizeof(a[0])/sizeof(uint16_t);
        static_assert(a_elems > 1, "someone changed a[] into a* and forgot to update this code");
        static_assert(a_elems <= UINT16_MAX, "bins could overflow");
        const uint16_t *ai = a[i];
        for (int j = 0 ; j<a_elems ; j+=4) { //hashing a[i] to 32 bins.
            // Unrolling to separate bin arrays reduces serial dependencies
            // to avoid bottlenecks when the same bin is used repeatedly.
            // This has to be balanced against using too much L1 cache for the bins.

            // TODO: load a vector of data from ai[j] and unpack it with pextrw.
            // even just loading a uint64_t and unpacking it to 4 uint16_t would help.
            tmpbins[0][ lut[ai[j+0]] ]++;
            tmpbins[1][ lut[ai[j+1]] ]++;
            tmpbins[2][ lut[ai[j+2]] ]++;
            tmpbins[3][ lut[ai[j+3]] ]++;
            static_assert(a_elems % 4 == 0, "unroll factor doesn't divide a element count");
        }

        // TODO: do multiple a[i] in parallel instead of slicing up a single run.
        for (int k = 0 ; k<32 ; k++) {
            // gcc does auto-vectorize this with a short fully-unrolled VMOVDQA / VPADDW x3
            bins[i][k] = tmpbins[0][k] + tmpbins[1][k] +
                         tmpbins[2][k] + tmpbins[3][k];
        }
    }
  
    // do something with bins.  An extern function stops it from optimizing away.
    use_bins(bins);
}

内循环 asm 如下所示:

The inner-loop asm looks like this:

.L2:
    movzx   ecx, WORD PTR [rdx]
        add     rdx, 8                    # pointer increment over ai[]
    movzx   ecx, BYTE PTR [rsi+rcx]
    add     WORD PTR [rbp-64272+rcx*2], 1     # memory-destination increment of a histogram element
    movzx   ecx, WORD PTR [rdx-6]
    movzx   ecx, BYTE PTR [rsi+rcx]
    add     WORD PTR [rbp-64208+rcx*2], 1
    ... repeated twice more

使用来自 rbp 的 32 位偏移量(而不是来自 rsp 的 8 位偏移量,或使用另一个寄存器:/),代码密度并不好.尽管如此,平均指令长度不会太长,以至于它可能会成为任何现代 CPU 上指令解码的瓶颈.

With those 32-bit offsets from rbp (instead of 8-bit offsets from rsp, or using another register :/) the code density isn't wonderful. Still, the average instruction length isn't so long that it's likely to bottleneck on instruction decode on any modern CPU.

因为无论如何你都需要做多个直方图,所以只需并行执行 4 到 8 个直方图,而不是为单个直方图切片.展开因子甚至不必是 2 的幂.

Since you need to do multiple histograms anyway, just do 4 to 8 of them in parallel instead of slicing the bins for a single histogram. The unroll factor doesn't even have to be a power of 2.

这消除了 bins[i][k] = sum(tmpbins[0..3][k])k 结束时循环的需要.

That eliminates the need for the bins[i][k] = sum(tmpbins[0..3][k]) loop over k at the end.

在使用前将 bins[i..i+unroll_factor][0..31] 归零,而不是在循环外将整个事物归零.这样一来,当您开始时,L1 缓存中的所有 bin 都会很热,并且这项工作可能会与内循环的负载更重的工作重叠.

Zero bins[i..i+unroll_factor][0..31] right before use, instead of zeroing the whole thing outside the loop. That way all the bins will be hot in L1 cache when you start, and this work can overlap with the more load-heavy work of the inner loop.

硬件预取器可以跟踪多个顺序流,因此不必担心从 a 加载时会出现更多缓存未命中.(为此也使用矢量加载,并在加载后将它们切碎).

Hardware prefetchers can keep track of multiple sequential streams, so don't worry about having a lot more cache misses in loading from a. (Also use vector loads for this, and slice them up after loading).

其他有关直方图的有用答案的问题:

Other questions with useful answers about histograms:

  • 在 SIMD 中矢量化直方图的方法? 建议使用 multi-bin-数组和最后的技巧.
  • 优化 SIMD 直方图计算 x86 asm 加载 a 值并使用 pextrb 提取到整数寄存器.(在您的代码中,您将使用 pextrw/_mm_extract_epi16).随着所有加载/存储 uops 的发生,进行矢量加载并使用 ALU 操作来解包是有意义的.凭借良好的 L1 命中率,内存 uop 吞吐量可能成为瓶颈,而不是内存/缓存延迟.
  • 如何使用 Neon 内在函数优化直方图统计数据?一些相同的想法:bins 数组的多个副本.它还具有特定于 ARM 的建议,用于在 SIMD 向量中进行地址计算(ARM 可以在单个指令中从向量中获取两个标量),并以相反的方式布置多仓阵列.
  • Methods to vectorise histogram in SIMD? suggests the multiple-bin-arrays and sum at the end trick.
  • Optimizing SIMD histogram calculation x86 asm loading a vector of a values and extracting to integer registers with pextrb. (In your code, you'd use pextrw / _mm_extract_epi16). With all the load/store uops happening, doing a vector load and using ALU ops to unpack makes sense. With good L1 hit rates, memory uop throughput may be the bottleneck, not memory / cache latency.
  • How to optimize histogram statistics with neon intrinsics? some of the same ideas: multiple copies of the bins array. It also has an ARM-specific suggestion for doing address calculations in a SIMD vector (ARM can get two scalars from a vector in a single instruction), and laying out the multiple-bins array the opposite way.

如果您打算在 Intel Skylake 上运行它,您甚至可以使用 AVX2 收集指令进行 LUT 查找.(在 Broadwell 上,这可能是一个盈亏平衡点,而在 Haswell 上它会失败;他们支持 vpgatherdd (_mm_i32gather_epi32),但没有那么高效的实现.希望 Skylake 在元素之间存在重叠时避免多次访问相同的缓存行).

If you're going to run this on Intel Skylake, you could maybe even do the LUT lookups with AVX2 gather instructions. (On Broadwell, it's probably a break-even, and on Haswell it would lose; they support vpgatherdd (_mm_i32gather_epi32), but don't have as efficient an implementation. Hopefully Skylake avoids hitting the same cache line multiple times when there is overlap between elements).

是的,您仍然可以从 uint16_t 数组(比例因子 = 2)中进行收集,即使最小的收集粒度是 32 位元素.这意味着您在每个 32 位向量元素的高半部分得到垃圾而不是 0,但这无关紧要.缓存行拆分并不理想,因为我们可能会遇到缓存吞吐量的瓶颈.

And yes, you can still gather from an array of uint16_t (with scale factor = 2), even though the smallest gather granularity is 32-bit elements. It means you get garbage in the high half of each 32-bit vector element instead of 0, but that shouldn't matter. Cache-line splits aren't ideal, since we're probably bottlenecked on cache throughput.

收集元素的前半部分中的垃圾并不重要,因为无论如何您都使用 pextrw 提取了有用的 16 位.(并使用标量代码进行过程的直方图部分).

Garbage in the high half of gathered elements doesn't matter because you're extracting only the useful 16 bits anyway with pextrw. (And doing the histogram part of the process with scalar code).

您可以潜在地使用另一个聚集从直方图箱加载,只要每个元素来自直方图箱的单独切片/平面.否则,如果两个元素来自同一个 bin,当您手动将增加的向量散布回直方图(使用标量存储)时,它只会增加 1.这种针对分散存储的冲突检测就是 AVX512CD 存在的原因.AVX512 确实有分散指令,也有收集指令(已经在 AVX2 中添加).

You could potentially use another gather to load from the histogram bins, as long as each element comes from a separate slice/plane of histogram bins. Otherwise, if two elements come from the same bin, it would only be incremented by one when you manually scatter the incremented vector back into the histogram (with scalar stores). This kind of conflict detection for scatter stores is why AVX512CD exists. AVX512 does have scatter instructions, as well as gather (already added in AVX2).

AVX512

参见 nofollowKirill Yukhin 2014 年幻灯片的第 50 页 示例循环,重试直到没有冲突;但它没有显示 get_conflict_free_subset() 如何根据 __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd) (返回位图在它与之冲突的所有前面元素的每个元素中).正如@Mysticial 指出的那样,如果冲突检测指令只是产生一个掩码寄存器结果,而不是另一个向量,那么简单的实现就没有那么简单.

See page 50 of Kirill Yukhin's slides from 2014 for an example loop that retries until there are no conflicts; but it doesn't show how get_conflict_free_subset() is implemented in terms of __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd) (which returns a bitmap in each element of all the preceding elements it conflicts with). As @Mysticial points out, a simple implementation is less simple than it would be if the conflict-detect instruction simply produced a mask-register result, instead of another vector.

我搜索了但没有找到英特尔发布的关于使用 AVX512CD 的教程/指南,但大概他们认为使用 _mm512_lzcnt_epi32 (vplzcntd) 对 vpconflictd 在某些情况下很有用,因为它也是 AVX512CD 的一部分.

I searched for but didn't find an Intel-published tutorial/guide on using AVX512CD, but presumably they think using _mm512_lzcnt_epi32 (vplzcntd) on the result of vpconflictd is useful for some cases, because it's also part of AVX512CD.

也许你应该"做一些比跳过所有有任何冲突的元素更聪明的事情?也许是为了检测标量回退会更好的情况,例如所有 16 个双字元素都具有相同的索引?vpbroadcastmw2d 将掩码寄存器广播到结果的所有 32 位元素,这样您就可以将掩码寄存器值与 vpconflictd 中每个元素中的位图对齐.(并且已经有来自 AVX512F 的元素之间的比较、按位和其他操作).

Maybe you're "supposed" to do something more clever than just skipping all elements that have any conflicts? Maybe to detect a case where a scalar fallback would be better, e.g. all 16 dword elements have the same index? vpbroadcastmw2d broadcasts a mask register to all 32-bit elements of the result, so that lets you line up a mask-register value with the bitmaps in each element from vpconflictd. (And there are already compare, bitwise, and other operations between elements from AVX512F).

Kirill 的幻灯片列表 VPTESTNM{D,Q}(来自 AVX512F)以及冲突检测指令.它从 DEST[j] = (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)?1:0.即AND元素在一起,如果它们不相交,则将该元素的掩码结果设置为1.

Kirill's slides list VPTESTNM{D,Q} (from AVX512F) along with the conflict-detection instructions. It generates a mask from DEST[j] = (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)? 1 : 0. i.e. AND elements together, and set the mask result for that element to 1 if they don't intersect.

可能也相关:http://colfaxresearch.com/knl-avx512/ 说 为了实际说明,我们构建并优化了用于粒子合并粒子的微内核",以及一些 AVX2 代码(我认为).但这背后是我没有做过的免费注册.根据图表,我认为他们正在将实际的散布部分作为标量进行处理,经过一些矢量化的东西来生成他们想要散布的数据.第一个链接说第二个链接是用于之前的指令集".

Possibly also relevant: http://colfaxresearch.com/knl-avx512/ says "For a practical illustration, we construct and optimize a micro-kernel for particle binning particles", with some code for AVX2 (I think). But it's behind a free registration which I haven't done. Based on the diagram, I think they're doing the actual scatter part as scalar, after some vectorized stuff to produce data they want to scatter. The first link says the 2nd link is "for previous instruction sets".

当存储桶的数量与数组的大小相比较小时,复制计数数组并展开以最小化重复元素的存储转发延迟瓶颈变得可行.但是对于聚集/分散策略,如果我们为每个向量元素使用不同的数组,它也避免了冲突的可能性,解决了正确性问题.

When the number of buckets is small compared to the size of the array, it becomes viable to replicate the count arrays and unroll to minimize store-forwarding latency bottlenecks with repeated elements. But for a gather/scatter strategy, it also avoids the possibility of conflicts, solving the correctness problem, if we use a different array for each vector element.

当一个收集/分散指令只需要一个数组基数时,我们怎么做?使所有计数数组连续,并用一个额外的 SIMD 添加指令偏移每个 index 向量,完全取代冲突检测和分支.

How can we do that when a gather / scatter instruction only takes one array base? Make all the count arrays contiguous, and offset each index vector with one extra SIMD add instruction, fully replacing conflict detection and branching.

如果桶的数量不是 16 的倍数,您可能想要四舍五入数组几何,以便每个计数子集从对齐的偏移量开始.或者不,如果缓存局部性比避免最终减少的未对齐负载更重要.

If the number of buckets isn't a multiple of 16, you might want to round up the array geometry so each subset of counts starts at an aligned offset. Or not, if cache locality is more important than avoiding unaligned loads in the reduction at the end.

    const size_t nb = 32;  // number of buckets
    const int VEC_WIDTH = 16;  // sizeof(__m512i) / sizeof(uint32_t)
    alignas(__m512i) uint32_t counts[nb * VEC_WIDTH] = {0};

// then in your histo loop

    __m512i idx = ...;  // in this case from LUT lookups
    idx = _mm512_add_epi32(idx, _mm512_setr_epi32(
       0*nb, 1*nb,  2*nb,  3*nb,   4*nb,  5*nb,  6*nb,  7*nb,
       8*nb, 9*nb, 10*nb, 11*nb,  12*nb, 13*nb, 14*nb, 15*nb));
       // note these are C array indexes, not byte offsets
    __m512i vc = _mm512_i32gather_epi32(idx, counts, sizeof(counts[0]));
    vc = _mm512_add_epi32(vc, _mm512_set1_epi32(1));
    _mm512_i32scatter_epi32(counts, idx, vc, sizeof(counts[0]));

https://godbolt.org/z/8Kesx7sEK 表明上述代码实际上可以编译.(在循环内,向量常数设置可能会被提升,但不会在每次收集或分散之前将掩码寄存器设置为全一,或准备归零合并目标.)

https://godbolt.org/z/8Kesx7sEK shows that the above code actually compiles. (Inside a loop, the vector-constant setup could get hoisted, but not setting mask registers to all-one before each gather or scatter, or preparing a zeroed merge destination.)

然后在主直方图循环之后,减少到一个计数数组:

Then after the main histogram loop, reduce down to one count array:

// Optionally with size_t nb as an arg
// also optionally use restrict if you never reduce in-place, into the bottom of the input.
void reduce_counts(int *output, const int *counts)
{
    for (int i = 0 ; i < nb - (VEC_WIDTH-1) ; i+=VEC_WIDTH) {
        __m512i v = _mm512_load_si512(&counts[i]);  // aligned load, full cache line
        // optional: unroll this and accumulate two vectors in parallel for better spatial locality and more ILP
        for (int offset = nb; offset < nb*VEC_WIDTH ; offset+=nb) {
            __m512i tmp = _mm512_loadu_si512(&counts[i + offset]);
            v = _mm512_add_epi32(v, tmp);
        }
        _mm512_storeu_si512(&output[i], v);
    }

   // if  nb isn't a multiple of the vector width, do some cleanup here
   // possibly using a masked store to write into a final odd-sized destination
}

显然这对于??太多的桶是不好的;你最终不得不将更多的内存归零,并在最后循环很多.使用 256 位而不是 512 位收集会有所帮助,您只需要一半的数组,但是收集/分散指令的效率会随着向量的增加而提高.例如在 Cascade Lake 上,256 位每 5 个周期一个 vpgatherdd,512 位每 9.25 个周期一个.(两者都是前端的 4 uop)

Obviously this is bad with too many buckets; you end up having to zero way more memory, and loop over a lot of it at the end. Using 256-bit instead of 512-bit gathers helps, you only need half as many arrays, but efficiency of gather/scatter instructions improves with wider vectors. e.g. one vpgatherdd per 5 cycles for 256-bit on Cascade Lake, one per 9.25 for 512-bit. (And both are 4 uops for the front-end)

或者在 Ice Lake 上,每 7 个周期一个 vpscatterdd ymm,每 11 个周期一个 zmm.(与 2x ymm 的 14 相比).https://uops.info/

Or on Ice Lake, one vpscatterdd ymm per 7 cycles, one zmm per 11 cycles. (vs. 14 for 2x ymm). https://uops.info/

在您的 bins[1000][32] 案例中,您实际上可以使用 bins[i+0..15] 的后面元素作为额外的计数数组,如果您先归零,至少在前 1000-15 次外循环迭代中.这将避免触及额外的内存:下一个外部循环的归零将从前一个 counts[32] 开始,有效地.

In your bins[1000][32] case, you could actually use the later elements of bins[i+0..15] as extra count arrays, if you zero first, at least for the first 1000-15 outer loop iterations. That would avoid touching extra memory: zeroing for the next outer loop would start at the previous counts[32], effectively.

(这对于 C 2D 和 1D 数组来说会有点快速和松散,但是所有超过 [32] C 数组类型末尾的实际访问都将通过 memset(即 unsigned char*)或通过 _mm* 内在函数,也允许为任何东西设置别名)

(This would be playing a bit fast and loose with C 2D vs. 1D arrays, but all the actual accesses past the end of the [32] C array type would be via memset (i.e. unsigned char*) or via _mm* intrinsics which are also allowed to alias anything)

相关:

  • 微小的直方图(如 4 个桶)可以使用 count[0] += (arr[i] == 0) 等等,您可以使用 SIMD 打包比较进行矢量化 - 大型数组或列表的 4-bucket 直方图的微优化 当桶数小于或等于SIMD 向量中的元素.
  • Tiny histograms (like 4 buckets) can use count[0] += (arr[i] == 0) and so on, which you can vectorize with SIMD packed compares - Micro Optimization of a 4-bucket histogram of a large array or list This is interesting when the number of buckets is less than or equal to the number of elements in a SIMD vector.

相关文章