为什么 std::fill(0) 比 std::fill(1) 慢?

我在一个系统上观察到 std::fill 在大型 std::vector<int> 上设置常量值时明显且始终较慢 >0 与常量值 1 或动态值相比:

5.8 GiB/s 对比 7.5 GiB/s

然而,对于较小的数据大小,结果是不同的,其中 fill(0) 更快:

如果有多个线程,在 4 GiB 数据大小时,fill(1) 显示出更高的斜率,但达到的峰值比 fill(0) (51 GiB/s 对比 90 GiB/s):

这就引出了第二个问题,为什么fill(1)的峰值带宽要低得多.

对此的测试系统是一个双插槽 Intel Xeon CPU E5-2680 v3,频率设置为 2.5 GHz(通过 /sys/cpufreq)和 8x16 GiB DDR4-2133.我使用 GCC 6.1.0 (-O3) 和 Intel 编译器 17.0.1 (-fast) 进行了测试,两者都得到了相同的结果.GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 已设置.Strem/add/24 个线程在系统上获得 85 GiB/s.

我能够在不同的 Haswell 双插槽服务器系统上重现这种效果,但不能在任何其他架构上重现.例如,在 Sandy Bridge EP 上,内存性能是相同的,而在缓存中 fill(0) 的速度要快得多.

这里是重现的代码:

#include <算法>#include #include #include #include <向量>使用值 = int;使用向量 = std::vector;constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;void __attribute__((noinline)) fill0(vector& v) {std::fill(v.begin(), v.end(), 0);}void __attribute__((noinline)) fill1(vector& v) {std::fill(v.begin(), v.end(), 1);}无效工作台(size_t data_size,int nthreads){#pragma omp 并行 num_threads(nthreads){向量 v(data_size/(sizeof(value) * nthreads));自动重复 = write_size/data_size;#pragma omp 屏障自动 t0 = omp_get_wtime();for (auto r = 0; r <重复; r++)填充0(v);#pragma omp 屏障自动 t1 = omp_get_wtime();for (auto r = 0; r <重复; r++)填充1(v);#pragma omp 屏障自动 t2 = omp_get_wtime();#pragma omp 主std::cout <<数据大小<<", " <<nthreads<<", " <<write_size/(t1 - t0) <<"、"<<write_size/(t2 - t1) <<"
";}}int main(int argc, const char* argv[]) {std::cout <<大小,n 个线程,填充 0,填充 1
";for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {长凳(字节,1);}for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {长凳(字节,omp_get_max_threads());}for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {长凳(最大数据大小,nthreads);}}

使用 g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp 编译呈现的结果.

解决方案

来自你的问题 + 来自你的答案的编译器生成的 asm:

  • fill(0) 是一个 ERMSB rep stosb 将在优化的微编码循环中使用 256b 存储.(如果缓冲区对齐,效果最佳,可能至少为 32B 或 64B).
  • fill(1) 是一个简单的 128 位 movaps 向量存储循环.无论宽度如何,每个内核时钟周期只能执行一个存储,最高可达 256b AVX.所以128b存储只能填满Haswell的L1D缓存写入带宽的一半.这就是 fill(0) 对于高达 ~32kiB 的缓冲区的速度大约是其 2 倍的原因.使用 -march=haswell-march=native 编译以修复该问题.

    Haswell 只能勉强跟上循环开销,但它仍然可以在每个时钟运行 1 个存储,即使它根本没有展开.但是每个时钟有 4 个融合域 uops,在乱序窗口中有很多填充物占据了空间.一些展开可能会让 TLB 未命中在存储发生的位置之前开始解决,因为存储地址 uop 的吞吐量比存储数据的吞吐量要大.展开可能有助于弥补 ERMSB 和适合 L1D 的缓冲区的向量循环之间的其余差异.(对该问题的评论说 -march=native 仅对 L1 有帮助 fill(1).)

请注意,rep movsd(可用于为int 元素实现fill(1))可能与<代码>代表 stosb 在 Haswell 上.虽然只有官方文档只保证 ERMSB 给出快速的 rep stosb(而不是 rep stosd),支持 ERMSB 的实际 CPU 使用类似高效的微码来rep stosd.对 IvyBridge 有一些疑问,可能只有 b 快.请参阅@BeeOnRope 出色的ERMSB 答案,了解有关此内容的更新.

gcc 有一些用于字符串操作的 x86 调整选项(像 -mstringop-strategy=alg 和 -mmemset-strategy=strategy),但 IDK 如果其中任何一个都会让它实际发出 代表 movsd 用于 fill(1).可能不是,因为我假设代码以循环开始,而不是 memset.

<小时><块引用>

如果有多个线程,在 4 GiB 数据大小时,fill(1) 显示出更高的斜率,但达到的峰值比 fill(0) 低得多(51 GiB/s 对 90 GiB/s):

普通 movaps 存储到冷缓存行会触发 阅读所有权 (RFO).当 movaps 写入前 16 个字节时,大量实际 DRAM 带宽用于从内存读取缓存行.ERMSB 存储对其存储使用无 RFO 协议,因此内存控制器仅进行写入.(除了杂项读取,比如页表,即使在 L3 缓存中也有任何页面遍历未命中,并且可能在中断处理程序中或其他任何加载未命中).

@BeeOnRope 在评论中解释常规 RFO 存储与 ERMSB 使用的 RFO 避免协议之间的差异对于服务器 CPU 上的某些缓冲区大小范围存在不利影响,其中非核心/L3 缓存中存在高延迟.另请参阅链接的 ERMSB 答案,了解有关 RFO 与非 RFO 的更多信息,以及多核 Intel CPU 中非核心(L3/内存)的高延迟是单核带宽的一个问题.

<小时>

movntps (_mm_stream_ps()) 存储是弱排序的,因此它们可以绕过缓存并直接进入整个缓存的内存-line 一次而无需将缓存行读入 L1D.movntps 避免了 RFO,就像 rep stos 所做的那样.(rep stos 存储可以相互重新排序,但不能超出指令的边界.)

您的 movntps 结果在您更新后的答案中令人惊讶.
对于具有大缓冲区的单个线程,您的结果是 movnt >> 常规 RFO > ERMSB.因此,这两种非 RFO 方法位于普通旧商店的相反两侧真的很奇怪,而且 ERMSB 远非最佳.我目前没有对此的解释.(欢迎编辑并提供解释 + 良好的证据).

正如我们预期的那样,movnt 允许多个线程实现高聚合存储带宽,如 ERMSB.movnt 总是直接进入行填充缓冲区,然后进入内存,因此适合缓存的缓冲区大小要慢得多.每个时钟一个 128b 矢量足以轻松地将单个内核的无 RFO 带宽饱和到 DRAM.可能 vmovntps ymm (256b) 在存储 CPU-bound AVX 256b 向量化计算的结果时(即仅当它省去解包到128b的麻烦).

movnti 带宽很低,因为存储在 4B 块中的瓶颈在于每个时钟 1 个存储 uop 将数据添加到行填充缓冲区,而不是将这些行满缓冲区发送到 DRAM(直到您有足够的线程使内存带宽饱和).

<小时>

@osgx 发布了评论中的一些有趣链接::>

  • Agner Fog 的 asm 优化指南、指令表和微架构指南:http://agner.org/optimize/
  • 英特尔优化指南:http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.

  • NUMA 监听:http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/

  • https://software.intel.com/en-我们/文章/intelr-memory-latency-checker
  • 缓存一致性协议和记忆英特尔 Haswell-EP 架构的性能

另见 x86 标签中的其他内容维基.

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster:

With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.

The test system for this was a dual socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast), both get identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. Strem/add/24 threads gets 85 GiB/s on the system.

I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "
";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1
";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}

Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.

解决方案

From your question + the compiler-generated asm from your answer:

  • fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
  • fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.

    Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)

Note that rep movsd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell. Although only the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only b is fast. See the @BeeOnRope's excellent ERMSB answer for updates on this.

gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep movsd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.


With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB stores use a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever).

@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.


movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)

Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).

As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).

movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those line-full buffers to DRAM (until you have enough threads to saturate memory bandwidth).


@osgx posted some interesting links in comments:

  • Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
  • Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.

  • NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/

  • https://software.intel.com/en-us/articles/intelr-memory-latency-checker
  • Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture

See also other stuff in the x86 tag wiki.

相关文章