__builtin_prefetch: how much does it read?

I'm trying to optimize some C++ (RK4) by using __builtin_prefetch.

I can't figure out how to prefetch a whole structure.

I don't understand how much memory is read starting at const void *addr. I want the next values of from and to to be loaded already.

for (int i = from; i < to; i++)
{
    double kv = myLinks[i].kv;
    particle* from = con[i].Pfrom;
    particle* to = con[i].Pto;
    // Goal: prefetch con[i + 1].Pfrom and con[i + 1].Pto here
    double pos = to->px- from->px;
    double delta = from->r + to->r - pos;
    double k1 = axcel(kv, delta, from->mass) * dt; //axcel is an inlined function
    double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
    double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
    double k4 = axcel(kv, delta + k3, from->mass) * dt;
    #define likely(x)       __builtin_expect((x),1)
    if (likely(!from->bc))
    {
        from->x += (k1 + 2 * k2 + 2 * k3 + k4) / 6;
    }
}

Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/

Recommended answer

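I think it just emits a single prefetch machine instruction, which basically fetches one cache line, whose size is processor-specific (64 bytes is common on current x86 processors). A structure larger than one cache line therefore needs more than one prefetch to be covered completely.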
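Since one prefetch covers one cache line, prefetching a whole structure means issuing one prefetch per line the object touches. The following is only a minimal sketch under assumptions: the 64-byte line size (kCacheLine) is a guess that should be checked for your processor, and prefetch_object and lines_covered are hypothetical helper names, not part of GCC.

```cpp
#include <cstddef>

// Assumed cache-line size: 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: issue one prefetch per cache line covered by *obj.
template <typename T>
inline void prefetch_object(const T* obj)
{
    const char* p = reinterpret_cast<const char*>(obj);
    for (std::size_t off = 0; off < sizeof(T); off += kCacheLine)
        __builtin_prefetch(p + off, 0, 3); // 0 = read access, 3 = keep in all cache levels
    // An object that is not line-aligned can spill into one extra line,
    // so also touch its last byte (a duplicate prefetch is harmless).
    __builtin_prefetch(p + sizeof(T) - 1, 0, 3);
}

// Number of cache lines a line-aligned object of the given size occupies.
constexpr std::size_t lines_covered(std::size_t bytes)
{
    return (bytes + kCacheLine - 1) / kCacheLine;
}
```

For a small structure that fits in one line, a single __builtin_prefetch(ptr) is enough and the helper adds nothing.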

For instance, you could use __builtin_prefetch (con[i+3].Pfrom). In my (limited) experience, in such a loop it is better to prefetch several elements ahead rather than just the next one.
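Prefetching a few iterations ahead could look roughly like the sketch below. It is a simplified stand-in for the loop in the question, not the asker's actual code: the particle and connection layouts, the function name sum_overlap, and the distance kAhead = 3 are all assumptions for illustration, and the right distance has to be found by measurement.

```cpp
#include <cstddef>
#include <vector>

// Assumed layouts, loosely modelled on the names used in the question.
struct particle   { double px, r, mass, x; bool bc; };
struct connection { particle* Pfrom; particle* Pto; };

constexpr std::size_t kAhead = 3; // prefetch distance in iterations (assumption)

// Sums the overlap (r_from + r_to - distance) over all connections while
// prefetching the particles that will be needed kAhead iterations later.
double sum_overlap(const std::vector<connection>& con)
{
    double total = 0.0;
    const std::size_t n = con.size();
    for (std::size_t i = 0; i < n; ++i) {
        // The bound check guards the read of con[i + kAhead] itself;
        // the prefetch instruction never faults, but the array read could.
        if (i + kAhead < n) {
            __builtin_prefetch(con[i + kAhead].Pfrom);
            __builtin_prefetch(con[i + kAhead].Pto);
        }
        const particle* from = con[i].Pfrom;
        const particle* to   = con[i].Pto;
        total += from->r + to->r - (to->px - from->px);
    }
    return total;
}
```

The con array itself is read sequentially, which hardware prefetchers usually handle well; the explicit prefetches target the particles reached through pointers, since those addresses are harder for the hardware to predict.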

Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside one loop). If you do need them, measure the performance gain, and compile with GCC optimizations enabled (at least -O2). If you are very lucky, a manual __builtin_prefetch could improve the performance of your loop by 10 or 20% (but it could also hurt it).
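One simple way to measure is to time the loop with and without the prefetch calls and compare. The helper below is a rough sketch only: seconds_for is a hypothetical name, and taking the best of a few repeats is just one common convention for reducing noise, not a rigorous benchmarking method.

```cpp
#include <chrono>

// Hypothetical helper: runs work() `repeats` times and returns the
// shortest wall-clock duration in seconds (-1.0 if repeats <= 0).
template <typename F>
double seconds_for(F&& work, int repeats = 5)
{
    using clock = std::chrono::steady_clock; // monotonic, suited to intervals
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        work();
        const std::chrono::duration<double> dt = clock::now() - t0;
        if (best < 0.0 || dt.count() < best)
            best = dt.count();
    }
    return best;
}
```

Call it once with a lambda that runs the prefetching version of the loop and once with the plain version, both built with the same flags (for example -O2), and make sure the loop's result is actually used so the compiler cannot remove the work.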

If such a loop is crucial to you, you might consider running it on a GPU with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA, and tuning them to your particular hardware).

Also use a recent GCC compiler (the latest release at the time of writing is 4.6.2), because GCC is making a lot of progress in these areas.

(Added in January 2018:)

Both hardware (processors) and compilers have made a lot of progress regarding caches, so using __builtin_prefetch seems less useful today (in 2018). Be sure to benchmark.
