Structure of Arrays vs. Array of Structures - performance difference

2021-12-20 00:00:00 gcc performance c caching c++

I have a class like this:

//Array of Structures
class Unit
{
  public:
    float v;
    float u;
    //And similarly many other variables of float type, up to 10-12 of them
    //(i and t, used in update() below, are assumed to be among them).
    void update()
    {
       v+=u;
       v=v*i*t;
       //And many other equations
    }
};

I create an array of objects of type Unit and call update on each of them.

int NUM_UNITS = 10000;
void ProcessUpdate()
{
  Unit *units = new Unit[NUM_UNITS];
  for(int i = 0; i < NUM_UNITS; i++)
  {
    units[i].update();
  }
}

To speed things up, and possibly let the compiler autovectorize the loop, I converted the AoS to a structure of arrays.

//Structure of Arrays:
class Unit
{
  public:
  Unit(int NUM_UNITS)
  {
    v = new float[NUM_UNITS];
    u = new float[NUM_UNITS]; // u is read in update(), so it needs storage too
  }
  float *v;
  float *u;
  //Many other variables
  void update()
  {
    for(int i = 0; i < NUM_UNITS; i++)
    {
      v[i]+=u[i];
      //Many other equations
    }
  }
};

When the loop fails to autovectorize, I get very poor performance from the structure of arrays. For 50 units, the SoA update is slightly faster than AoS, but from 100 units onwards SoA is slower. At 300 units SoA is almost twice as slow, and at 100K units it is 4x slower than AoS. While cache might be an issue for SoA, I didn't expect the performance difference to be this large. Profiling with cachegrind shows a similar number of misses for both approaches. The size of a Unit object is 48 bytes. L1 cache is 256K, L2 is 1MB and L3 is 8MB. What am I missing here? Is this really a cache issue?

I am using gcc 4.5.2. Compiler options are -O3 -msse4 -ftree-vectorize.

I did another experiment with SoA. Instead of dynamically allocating the arrays, I allocated "v" and "u" at compile time. With 100K units, this is 10x faster than the SoA with dynamically allocated arrays. What's happening here? Why is there such a performance difference between statically and dynamically allocated memory?
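
For reference, the two variants compared in this experiment might look roughly like the sketch below. It is only an illustration: the names kNumUnits, v_static, u_static and DynamicUnits are hypothetical, and only the `v[i] += u[i]` step from the question is shown.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constant: 100K units, as in the experiment described above.
const std::size_t kNumUnits = 100000;

// Variant 1: arrays with static storage, sized at compile time.
static float v_static[kNumUnits];
static float u_static[kNumUnits];

void update_static()
{
    for (std::size_t i = 0; i < kNumUnits; ++i)
    {
        v_static[i] += u_static[i];
    }
}

// Variant 2: arrays allocated on the heap at run time.
struct DynamicUnits
{
    float *v;
    float *u;
    std::size_t n;

    // new float[count]() value-initializes every element to 0.0f.
    explicit DynamicUnits(std::size_t count)
        : v(new float[count]()), u(new float[count]()), n(count) {}
    ~DynamicUnits() { delete[] v; delete[] u; }

    // Copying is disabled so the arrays are never freed twice.
    DynamicUnits(const DynamicUnits &) = delete;
    DynamicUnits &operator=(const DynamicUnits &) = delete;

    void update()
    {
        assert(v != nullptr && u != nullptr);
        for (std::size_t i = 0; i < n; ++i)
        {
            v[i] += u[i];
        }
    }
};
```

One difference visible in the sketch is that the static version works on two named arrays that the compiler knows are distinct objects, while the dynamic version goes through two pointers. That can affect what the optimizer is able to do, but nothing in this thread confirms that it accounts for the 10x gap.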

Recommended answer

Structure of arrays is not cache friendly in this case.

You use both u and v together, but when they live in two different arrays they are not loaded together in one cache line, and the resulting cache misses carry a large performance penalty.
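
To make the layout difference concrete, here is a minimal sketch of the same `v += u` step in both layouts. The type and function names (UnitAoS, UnitsSoA, update_aos, update_soa) are made up for illustration, and std::vector is used instead of the raw `new[]` from the question only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: v and u of the same unit sit next to each other,
// so one cache line fetch typically brings in both.
struct UnitAoS
{
    float v;
    float u;
};

// Structure of Arrays: all v values are contiguous, all u values are
// contiguous, and v[i] and u[i] live in different memory regions.
struct UnitsSoA
{
    std::vector<float> v;
    std::vector<float> u;
    explicit UnitsSoA(std::size_t n) : v(n, 0.0f), u(n, 0.0f) {}
};

void update_aos(std::vector<UnitAoS> &units)
{
    for (UnitAoS &unit : units)
    {
        unit.v += unit.u;
    }
}

void update_soa(UnitsSoA &units)
{
    assert(units.v.size() == units.u.size());
    const std::size_t n = units.v.size();
    float *v = units.v.data();
    const float *u = units.u.data();
    for (std::size_t i = 0; i < n; ++i)
    {
        v[i] += u[i]; // touches two separate streams of memory
    }
}
```

Both functions compute the same result; only the memory access pattern differs. With the 10-12 float members mentioned in the question, the SoA loop would walk 10-12 separate arrays at once, while the AoS loop walks a single array of 48-byte objects.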

_mm_prefetch can be used to make the AoS representation even faster.
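
A rough sketch of what that could look like, assuming an x86 target with SSE (`_mm_prefetch` and `_MM_HINT_T0` come from `<xmmintrin.h>`). The function name update_with_prefetch and the look-ahead distance kAhead are assumptions chosen for illustration, not values taken from this thread; the right distance has to be found by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0 (x86 SSE only)

struct Unit
{
    float v;
    float u;
};

void update_with_prefetch(Unit *units, std::size_t n)
{
    assert(units != nullptr || n == 0);

    // Assumed look-ahead distance in elements; tune by benchmarking.
    const std::size_t kAhead = 16;

    for (std::size_t i = 0; i < n; ++i)
    {
        // Request the element kAhead iterations ahead, staying in bounds.
        if (i + kAhead < n)
        {
            _mm_prefetch(reinterpret_cast<const char *>(&units[i + kAhead]),
                         _MM_HINT_T0);
        }
        units[i].v += units[i].u;
    }
}
```

For a plain sequential walk like this one, the hardware prefetcher often does this job already, so the gain may be small or absent. It is worth timing the loop with and without the prefetch before keeping it.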
