为什么转置 512x512 的矩阵比转置 513x513 的矩阵慢得多?

2021-12-08 00:00:00 performance optimization c++

在对不同大小的方阵进行一些实验后，出现了一个模式.总是，转置 2^n 大小的矩阵比转置 2^n+1 大小的矩阵慢.对于 n 的小值，差别不大.

After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n+1. For small values of n, the difference is not major.

但是在 512 的值上会出现很大的差异.(至少对我而言)

Big differences occur however over a value of 512. (at least for me)

免责声明:我知道由于元素的双重交换，该函数实际上并没有转置矩阵，但这没有区别.

遵循代码:

#define SAMPLES 1000 #define MATSIZE 512 #include <time.h> #include <iostream> int mat[MATSIZE][MATSIZE]; void transpose() { for ( int i = 0 ; i < MATSIZE ; i++ ) for ( int j = 0 ; j < MATSIZE ; j++ ) { int aux = mat[i][j]; mat[i][j] = mat[j][i]; mat[j][i] = aux; } } int main() { //initialize matrix for ( int i = 0 ; i < MATSIZE ; i++ ) for ( int j = 0 ; j < MATSIZE ; j++ ) mat[i][j] = i+j; int t = clock(); for ( int i = 0 ; i < SAMPLES ; i++ ) transpose(); int elapsed = clock() - t; std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES; }

改变 MATSIZE 可以让我们改变大小(废话！).我在ideone上发布了两个版本:

Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone:

大小 512 - 平均 2.46 毫秒 - http://ideone.com/1PV7m
大小 513 - 平均 0.75 毫秒 - http://ideone.com/NShpo

size 512 - average 2.46 ms - http://ideone.com/1PV7m

size 513 - average 0.75 ms - http://ideone.com/NShpo

在我的环境中(MSVS 2010，完全优化)，区别是相似的:

In my environment (MSVS 2010, full optimizations), the difference is similar :

大小 512 - 平均 2.19 毫秒
大小 513 - 平均 0.57 毫秒

size 512 - average 2.19 ms

size 513 - average 0.57 ms

为什么会这样?

推荐答案

解释来自在 C++ 中优化软件，它减少了数据访问和存储在缓存中的方式.

The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

有关术语和详细信息，请参阅有关缓存的维基条目，我将缩小范围放在这里.

For terms and detailed info, see the wiki entry on caching, I'm gonna narrow it down here.

缓存按集合和行组织.一次只使用一个集合，其中包含的任何行都可以使用.一行可以镜像的内存乘以行数就是缓存大小.

A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.

对于一个特定的内存地址，我们可以用公式计算出哪个集合应该镜像它:

For a particular memory address, we can calculate which set should mirror it with the formula:

set = ( address / lineSize ) % numberOfsets

这种公式在理想情况下给出了跨集合的均匀分布，因为每个内存地址都有可能被读取(我说理想情况).

This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).

很明显，可能会发生重叠.如果缓存未命中，则会在缓存中读取内存并替换旧值.请记住，每组都有许多行，其中最近最少使用的行会被新读取的内存覆盖.

It's clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

我会尽量效仿 Agner 的例子:

I'll try to somewhat follow the example from Agner:

假设每组有 4 行，每行占 64 个字节.我们首先尝试读取地址 0x2710，它位于 28 集合中.然后我们还尝试读取地址0x2F00、0x3700、0x3F00 和0x4700.所有这些都属于同一个集合.在读取 0x4700 之前，集合中的所有行都会被占用.读取该内存会驱逐集合中的现有行，该行最初保存 0x2710.问题在于我们读取的地址(在本例中)0x800 分开.这是关键的步伐(同样，对于这个例子).

Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

临界步幅也可以计算:

criticalStride = numberOfSets * lineSize

间隔变量 criticalStride 或多个分开的变量争夺相同的缓存行.

Variables spaced criticalStride or a multiple apart contend for the same cache lines.

这是理论部分.接下来是解释(也是阿格纳，我正在密切关注以避免出错):

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

假设一个 64x64 的矩阵(记住，效果因缓存而异)具有 8kb 缓存，每组 4 行 * 行大小为 64 字节.每行可以包含矩阵中的 8 个元素(64 位 int).

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).

临界步长为 2048 字节，对应于矩阵的 4 行(在内存中是连续的).

The critical stride would be 2048 bytes, which correspond to 4 rows of the matrix (which is continuous in memory).

假设我们正在处理第 28 行.我们正在尝试获取该行的元素并将它们与第 28 列的元素交换.该行的前 8 个元素构成一个缓存行，但它们会去进入第 28 列中的 8 个不同缓存行.请记住，关键步幅是相隔 4 行(一列中的 4 个连续元素).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).

当在列中到达元素 16 时(每组 4 个缓存行和相隔 4 行 = 故障)，ex-0 元素将从缓存中逐出.当我们到达列的末尾时，所有先前的缓存行都将丢失，需要在访问下一个元素时重新加载(整行都被覆盖).

When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).

尺寸不是关键步幅的倍数会破坏这种完美的灾难场景，因为我们不再处理垂直方向上相距关键步幅的元素，因此缓存重新加载的次数大大减少.

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.

另一个免责声明 - 我刚刚理解了解释并希望我能理解它，但我可能弄错了.无论如何，我正在等待 Mysticial 的回复(或确认).:)

Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)

相关文章