我应该如何以及何时将倾斜指针与 cuda API 一起使用?

2022-01-10 00:00:00 cuda c++

我对如何使用 cudaMalloc() 和 cudaMemcpy() 分配和复制线性内存非常了解.但是，当我想使用 CUDA 函数来分配和复制 2D 或 3D 矩阵时，我经常被各种参数弄糊涂，尤其是在处理 2D/3D 数组时总是存在的倾斜指针.该文档很好地提供了一些关于如何使用它们的示例，但它假设我熟悉填充和间距的概念，而我并不熟悉.

I have quite a good understanding about how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially concerning pitched pointers which are always present when dealing with 2D/3D arrays. The documentation is good for providing a couple examples on how to use them but it assumes that I am familiar with the notion of padding and pitch, which I am not.

我通常最终会调整我在文档或网络上其他地方找到的各种示例，但随后的盲目调试非常痛苦，所以我的问题是:

I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:

什么是音高?我该如何使用它?如何在 CUDA 中分配和复制 2D 和 3D 数组?

What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?

推荐答案

这里解释一下CUDA中的pitched pointer和padding.

Here is an explanation about pitched pointer and padding in CUDA.

首先，让我们从非线性内存存在的原因说起.当使用 cudaMalloc 分配内存时，结果就像使用 malloc 分配一样，我们有一个指定大小的连续内存块，我们可以在其中放入任何我们想要的东西.如果我们想分配一个 10000 浮点数的向量，我们只需这样做:

First, lets start with the reason for the existence of non linear memory. When allocating memory with cudaMalloc, the result is like an allocation with malloc, we have a contiguous memory chunk of the size specified and we can put anything we want in it. If we want to allocate a vector of 10000 float, we simply do:

float* myVector; cudaMalloc(&myVector, 10000*sizeof(float));

然后通过经典索引访问 myVector 的第 i 个元素:

and then access ith element of myVector by classic indexing:

float element = myVector[i];

如果我们想访问下一个元素，我们只需:

and if we want to access the next element, we just do:

float next_element = myvector[i+1];

它工作得很好，因为访问第一个元素旁边的元素(出于我不知道并且我现在不希望这样做的原因)很便宜.

It works very fine because accessing an element right next to the first one is (for reasons I am not aware of and I don't wish to be for now) cheap.

当我们将内存用作二维数组时，情况会有所不同.假设我们的 10000 浮点向量实际上是一个 100x100 数组.我们可以使用相同的 cudaMalloc 函数来分配它，如果我们想读取第 i 行，我们这样做:

Things become a little bit different when we use our memory as a 2D array. Lets say our 10000 float vector is in fact a 100x100 array. We can allocate it by using the same cudaMalloc function, and if we want to read the i-th row, we do:

float* myArray; cudaMalloc(&myArray, 10000*sizeof(float)); int row[100]; // number of columns for (int j=0; j<100; ++j) row[j] = myArray[i*100+j];

字对齐
所以我们必须从 myArray+100*i 读取内存到 myArray+101*i-1.它将执行的内存访问操作数取决于该行占用的内存字数.内存字中的字节数取决于实现.为了尽量减少读取单行时的内存访问次数，我们必须确保从单词的开头开始行，因此我们必须为每一行填充内存，直到新行的开头.

Word alignment

So we have to read memory from myArray+100*i to myArray+101*i-1. The number of memory access operation it will take depends on the number of memory words this row takes. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must assure that we start the row on the start of a word, hence we must pad the memory for every row until the start of a new one.

填充数组的另一个原因是 CUDA 中的存储库机制，涉及共享内存访问.当数组在共享内存中时，它被分成几个内存库.两个 CUDA 线程可以同时访问它，前提是它们不访问属于同一内存库的内存.由于我们通常希望并行处理每一行，因此我们可以通过将每一行填充到新银行的开头来确保可以模拟访问它.

Another reason for padding arrays is the bank mechanism in CUDA, concerning shared memory access. When the array is in the shared memory, it is split into several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same memory bank. Since we usually want to treat each row in parallel, we can ensure that we can access it simulateously by padding each row to the start of a new bank.

现在，我们将使用 cudaMallocPitch，而不是使用 cudaMalloc 分配 2D 数组:

Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:

size_t pitch; float* myArray; cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100); // width in bytes by height

请注意，这里的音高是函数的返回值:cudaMallocPitch 检查它在您的系统上应该是什么并返回适当的值.cudaMallocPitch 的作用如下:

Note that the pitch here is the return value of the function: cudaMallocPitch checks what it should be on your system and returns the appropriate value. What cudaMallocPitch does is the following:

分配第一行.
检查分配的字节数是否使其正确对齐.例如，它是 128 的倍数.
如果没有，则分配更多字节以达到 128 的下一个倍数.间距是为单行分配的字节数，包括额外字节(填充字节).
重复每一行.

最后，我们通常分配了比需要更多的内存，因为现在每一行都是音高的大小，而不是 w*sizeof(float) 的大小.

At the end, we have typically allocated more memory than necessary because each row is now the size of pitch, and not the size of w*sizeof(float).

但是现在，当我们想要访问列中的一个元素时，我们必须这样做:

But now, when we want to access an element in a column, we must do:

float* row_start = (float*)((char*)myArray + row * pitch); float column_element = row_start[column];

两个连续列之间的字节偏移量不能再从我们的数组大小中推断出来，这就是为什么我们要保持 cudaMallocPitch 返回的音高.而且由于音高是填充大小的倍数(通常是字大小和库大小中的最大值)，所以效果很好.耶.

The offset in bytes between two successive columns can no more be deduced from the size of our array, that is why we want to keep the pitch returned by cudaMallocPitch. And since pitch is a multiple of the padding size (typically, the biggest of word size and bank size), it works great. Yay.

现在我们知道如何创建和访问由 cudaMallocPitch 创建的数组中的单个元素，我们可能希望将其整个部分复制到其他内存或从其他内存中复制，无论是否为线性.

Now that we know how to create and access a single element in an array created by cudaMallocPitch, we might want to copy whole part of it to and from other memory, linear or not.

假设我们想将我们的数组复制到使用 malloc 分配在主机上的 100x100 数组中:

Lets say we want to copy our array in a 100x100 array allocated on our host with malloc:

float* host_memory = (float*)malloc(100*100*sizeof(float));

如果我们使用 cudaMemcpy，我们将复制所有用 cudaMallocPitch 分配的内存，包括每行之间的填充字节.为了避免填充内存，我们必须做的是逐行复制.我们可以手动完成:

If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padded bytes between each rows. What we must do to avoid padding memory is copying each row one by one. We can do it manually:

for (size_t i=0; i<100; ++i) { cudaMemcpy(host_memory[i*100], myArray[pitch*i], 100*sizeof(float), cudaMemcpyDeviceToHost); }

或者我们可以告诉 CUDA API，我们只想要我们分配的内存中的有用内存，以便 its 方便，所以如果它可以自动处理自己的混乱，那就太好了确实，谢谢.这里进入 cudaMemcpy2D:

Or we can tell the CUDA API that we want only the useful memory from the memory we allocated with padding bytes for its convenience so if it could deal with its own mess automatically it would be very nice indeed, thank you. And here enters cudaMemcpy2D:

cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/, myArray, pitch/*CUDA pitch*/, 100*sizeof(float)/*width in bytes*/, 100/*heigth*/, cudaMemcpyDeviceToHost);

现在复制将自动完成.它将复制宽度(此处:100xsizeof(float))、高度时间(此处:100)指定的字节数，每次跳转到下一行时跳过 pitch 个字节.请注意，我们仍然必须为目标内存提供音高，因为它也可以被填充.这里不是，所以间距等于非填充数组的间距:它是一行的大小.另请注意，memcpy 函数中的宽度参数以字节表示，但高度参数以元素数量表示.那是因为复制的方式，有点像我在上面写的手动复制:宽度是每个副本沿一行的大小(在内存中连续的元素)，高度是这个操作必须的次数得以实现.(作为物理学家，这些单位的不一致让我非常恼火.)

Now the copy will be done automatically. It will copy the number of bytes specified in width (here: 100xsizeof(float)), height time (here: 100), skipping pitch bytes every time it jumps to a next row. Note that we must still provide the pitch for the destination memory because it could be padded, too. Here it is not, so the pitch is equal to the pitch of a non-padded array: it is the size of a row. Note also that the width parameter in the memcpy function is expressed in bytes, but the height parameter is expressed in number of elements. That is because of the way the copy is done, someway like I wrote the manual copy above: the width is the size of each copy along a row (elements that are contiguous in memory) and the height is the number of times this operation must be accomplished. (These inconsistencies in units, as a physicist, annoys me very much.)

3D 数组实际上与 2D 数组没有什么不同，没有包含额外的填充.3D 数组只是填充行的 2D classical 数组.这就是为什么在分配 3D 数组时，您只会得到一个音高，即沿行的连续点之间的字节数差异.如果您想访问沿深度维度的连续点，您可以安全地将间距乘以列数，得到 slicePitch.

3D arrays are no different that 2D arrays actually, there is no additional padding included. A 3D array is just a 2D classical array of padded rows. That is why when allocating a 3D array, you only get one pitch that is the difference in bytes count between to successive points along a row. If you want to access to successive points along the depth dimension, you can safely multiply the pitch by the number of columns, which gives you the slicePitch.

用于访问 3D 内存的 CUDA API 与用于访问 2D 内存的 API 略有不同，但思路相同:

The CUDA API for accessing 3D memory is slightly different than the one for 2D memory, but the idea is the same :

使用 cudaMalloc3D 时，您会收到一个音高值，您必须小心保留该值，以便后续访问内存.
复制 3D 内存块时，您不能使用 cudaMemcpy，除非您复制的是单行.您必须使用 CUDA 实用程序提供的任何其他类型的复制实用程序，以考虑音高.
当您将数据复制到/从线性内存中时，您必须为指针提供一个间距，即使它无关紧要:这个间距是行的大小，以字节表示.
大小参数以字节表示行大小，以元素数表示列和深度维度.

相关文章