CUDA: how to choose grid, block, and thread sizes and parallelize a non-square matrix calculation
I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops, specifically how to set up dimBlock and dimGrid to make this run faster. I know this looks like the vector add example in the SDK, but that example is only for square matrices, and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.
__global__ void mAdd(float* A, float* B, float* C)
{
    for(int i = 0; i < 128; i++)
    {
        for(int j = 0; j < 1024; j++)
        {
            C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
        }
    }
}
This code is part of a larger loop and is the simplest portion of the code, so I decided to try to parallelize it and learn CUDA at the same time. I have read the guides but still do not understand how to choose the proper number of grids, blocks, and threads and use them effectively.
Recommended answer
As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.
The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation, that is, one where the same, largely independent, operation must be performed many times, and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.
In your array addition example, the data parallel operation is
C[k] = A[k] + B[k];
for all k from 0 up to (but not including) 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:
__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k < n)
        C[k] = A[k] + B[k];
}
[disclaimer: code written in browser, not tested, use at own risk]
Here, the inner and outer loops from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:
const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work; exact here because n is a multiple of blocksize
mAdd<<<nblocks, blocksize>>>(A, B, C, n); // A, B, C must be device pointers
Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size were not an exact multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set; the k < n check in the kernel then keeps the surplus threads in the last block from reading or writing past the end of the arrays.
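As a minimal sketch of that rounding-up step, the block count can be computed with integer ceiling division and used in a complete host-side call. The helper names blocksFor and addOnDevice are hypothetical and introduced only for illustration, the const qualifiers on the kernel inputs are my addition, and error handling is reduced to a single returned status, so treat this as an outline rather than a finished implementation.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// One thread per element, as in the kernel above.
__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k < n)                  // surplus threads in the last block do nothing
        C[k] = A[k] + B[k];
}

// Hypothetical helper: number of blocks needed to cover n elements,
// rounded up when n is not a multiple of blocksize.
inline int blocksFor(int n, int blocksize)
{
    return (n + blocksize - 1) / blocksize;
}

// Hypothetical host wrapper: copies hA and hB to the device, launches mAdd,
// and copies the result back into hC. Allocation and input-copy failures
// are not checked individually in this sketch.
cudaError_t addOnDevice(const float* hA, const float* hB, float* hC, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    const int blocksize = 512;  // assumed starting point, not a tuned value
    mAdd<<<blocksFor(n, blocksize), blocksize>>>(dA, dB, dC, n);

    // Launch-configuration errors surface through cudaGetLastError();
    // the blocking copy back also waits for the kernel to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

For the 128 x 1024 case, blocksFor(128 * 1024, 512) gives the same 256 blocks as the plain division, while a size such as 1000 elements gets 2 blocks instead of 1.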
All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web that you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.
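One of the glossed-over aspects is the two-dimensional dimBlock and dimGrid form that the question asks about. The sketch below is one possible way to map a rows x cols matrix onto a 2D grid; the names mAdd2D, gridFor, and launchMAdd2D and the 32 x 16 block shape are assumptions made for illustration, not something taken from the SDK example or tuned for any particular GPU.

```cuda
#include <cuda_runtime.h>

// 2D variant: one thread per matrix element, addressed by (row, col).
__global__ void mAdd2D(const float* A, const float* B, float* C,
                       int rows, int cols)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;  // x runs along a row
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row < rows && col < cols) {
        int k = row * cols + col;   // same row-major offset as i * 1024 + j
        C[k] = A[k] + B[k];
    }
}

// Hypothetical helper: grid dimensions that cover a rows x cols matrix
// for a given block shape, rounding up in each direction.
inline dim3 gridFor(int rows, int cols, dim3 block)
{
    return dim3((cols + block.x - 1) / block.x,
                (rows + block.y - 1) / block.y);
}

// Hypothetical launcher; dA, dB, and dC must already be device pointers
// holding rows * cols floats each.
void launchMAdd2D(const float* dA, const float* dB, float* dC,
                  int rows, int cols)
{
    dim3 dimBlock(32, 16);          // 512 threads per block (assumed choice)
    dim3 dimGrid = gridFor(rows, cols, dimBlock);
    mAdd2D<<<dimGrid, dimBlock>>>(dA, dB, dC, rows, cols);
}
```

For a 128 x 1024 matrix this yields a 32 x 8 grid, again 256 blocks of 512 threads. Because rows and cols are handled separately, nothing in the mapping requires the matrix to be square. For a plain element-wise addition the 1D kernel does the same work; the 2D form mainly helps when the row and column indices are needed inside the kernel.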