printf in a CUDA __global__ function
I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:
__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    float sum = 0;
    for (int k = 0; k < Ad.width; ++k) {
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }
    Xd.elements[ty * Xd.width + tx] = sum;
}
I would love to know if Ad and Bd are what I think they are, and to see whether that function is actually being called.
Accepted Answer
Edit
To avoid misleading people, as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.
End of edit
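For reference, a minimal sketch of device-side printf (the kernel name is made up for illustration); compile it for compute capability 2.0 or higher, e.g. with nvcc -arch=sm_20:

#include <cstdio>

__global__ void HelloKernel() {
    // Device-side printf is supported on compute capability 2.0+.
    printf("block (%d,%d) thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    HelloKernel<<<dim3(2, 1), dim3(4, 1)>>>();
    // The printf output buffer is flushed at synchronisation points,
    // so nothing appears until we synchronise (or the program exits).
    cudaDeviceSynchronize();
    return 0;
}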
You have several options:
- Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
- Use cuprintf, which is available for registered developers (sign up here)
- Manually copy the data you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise); a sketch of this follows the list
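For the last option, here is a minimal sketch of the idea; the kernel, buffer name and size are purely illustrative:

#include <cstdio>

__global__ void MyKernel(float *dbg) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Write whatever intermediate value you want to inspect.
    dbg[tid] = static_cast<float>(tid) * 2.0f;
}

int main() {
    const int n = 64;
    float *d_dbg, h_dbg[n];
    cudaMalloc(&d_dbg, n * sizeof(float));

    MyKernel<<<2, 32>>>(d_dbg);
    cudaDeviceSynchronize();                 // make sure the kernel has finished

    cudaMemcpy(h_dbg, d_dbg, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("dbg[%d] = %f\n", i, h_dbg[i]);

    cudaFree(d_dbg);
    return 0;
}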
Regarding your code snippet:
- Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer); right now you will have no problem, but if the function signature gets very large then you may hit the 256-byte limit (see the sketch below)
- You have inefficient reads from Ad: you will get a 32-byte memory transaction for each read into Melement; consider using shared memory as a staging area (cf. the transposeNew sample in the SDK), also shown in the sketch below
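To illustrate both points, here is a sketch that passes the Matrix structs by device pointer and stages TILE x TILE sub-matrices in shared memory. The Matrix layout is assumed from your kernel, the names MatrixMulTiled, MatrixToDevice and TILE are made up for the example, and it assumes square matrices whose dimensions are multiples of TILE:

#include <cstdio>

#define TILE 16

struct Matrix {
    int width;
    int height;
    float *elements;   // device pointer when the struct is used on the GPU
};

// The Matrix structs arrive via device pointers, and each block stages
// TILE x TILE sub-tiles of Ad and Bd in shared memory so each element is
// read from global memory once per block instead of once per thread.
__global__ void MatrixMulTiled(const Matrix *Ad, const Matrix *Bd, Matrix *Xd) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float sum = 0.0f;
    for (int t = 0; t < Ad->width / TILE; ++t) {
        As[ty][tx] = Ad->elements[row * Ad->width + t * TILE + tx];
        Bs[ty][tx] = Bd->elements[(t * TILE + ty) * Bd->width + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    Xd->elements[row * Xd->width + col] = sum;
}

// Allocate the element buffer on the device, fill it if data is given,
// and copy the struct itself over so the kernel can take it by pointer.
Matrix *MatrixToDevice(int width, int height, const float *host_data) {
    Matrix m;
    m.width = width;
    m.height = height;
    cudaMalloc(&m.elements, width * height * sizeof(float));
    if (host_data)
        cudaMemcpy(m.elements, host_data,
                   width * height * sizeof(float), cudaMemcpyHostToDevice);
    Matrix *dm;
    cudaMalloc(&dm, sizeof(Matrix));
    cudaMemcpy(dm, &m, sizeof(Matrix), cudaMemcpyHostToDevice);
    return dm;
}

int main() {
    const int N = 2 * TILE;
    float a[N * N], b[N * N], x[N * N];
    for (int i = 0; i < N * N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    Matrix *dA = MatrixToDevice(N, N, a);
    Matrix *dB = MatrixToDevice(N, N, b);
    Matrix *dX = MatrixToDevice(N, N, nullptr);

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    MatrixMulTiled<<<grid, block>>>(dA, dB, dX);
    cudaDeviceSynchronize();

    // The struct on the device holds the device element pointer; fetch the
    // struct back first, then copy the result data.
    Matrix hX;
    cudaMemcpy(&hX, dX, sizeof(Matrix), cudaMemcpyDeviceToHost);
    cudaMemcpy(x, hX.elements, N * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f (expected %f)\n", x[0], 2.0f * N);
    return 0;
}

With the tiles in shared memory, consecutive threads in a warp read consecutive addresses of Ad and Bd, so the global loads coalesce and each value is reused TILE times from on-chip memory.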