CUDA kernel with function pointer and variadic templates
I am trying to design a CUDA framework which would accept user functions and forward them to the kernel through device function pointers. CUDA can work with variadic templates (-std=c++11), and so far so good.
However, I hit a problem when the kernel calls the device function pointer. Apparently the kernel runs with no problem, but the GPU usage is 0%. If I simply replace the callback pointer with the actual function, then GPU usage is 99%. The code here is very simple, and the large loop range is simply to make things measurable. I measured the GPU status with:
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -lms 100 -f out.txt
IIRC, the user function needs to be in the same file unit as the kernel (#included perhaps) in order for nvcc to succeed. The func_d is right there in the source and it compiles and runs fine, besides not working with the function pointer (which is the whole point of this design).
My question is: Why is the kernel with the callback device function pointer not working?
Note that, when I printf both the callback and func_d addresses, they are the same, as in this sample output:
size of Args = 1
callback() address = 4024b0
func_d() address = 4024b0
Another weird thing is, if one uncomments the callback() call in kernel() then GPU usage is back to 0%, even with the func_d() call still in there... The func_d version takes about 4 seconds to run, whereas the callback version takes nothing (well, ~0.1sec).
System specs and compilation command are in the head of the code below.
Thanks!
// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014

#include <stdio.h>

__device__
void func_d(double* vol)
{
    *vol += 5.4321f;
}

// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
    double val0 = 1.2345f;

    // // does not use gpu (0% gpu utilization)
    // for ( int i = 0; i < 1000000; i++ ) {
    //     callback( &val0 );
    // }

    // uses gpu (99% gpu utilization)
    for ( int i = 0; i < 10000000; i++ ) {
        func_d( &val0 );
    }
}

// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
    // get user kernel number of arguments.
    constexpr int I = sizeof...(Types);
    printf("size of Args = %d\n",I);

    printf("callback() address = %x\n",callback);
    printf("func_d() address = %x\n",func_d);

    dim3 nblocks = 100;
    int nthread = 100;
    kernel<Types...><<<nblocks,nthread>>>( callback );
}

__host__
int main(int argc, char** argv)
{
    host_func(func_d);
}
Solution
"My question is: Why is the kernel with the callback device function pointer not working?"
There are probably several issues to address, but the simplest answer is that it is illegal to take the address of device entities in host code. This is true for device variables as well as device functions. The compiler will let you take the address of those entities, but the resulting address is garbage: it is not usable on either the host or the device. If you attempt to use it anyway, you'll get undefined behavior on the device, which will usually bring your kernel to a halt.
Host addresses may be observed in host code. Device addresses may be observed in device code. Any other behavior requires API intervention.
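As one illustration of such API intervention, the sketch below stores the device address of func_d in a __device__ variable and copies that value to the host with cudaMemcpyFromSymbol, so that it can then be passed to a kernel as an ordinary parameter. The names callback_t, d_func_d_ptr, and d_out are hypothetical and not part of the original code, and the non-template kernel is a simplification; treat this as a minimal sketch rather than the only way to do it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical alias for the callback signature used in the question.
typedef void (*callback_t)(double*);

__device__ void func_d(double* vol)
{
    *vol += 5.4321;
}

// Device-side variable initialized with the *device* address of func_d.
// (Hypothetical name; the original code has no such variable.)
__device__ callback_t d_func_d_ptr = func_d;

__global__ void kernel(callback_t callback, double* d_out)
{
    double val0 = 1.2345;
    callback(&val0);  // valid: the pointer value originated on the device
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_out = val0;
}

int main()
{
    // Copy the pointer value (not the function) from the device symbol.
    callback_t h_callback = nullptr;
    cudaMemcpyFromSymbol(&h_callback, d_func_d_ptr, sizeof(callback_t));

    double* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(double));

    kernel<<<1, 32>>>(h_callback, d_out);

    double h_out = 0.0;
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("result = %f\n", h_out);

    cudaFree(d_out);
    return 0;
}
```

The host never dereferences h_callback; it only carries the value from one device context to another, which is the same idea as the setup-kernel approach described later in this answer.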
You appear to be using the nvidia-smi utilization query as a measure of whether or not things are running correctly. I would suggest doing proper CUDA error checking instead, and you may also wish to run your code with cuda-memcheck.
"Why then does the address of func_d match the address of callback?" Because you are taking both addresses in host code, and both addresses are garbage. To convince yourself of this, add a line something like this at the very end of your kernel:
if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d() address = %x\n",func_d);
and you will see that it prints out something different from what is being printed on the host.
"What about the device utilization?" As soon as the device encounters an error, the kernel terminates, and utilization goes to zero. Hopefully this will explain this statement for you: "Another weird thing is, if one uncomments the callback() call in kernel() then GPU usage is back to 0%, even with the func_d() call still in there..."
"How can I fix this?" I don't know of a great way to fix this. If you have a limited number of CUDA functions known at compile-time that you want the user to be able to select from, then the appropriate thing is probably to just create an appropriate index, and use that to select the function.
If you really want to, you can run a preliminary/setup kernel, which will take the address of the functions you care about. You can then pass these addresses back to host code and use them as parameters in subsequent kernel calls, and this should allow your mechanism to work. But I don't see how it prevents the need to index through a set of pre-defined functions known at compile-time.
If the direction you are headed in is that you want the user to be able to provide user-defined functions at runtime, I think you will find this quite difficult to do at the moment with the CUDA runtime API (I suspect this is likely to change in the future). I provided a rather contorted mechanism to try to do this here (read the whole question and answer; talonmies' answer there is informative as well). If, on the other hand, you are willing to use the CUDA driver API, then it should be possible, although somewhat involved, since this is exactly what is done in a very elegant fashion in PyCUDA, for example.
In the future, please indent your code.
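Returning to the index-based selection mentioned above, here is a minimal sketch of how it might look. Only func_d comes from the question; the FuncId enum, the second function func_scale_d, the dispatch helper, and d_out are hypothetical names introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical index identifying each function known at compile time.
enum FuncId { FUNC_ADD = 0, FUNC_SCALE = 1 };

__device__ void func_d(double* vol)       { *vol += 5.4321; }
__device__ void func_scale_d(double* vol) { *vol *= 2.0; }  // hypothetical

// Select the device function inside device code, so no device
// function address is ever taken on the host.
__device__ void dispatch(FuncId id, double* vol)
{
    switch (id) {
        case FUNC_ADD:   func_d(vol);       break;
        case FUNC_SCALE: func_scale_d(vol); break;
    }
}

__global__ void kernel(FuncId id, double* d_out)
{
    double val0 = 1.2345;
    dispatch(id, &val0);
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_out = val0;
}

int main()
{
    double* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(double));

    // The host passes a plain integer-like index, not a pointer.
    kernel<<<100, 100>>>(FUNC_ADD, d_out);

    double h_out = 0.0;
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("result = %f\n", h_out);

    cudaFree(d_out);
    return 0;
}
```

The trade-off is the one already noted in the answer: every selectable function has to be known when the kernel is compiled.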
Here's a fully worked example, demonstrating a few of the ideas above. In particular, I am showing, in a rather crude fashion, that the func_d address can be taken in device code, then passed back to the host, then used as a future kernel parameter to successfully select/call that device function.
$ cat t595.cu
// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014

#include <stdio.h>

__device__
void func_d(double* vol)
{
    if ((!threadIdx.x) && (!blockIdx.x)) printf("value = %f\n", *vol);
    *vol += 5.4321f;
}

// setup kernel: takes the address of func_d in device code
template <typename... Types>
__global__ void setup_kernel(void (**my_callback)(Types*...)){
    *my_callback = func_d;
}

// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
    double val0 = 1.2345f;

    // // does not use gpu (0% gpu utilization)
    // for ( int i = 0; i < 1000000; i++ ) {
    callback( &val0 );
    // }

    val0 = 0.0f;
    // uses gpu (99% gpu utilization)
    // for ( int i = 0; i < 10000000; i++ ) {
    func_d( &val0 );
    // }
    if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d() address = %x\n",func_d);
}

// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
    // get user kernel number of arguments.
    constexpr int I = sizeof...(Types);
    printf("size of Args = %d\n",I);

    printf("callback() address = %x\n",callback);
    printf("func_d() address = %x\n",func_d);

    dim3 nblocks = 100;
    int nthread = 100;
    unsigned long long *d_callback, h_callback;
    cudaMalloc(&d_callback, sizeof(unsigned long long));
    setup_kernel<<<1,1>>>((void (**)(Types*...))d_callback);
    cudaMemcpy(&h_callback, d_callback, sizeof(unsigned long long), cudaMemcpyDeviceToHost);
    kernel<Types...><<<nblocks,nthread>>>( (void (*)(Types*...))h_callback );
    cudaDeviceSynchronize();
}

__host__
int main(int argc, char** argv)
{
    host_func(func_d);
}
$ nvcc -std=c++11 -arch=sm_20 -o t595 t595.cu
$ cuda-memcheck ./t595
========= CUDA-MEMCHECK
size of Args = 1
callback() address = 4025dd
func_d() address = 4025dd
value = 1.234500
value = 0.000000
in-kernel func_d() address = 4
========= ERROR SUMMARY: 0 errors
$
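The example above relies on cuda-memcheck to report problems. As a complement, the "proper CUDA error checking" suggested earlier might be sketched as follows. The CUDA_CHECK macro and d_out are hypothetical names, not part of the original code or of the CUDA API; only the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString, and so on) are real.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__device__ void func_d(double* vol)
{
    *vol += 5.4321;
}

__global__ void kernel(double* d_out)
{
    double val0 = 1.2345;
    func_d(&val0);
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_out = val0;
}

int main()
{
    double* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, sizeof(double)));

    kernel<<<100, 100>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    double h_out = 0.0;
    CUDA_CHECK(cudaMemcpy(&h_out, d_out, sizeof(double),
                          cudaMemcpyDeviceToHost));
    printf("val0 = %f\n", h_out);

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

With checks like these in place, a kernel that faults on an invalid function pointer should surface as a reported error after the synchronize call, rather than having to be inferred from a utilization reading.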