CUDA kernel as a member function of a class

2022-01-10 00:00:00 windows cuda c++

I am using CUDA 5.0 and a Compute Capability 2.1 card.

The question is quite straightforward: Can a kernel be part of a class? For example:

class Foo
{
private:
 //...
public:
 __global__ void kernel();
};

__global__ void Foo::kernel()
{
 //implementation here
}

If not, is the solution to make a wrapper function that is a member of the class and calls the kernel internally?

And if so, will it have access to the private attributes, as a normal member function would?

(I'm not just trying it to see what happens, because my project has several other errors right now, and I also think this makes a good reference question. I found it difficult to find references for using CUDA with C++. Examples of basic functionality can be found, but not strategies for structured code.)

Recommended answer

Let me leave CUDA dynamic parallelism out of the discussion for the moment (i.e. assume compute capability 3.0 or earlier).

Remember that __global__ is used for CUDA functions that will (only) be called from the host (but execute on the device). If you instantiate this object on the device, it won't work. Furthermore, for the member function to have access to device-accessible private data, the object would have to be instantiated on the device.

So you could have a kernel invocation (i.e. mykernel<<<blocks,threads>>>(...);) embedded in a host object member function, but the kernel definition (i.e. the function definition with the __global__ decorator) would normally precede the object definition in your source code. As stated already, such a methodology could not be used for an object instantiated on the device. The kernel would also not have access to ordinary private data defined elsewhere in the object. (It may be possible to come up with a scheme for a host-only object that creates device data, using pointers to global memory, which would then be accessible on the device, but such a scheme seems quite convoluted to me at first glance.)
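A minimal sketch of the wrapper approach described above, under some assumptions: the names Foo, scale_kernel, upload, download and scale are hypothetical and not from the original question, and error checking is omitted for brevity. As far as I know, nvcc rejects __global__ on member functions (the CUDA C++ Programming Guide lists this as a restriction), so the kernel here is a free function defined before the class, and the host-only object hands its private device pointer to the kernel as an ordinary argument.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Free-function kernel, defined before the class that launches it.
// It cannot see Foo's private members, so everything it needs is passed in.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical host-only class that owns a buffer in device global memory.
class Foo
{
private:
    float *d_data;  // device pointer; only dereferenced on the device
    int n;

    // Declared but not defined: copying would free the buffer twice.
    Foo(const Foo &);
    Foo &operator=(const Foo &);

public:
    explicit Foo(int count) : d_data(NULL), n(count)
    {
        cudaMalloc((void **)&d_data, n * sizeof(float));
    }

    ~Foo() { cudaFree(d_data); }

    void upload(const float *h)
    {
        cudaMemcpy(d_data, h, n * sizeof(float), cudaMemcpyHostToDevice);
    }

    void download(float *h) const
    {
        cudaMemcpy(h, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    }

    // Wrapper member function: the kernel launch happens here, on the host.
    void scale(float factor)
    {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(d_data, n, factor);
        cudaDeviceSynchronize();
    }
};

int main()
{
    float h[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    Foo foo(4);
    foo.upload(h);
    foo.scale(2.0f);   // each element is doubled on the device
    foo.download(h);
    printf("%g %g %g %g\n", h[0], h[1], h[2], h[3]);
    return 0;
}
```

The design choice is that the class keeps ownership and the public interface, while the kernel stays a plain function that only sees what the wrapper passes to it.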

Normally, device-usable member functions would be preceded by the __device__ decorator. In this case, all the code in the device member function executes within the thread that called it.
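To illustrate, here is a small sketch with a hypothetical class Point and kernel norm_kernel (neither is from the original question; error checking is omitted). The member function is marked __host__ __device__ so the same code can be called from either side. It reads private data, and on the device it runs inside whichever thread calls it. The object is passed to the kernel by value, which I believe is fine for a simple class like this one with no pointers or virtual functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical class with a member function usable on host and device.
class Point
{
private:
    float x, y;

public:
    __host__ __device__ Point(float x_, float y_) : x(x_), y(y_) {}

    // Runs in the calling thread and can read the private members.
    __host__ __device__ float norm2() const { return x * x + y * y; }
};

// The kernel itself is still a free function; the object arrives as a
// by-value kernel argument, so the device works on its own copy.
__global__ void norm_kernel(Point p, float *out)
{
    *out = p.norm2();
}

int main()
{
    Point p(3.0f, 4.0f);

    float *d_out = NULL;
    cudaMalloc((void **)&d_out, sizeof(float));

    norm_kernel<<<1, 1>>>(p, d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Same member function, evaluated once on the device and once on the host.
    printf("device: %g, host: %g\n", h_out, p.norm2());
    return 0;
}
```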

This question gives an example (in my edited answer) of a C++ object with a member function callable from both the host and the device, with appropriate data copying between host and device objects.
