cudaMalloc of a structure and an element of the same structure
I would like to know what happens on the device (memory-wise) when I allocate a structure and then allocate(?) and copy a pointer element of the same structure.
Do I need to cudaMalloc the element *a again?
Example code:
typedef struct {
    int *a;
    ...
} StructA;
int main()
{
    int row, col, numS = 10; // defined at runtime
    StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));
    int *h_A = d_a->a;
    cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
    cudaMalloc( &(d_A->a), row*col*sizeof(int) ); // no (void**) needed?
    cudaMemcpy( d_A->a, h_A, row*col*sizeof(int), cudaMemcpyHostToDevice );
    kernel<<<grid, block>>>(d_A); // Passing pointer to StructA in device
    ...
}
Kernel definition:
__global__ kernel(StructA *d_A)
{
    d_A->a = ...;
    ...
}
This question is related to two of my earlier questions.
Recommended answer
I would suggest that you put some effort into compiling and running your code with proper CUDA error checking. Learning to interpret compiler output and runtime output will make you a better, smarter, more efficient coder. I also suggest reviewing the writeup I previously pointed you to. It deals with this exact topic and includes linked worked examples. This question is a duplicate of that one.
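As a rough illustration of what CUDA error checking can look like, here is a minimal sketch. The macro name CUDA_CHECK and the empty kernel noop are hypothetical and not taken from the question; the runtime calls themselves (cudaGetErrorString, cudaGetLastError, cudaDeviceSynchronize) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file/line and exit if a runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",         \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical placeholder kernel, present only so there is a launch to check.
__global__ void noop() {}

int main()
{
    int *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 10 * sizeof(int)));

    noop<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // reports launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_buf));
    printf("no CUDA errors reported\n");
    return 0;
}
```

Wrapping every runtime call this way means that a failure is reported at the line where it occurred, instead of surfacing later as a confusing error from an unrelated call.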
There are various errors:
StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));
The above line of code creates an allocation in host memory sized for numS structures of type StructA and sets the pointer d_A to point to the start of that allocation. Nothing wrong so far.
cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
The above line of code creates an allocation in device memory of numS * sizeof(StructA) bytes and sets the pointer d_A to point to the start of that allocation. This effectively wipes out the previous pointer value. (The previous host allocation still exists somewhere, but you can no longer access it. It's basically lost.) Surely that was not your intent.
int *h_A = d_a->a;
Now that d_A (I assume you meant d_A, not d_a) has been assigned as a device memory pointer, the -> operation will dereference that pointer to locate the element a. This is illegal in host code and will cause a runtime error (typically a seg fault).
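When host code does need a member of a structure that lives in device memory, one option is to copy the structure into a host-side copy first and read the member there. The following is a minimal sketch under some assumptions: StructA is given an extra int member foo for illustration, the helper read_foo_from_device and the kernel set_foo are hypothetical names, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef struct {
    int *a;
    int foo;
} StructA;

// Runs on the device, where dereferencing d_A is legal.
__global__ void set_foo(StructA *d_A, int value)
{
    d_A->foo = value;
}

// Hypothetical helper: fetch foo without dereferencing d_A on the host.
int read_foo_from_device(const StructA *d_A)
{
    StructA h_copy;  // host-side copy of the device structure
    cudaMemcpy(&h_copy, d_A, sizeof(StructA), cudaMemcpyDeviceToHost);
    return h_copy.foo;  // note: h_copy.a would still hold a device address
}

int main()
{
    StructA *d_A = NULL;
    cudaMalloc((void **)&d_A, sizeof(StructA));
    cudaMemset(d_A, 0, sizeof(StructA));

    set_foo<<<1, 1>>>(d_A, 42);

    // This cudaMemcpy runs on the default stream, so it waits for the kernel.
    printf("foo = %d\n", read_foo_from_device(d_A));

    cudaFree(d_A);
    return 0;
}
```

The same caution applies to the copied pointer member: h_copy.a is a device address and can only be passed to CUDA API calls or kernels, not dereferenced on the host.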
cudaMalloc( &(d_A->a), row*col*sizeof(int) );
This line of code has a similar issue. We cannot cudaMalloc a pointer that lives in device memory. cudaMalloc creates pointers that live in host memory but reference a location in device memory. Here &(d_A->a) refers to a location in device memory, so cudaMalloc would have to go through a device pointer from host code to store its result, which is illegal.
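One way to respect this rule is to let cudaMalloc write into a pointer that lives in host memory, and then ship that pointer value to the device as part of a host-side staging copy of the structure. This is only a sketch of that idea, not the only way to do it: StructA is given an extra int member foo for illustration, the names make_device_struct, staging, and read_element are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef struct {
    int *a;
    int foo;
} StructA;

// Reads one element through the embedded pointer, on the device.
__global__ void read_element(const StructA *d_A, int idx, int *out)
{
    *out = d_A->a[idx];
}

// Hypothetical helper: build a device StructA whose 'a' points at a
// device copy of h_data (n ints).
StructA *make_device_struct(const int *h_data, int n)
{
    StructA staging;  // lives in host memory
    staging.foo = n;
    // cudaMalloc writes the device address into a host-resident pointer.
    cudaMalloc((void **)&staging.a, n * sizeof(int));
    cudaMemcpy(staging.a, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    StructA *d_A = NULL;
    cudaMalloc((void **)&d_A, sizeof(StructA));
    // Copy the whole structure, including the embedded device pointer value.
    cudaMemcpy(d_A, &staging, sizeof(StructA), cudaMemcpyHostToDevice);
    return d_A;
}

int main()
{
    int h_data[10];
    for (int i = 0; i < 10; i++) h_data[i] = i;
    StructA *d_A = make_device_struct(h_data, 10);

    int *d_out = NULL, h_out = -1;
    cudaMalloc((void **)&d_out, sizeof(int));
    read_element<<<1, 1>>>(d_A, 2, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("The value is %d\n", h_out);

    // Cleanup: fetch the structure back to recover the embedded pointer.
    StructA h_copy;
    cudaMemcpy(&h_copy, d_A, sizeof(StructA), cudaMemcpyDeviceToHost);
    cudaFree(h_copy.a);
    cudaFree(d_A);
    cudaFree(d_out);
    return 0;
}
```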
Correct code would look something like this:
$ cat t363.cu
#include <stdio.h>
typedef struct {
    int *a;
    int foo;
} StructA;
__global__ void kernel(StructA *data){
    printf("The value is %d\n", *(data->a + 2));
}
int main()
{
    int numS = 1; // defined at runtime
    // allocate host memory for the structure storage
    StructA *h_A = (StructA*)malloc(numS * sizeof(StructA));
    // allocate host memory for the storage pointed to by the embedded pointer
    h_A->a = (int *)malloc(10*sizeof(int));
    // initialize data pointed to by the embedded pointer
    for (int i = 0; i < 10; i++) *(h_A->a+i) = i;
    StructA *d_A; // pointer for device structure storage
    // allocate device memory for the structure storage
    cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
    // create a pointer for cudaMalloc to use for embedded pointer device storage
    int *temp;
    // allocate device storage for the embedded pointer storage
    cudaMalloc((void **)&temp, 10*sizeof(int));
    // copy this newly created *pointer* to its proper location in the device copy of the structure
    cudaMemcpy(&(d_A->a), &temp, sizeof(int *), cudaMemcpyHostToDevice);
    // copy the data pointed to by the embedded pointer from the host to the device
    cudaMemcpy(temp, h_A->a, 10*sizeof(int), cudaMemcpyHostToDevice);
    kernel<<<1, 1>>>(d_A); // Passing pointer to StructA in device
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_20 -o t363 t363.cu
$ cuda-memcheck ./t363
========= CUDA-MEMCHECK
The value is 2
========= ERROR SUMMARY: 0 errors
$
You'll note that I haven't worked out the case where you are dealing with an array of StructA (i.e. numS > 1); that will require a loop. I'll leave it to you to work through the logic I've presented here and in my previously linked answer to see if you can work out the details of that loop. Furthermore, for the sake of clarity and brevity I've dispensed with the usual CUDA error checking, but please use it in your code. Finally, in case you haven't concluded it already, this process (sometimes called a "deep copy" operation) is somewhat tedious in ordinary CUDA. Previous recommendations along these lines are to "flatten" such structures (so that they don't contain pointers), but you can also explore cudaMallocManaged, i.e. Unified Memory, introduced in CUDA 6.
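For readers who want something to compare their own attempt against, here is one possible shape of the loop for numS > 1. It is a sketch under stated assumptions, not the answer author's code: every embedded array is assumed to have the same length n, a host-side staging array holds copies of the structures whose a members are device addresses, and the names deep_copy_to_device and staging are hypothetical. Error checking is again omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef struct {
    int *a;
    int foo;
} StructA;

// Each thread prints one element from its own structure.
__global__ void kernel(const StructA *data, int numS)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < numS) printf("struct %d: a[2] = %d\n", s, data[s].a[2]);
}

// Hypothetical helper: deep-copy numS host structures (n ints each) to the device.
// 'staging' is caller-allocated host memory; on return its 'a' members hold
// device addresses, which the caller needs later for cudaFree.
StructA *deep_copy_to_device(const StructA *h_A, StructA *staging, int numS, int n)
{
    for (int s = 0; s < numS; s++) {
        staging[s] = h_A[s];  // copies the non-pointer members
        cudaMalloc((void **)&staging[s].a, n * sizeof(int));
        cudaMemcpy(staging[s].a, h_A[s].a, n * sizeof(int), cudaMemcpyHostToDevice);
    }
    StructA *d_A = NULL;
    cudaMalloc((void **)&d_A, numS * sizeof(StructA));
    // One copy moves all structures, embedded device pointers included.
    cudaMemcpy(d_A, staging, numS * sizeof(StructA), cudaMemcpyHostToDevice);
    return d_A;
}

int main()
{
    const int numS = 3, N = 10;
    StructA *h_A = (StructA *)malloc(numS * sizeof(StructA));
    StructA *staging = (StructA *)malloc(numS * sizeof(StructA));
    for (int s = 0; s < numS; s++) {
        h_A[s].foo = s;
        h_A[s].a = (int *)malloc(N * sizeof(int));
        for (int i = 0; i < N; i++) h_A[s].a[i] = 100 * s + i;
    }

    StructA *d_A = deep_copy_to_device(h_A, staging, numS, N);
    kernel<<<1, numS>>>(d_A, numS);
    cudaDeviceSynchronize();

    for (int s = 0; s < numS; s++) {
        cudaFree(staging[s].a);
        free(h_A[s].a);
    }
    cudaFree(d_A);
    free(staging);
    free(h_A);
    return 0;
}
```

Building the pointers in a host staging array and copying the array once avoids a separate small cudaMemcpy per embedded pointer, at the cost of one extra host allocation.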