cudaMalloc a structure and an element of the same structure

2021-12-24 00:00:00 c memory-management cuda c++

I would like to know what happens on the device (memory wise) when I allocate a structure and then allocate(?) and copy a pointer element of the same structure.

Do I need to cudaMalloc the element *a again?

Sample code:

typedef struct {
  int *a;
  ...
} StructA;

int main() 
{
  int row, col, numS = 10; // defined at runtime

  StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));
  int *h_A = d_a->a;

  cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );

  cudaMalloc( &(d_A->a), row*col*sizeof(int) ); // no (void**) needed?
  cudaMemcpy( d_A->a, h_A, row*col*sizeof(int), cudaMemcpyHostToDevice );

  kernel<<<grid, block>>>(d_A); // Passing pointer to StructA in device
  ...
}

Kernel definition:

__global__ kernel(StructA *d_A)
{
  d_A->a = ...;
  ...
}

This question is related to this question and this question.

Recommended answer

I would suggest that you put some effort into compiling and running your codes with proper cuda error checking. Learning to interpret the compiler output and runtime output will make you a better, smarter, more efficient coder. I also suggest reviewing the writeup I previously pointed you to. It deals with this exact topic and includes linked worked examples. This question is a duplicate of that one.
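A minimal sketch of what "proper cuda error checking" can look like follows. The names cuda_check_impl, CUDA_CHECK, touch and run_checked are hypothetical and chosen only for illustration. The pattern is a common one, not the only one:

1. Check the cudaError_t returned by every runtime call.
2. After a kernel launch, call cudaGetLastError to catch launch errors.
3. Check cudaDeviceSynchronize for errors raised while the kernel ran.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a failed CUDA runtime call with its source location.
// Returns true on success so the caller decides whether to continue.
static bool cuda_check_impl(cudaError_t err, const char *expr, const char *file, int line)
{
  if (err != cudaSuccess) {
    fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, expr, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Wraps any call that returns cudaError_t.
#define CUDA_CHECK(call) cuda_check_impl((call), #call, __FILE__, __LINE__)

// Trivial kernel used only to demonstrate launch checking.
__global__ void touch(int *p)
{
  p[threadIdx.x] = threadIdx.x;
}

// Launches touch with n threads in a single block; returns whether every step succeeded.
static bool run_checked(int n)
{
  int *d = NULL;
  if (!CUDA_CHECK(cudaMalloc((void **)&d, n * sizeof(int)))) return false;
  touch<<<1, n>>>(d);
  // cudaGetLastError reports launch errors (e.g. an invalid block size);
  // cudaDeviceSynchronize reports errors raised while the kernel executed.
  bool ok = CUDA_CHECK(cudaGetLastError()) && CUDA_CHECK(cudaDeviceSynchronize());
  CUDA_CHECK(cudaFree(d));
  return ok;
}
```

For example, run_checked(32) should succeed. run_checked(2048) should fail at the cudaGetLastError step, because one block cannot exceed 1024 threads.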

There are various errors:

StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));

The above line of code creates an allocation in host memory for numS structures of type StructA, and sets the pointer d_A to point to the start of that allocation. Nothing is wrong so far.

cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );

The above line of code creates an allocation of numS * sizeof(StructA) bytes in device memory, and sets the pointer d_A to point to the start of that allocation. This overwrites the previous pointer value. (The previous host allocation still exists, but you no longer have any way to access it, so it is effectively leaked.) Surely that was not your intent.
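To keep both allocations, give them separate variables. The sketch below allocates the host and device arrays side by side. It then uses cudaPointerGetAttributes to confirm which kind of memory each pointer refers to. allocate_both and is_device_pointer are hypothetical helpers. The code assumes CUDA 10 or later, where cudaPointerAttributes has a type member.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

typedef struct {
  int *a;
  int foo;
} StructA;

// True only if p refers to plain device memory (e.g. from cudaMalloc).
// Assumes CUDA 10+, where cudaPointerAttributes exposes the `type` member.
static bool is_device_pointer(const void *p)
{
  cudaPointerAttributes attr;
  if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
    cudaGetLastError();  // older toolkits reject plain malloc memory; clear that error
    return false;
  }
  return attr.type == cudaMemoryTypeDevice;
}

// Allocate host and device storage for numS structs in two distinct variables,
// so neither assignment overwrites (and leaks) the other.
static bool allocate_both(int numS, StructA **h_A, StructA **d_A)
{
  *h_A = (StructA *)malloc(numS * sizeof(StructA));
  if (*h_A == NULL) return false;
  if (cudaMalloc((void **)d_A, numS * sizeof(StructA)) != cudaSuccess) {
    free(*h_A);
    *h_A = NULL;
    return false;
  }
  return true;
}
```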

int *h_A = d_a->a;

Now that d_A (I assume you meant d_A, not d_a) has been assigned as a device memory pointer, the -> operation will dereference that pointer to locate the element a. This is illegal in host code and will typically crash with a segmentation fault.
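When host code needs the value of a field inside a device-resident struct, it has to copy those bytes back rather than dereference the pointer. Below is a minimal sketch using the same StructA layout as the working code in this answer (int *a; int foo;). read_embedded_pointer is a hypothetical helper name.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

typedef struct {
  int *a;
  int foo;
} StructA;

// Returns the value of d_A->a, where d_A points to device memory.
// &(d_A->a) is used only as an address (d_A plus the offset of a);
// the actual read of the field is done by cudaMemcpy, not by host code.
static int *read_embedded_pointer(const StructA *d_A)
{
  int *a_value = NULL;
  if (cudaMemcpy(&a_value, &(d_A->a), sizeof(int *), cudaMemcpyDeviceToHost) != cudaSuccess)
    return NULL;
  // a_value is itself a device address: pass it to kernels or cudaMemcpy,
  // but do not evaluate *a_value on the host.
  return a_value;
}
```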

cudaMalloc( &(d_A->a), row*col*sizeof(int) );

This line of code has a similar issue. We cannot cudaMalloc a pointer that lives in device memory. cudaMalloc creates pointers that live in host memory but reference a location in device memory: it stores the new device address into the host variable whose address you pass it. Here that address, &(d_A->a), points into device memory, so cudaMalloc would have to write to device memory from host code, which is illegal.

A proper code would be something like this:

$ cat t363.cu
#include <stdio.h>

typedef struct {
  int *a;
  int foo;
} StructA;

__global__ void kernel(StructA *data){

  printf("The value is %d\n", *(data->a + 2));
}

int main()
{
  int  numS = 1; // defined at runtime

  //allocate host memory for the structure storage
  StructA *h_A = (StructA*)malloc(numS * sizeof(StructA));
  //allocate host memory for the storage pointed to by the embedded pointer
  h_A->a = (int *)malloc(10*sizeof(int));
  // initialize data pointed to by the embedded pointer
  for (int i = 0; i <10; i++) *(h_A->a+i) = i;
  StructA *d_A;  // pointer for device structure storage
  //allocate device memory for the structure storage
  cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
  // create a pointer for cudaMalloc to use for embedded pointer device storage
  int *temp;
  //allocate device storage for the embedded pointer storage
  cudaMalloc((void **)&temp, 10*sizeof(int));
  //copy this newly created *pointer* to its proper location in the device copy of the structure
  cudaMemcpy(&(d_A->a), &temp, sizeof(int *), cudaMemcpyHostToDevice);
  //copy the data pointed to by the embedded pointer from the host to the device
  cudaMemcpy(temp, h_A->a, 10*sizeof(int), cudaMemcpyHostToDevice);

  kernel<<<1, 1>>>(d_A); // Passing pointer to StructA in device
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_20 -o t363 t363.cu
$ cuda-memcheck ./t363
========= CUDA-MEMCHECK
The value is 2
========= ERROR SUMMARY: 0 errors
$

You'll note that I haven't worked out the case where you are dealing with an array of StructA (i.e. numS > 1); that will require a loop. I'll leave it to you to work through the logic I've presented here and in my previous linked answer to see if you can work out the details of that loop. Furthermore, for the sake of clarity/brevity I've dispensed with the usual cuda error checking, but please use it in your codes. Finally, if you haven't concluded it already, this process (sometimes called a "deep copy" operation) is somewhat tedious in ordinary CUDA. Previous recommendations along these lines are to "flatten" such structures (so that they don't contain pointers), but you can also explore cudaMallocManaged, i.e. Unified Memory in CUDA 6.
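For the numS > 1 case, one possible shape of that loop is sketched below. It is an illustration, not the answer's own solution. deep_copy_to_device, free_device_copy, read_third and gather_third are hypothetical names. Every element is assumed to own an array of the same length n, and error checking is omitted as in the answer's code.

Instead of copying each embedded pointer separately, the sketch fills a host-side staging array whose .a fields hold device addresses. It then copies all the structs with a single cudaMemcpy.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

typedef struct {
  int *a;
  int foo;
} StructA;

// Deep-copy numS host structs, each owning n ints at h_A[i].a, to the device.
// d_inner (a host array of numS pointers) receives each element's device buffer
// so it can be freed later. Error checking omitted for brevity.
static StructA *deep_copy_to_device(const StructA *h_A, int numS, int n, int **d_inner)
{
  StructA *staged = (StructA *)malloc(numS * sizeof(StructA));
  if (staged == NULL) return NULL;
  for (int i = 0; i < numS; i++) {
    staged[i] = h_A[i];                                 // copies foo (and, for now, the host pointer a)
    cudaMalloc((void **)&d_inner[i], n * sizeof(int));  // device storage for element i's array
    cudaMemcpy(d_inner[i], h_A[i].a, n * sizeof(int), cudaMemcpyHostToDevice);
    staged[i].a = d_inner[i];                           // replace the host pointer with the device pointer
  }
  StructA *d_A = NULL;
  cudaMalloc((void **)&d_A, numS * sizeof(StructA));
  // One copy moves every struct; their embedded pointers already hold device addresses.
  cudaMemcpy(d_A, staged, numS * sizeof(StructA), cudaMemcpyHostToDevice);
  free(staged);
  return d_A;
}

// Free everything deep_copy_to_device allocated on the device.
static void free_device_copy(StructA *d_A, int **d_inner, int numS)
{
  for (int i = 0; i < numS; i++) cudaFree(d_inner[i]);
  cudaFree(d_A);
}

// One thread per struct, following the embedded pointer as the answer's kernel does.
__global__ void read_third(const StructA *data, int numS, int *out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numS) out[i] = data[i].a[2];
}

// Launch read_third and copy its numS results into h_out.
static void gather_third(const StructA *d_A, int numS, int *h_out)
{
  int *d_out = NULL;
  cudaMalloc((void **)&d_out, numS * sizeof(int));
  read_third<<<(numS + 255) / 256, 256>>>(d_A, numS, d_out);
  cudaMemcpy(h_out, d_out, numS * sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
  cudaFree(d_out);
}
```

Because staged[i] = h_A[i] copies the whole struct first, non-pointer fields such as foo travel along in the single bulk copy. Only pointer fields need per-element treatment.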
