cudaMemcpy 分段错误

2022-01-10 00:00:00 cuda segmentation-fault c++

这个错误困扰我很久了,所以我决定把它贴在这里.

I've been haunted by this error for quite a while so I decided to post it here.

当调用 cudaMemcpy 时发生此分段错误:

This segmentation fault happened when a cudaMemcpy is called:

CurrentGrid->cdata[i] = new float[size];
cudaMemcpy(CurrentGrid->cdata[i], Grid_dev->cdata[i], size*sizeof(float),
                cudaMemcpyDeviceToHost);

CurrentGridGrid_dev 分别是指向主机和设备上的 grid 类对象的指针,在此上下文中 i=0.类成员cdata 是一个浮点型指针数组.为了调试,在调用 cudaMemcpy 之前,我打印出 Grid_Dev->cdata[i] 的每个元素的值,即 CurrentGrid->cdata[i]Grid_dev->cdata[i] 以及 size 的值,看起来都不错.但它仍然以Segmentation fault (core dumped)"告终,这是唯一的错误消息.cuda-memcheck 只给出进程没有成功终止".我目前无法使用 cuda-gdb.关于去哪里有什么建议吗?

CurrentGrid and Grid_dev are pointer to a grid class object on host and device respectively and i=0 in this context. Class member cdata is a float type pointer array. For debugging, right before this cudaMemcpy call I printed out the value of each element of Grid_Dev->cdata[i], the address of CurrentGrid->cdata[i] and Grid_dev->cdata[i] and the value of size, which all looks good. But it still ends up with "Segmentation fault (core dumped)", which is the only error message. cuda-memcheck only gave "process didn't terminate successfully". I'm not able to use cuda-gdb at the moment. Any suggestion about where to go?

更新:现在看来我已经通过 cudaMalloc 解决了这个问题,设备上的另一个浮点指针 A 和 cudaMemcpy Grid_dev->cdata[i] 的值到 A,然后 cudaMemcpy A 到主机.所以上面写的这段代码就变成了:

UPDATE: It seems now I have solved this problem by cudaMalloc another float pointer A on device and cudaMemcpy the value of Grid_dev->cdata[i] to A, and then cudaMemcpy A to host. So the segment of code written above becomes:

float * A;
cudaMalloc((void**)&A, sizeof(float));
...
...
cudaMemcpy(&A, &(Grid_dev->cdata[i]), sizeof(float *), cudaMemcpyDeviceToHost);    
CurrentGrid->cdata[i] = new float[size];
cudaMemcpy(CurrentGrid->cdata[i], A, size*sizeof(float), cudaMemcpyDeviceToHost);            

我这样做是因为 valgrind 弹出大小为 8 的无效读取",我认为这是指 Grid_dev->cdata[i].我用 gdb 再次检查它,打印出 Grid_dev->cdata[i] 的值为 NULL.所以我想即使在这个 cudaMemcpy 调用中我也不能直接取消引用设备指针.但为什么 ?根据此 线程底部的评论 ,我们应该能够在 cudaMemcpy 函数中取消引用设备指针.

I did this because valgrind popped up "invalid read of size 8", which I thought referring to Grid_dev->cdata[i]. I checked it again with gdb, printing out the value of Grid_dev->cdata[i] being NULL. So I guess I cannot directly dereference the device pointer even in this cudaMemcpy call. But why ? According to the comment at the bottom of this thread , we should be able to dereference device pointer in cudaMemcpy function.

另外,我不知道 cudaMalloc 和 cudaMemcpy 如何工作的底层机制,但我认为通过 cudaMalloc 一个指针,在这里说 A,我们实际上分配这个指针指向设备上的某个地址.并且通过 cudaMemcpy 将 Grid_dev->cdata[i] 分配给 A,就像上面修改过的代码一样,我们重新分配指针 A 以指向数组.那我们在cudaMalloced时不就丢失了A指向的上一个地址的轨迹吗?这会导致内存泄漏还是什么?如果是,我应该如何正确解决这种情况?谢谢!

Also, I don't know the the underlying mechanism of how cudaMalloc and cudaMemcpy work but I think by cudaMalloc a pointer, say A here, we actually assign this pointer to point to a certain address on the device. And by cudaMemcpy the Grid_dev->cdata[i] to A as in the modified code above, we re-assign the pointer A to point to the array. Then don't we lose the track of the previous address that A pointed to when it is cudaMalloced? Could this cause memory leak or something? If yes, how should I work around this situation properly? Thanks!

作为参考,我将发生此错误的完整函数的代码放在下面.

For reference I put the code of the complete function in which this error happened below.

非常感谢!

__global__ void Print(grid *, int);
__global__ void Printcell(grid *, int);
void CopyDataToHost(param_t p, grid * CurrentGrid, grid * Grid_dev){

    cudaMemcpy(CurrentGrid, Grid_dev, sizeof(grid), cudaMemcpyDeviceToHost);
#if DEBUG_DEV
    cudaCheckErrors("cudaMemcpy1 error");
#endif
    printf("
Before copy cell data
");
    Print<<<1,1>>>(Grid_dev, 0);            //Print out some Grid_dev information for 
    cudaDeviceSynchronize();                //debug 
    int NumberOfBaryonFields = CurrentGrid->ReturnNumberOfBaryonFields();
    int size = CurrentGrid->ReturnSize();
    int vsize = CurrentGrid->ReturnVSize();
    CurrentGrid->FieldType = NULL;
    CurrentGrid->FieldType = new int[NumberOfBaryonFields];
    printf("CurrentGrid size is %d
", size);
    for( int i = 0; i < p.NumberOfFields; i++){
        CurrentGrid->cdata[i] = NULL;
        CurrentGrid->vdata[i] = NULL;
        CurrentGrid->cdata[i] = new float[size];
        CurrentGrid->vdata[i] = new float[vsize];

        Printcell<<<1,1>>>(Grid_dev, i);//Print out element value of Grid_dev->cdata[i]
        cudaDeviceSynchronize();        

        cudaMemcpy(CurrentGrid->cdata[i], Grid_dev->cdata[i], size*sizeof(float),
                cudaMemcpyDeviceToHost);               //where error occurs
#if DEBUG_DEV
        cudaCheckErrors("cudaMemcpy2 error");
#endif
        printf("
After copy cell data
");
        Print<<<1,1>>>(Grid_dev, i);
        cudaDeviceSynchronize();
        cudaMemcpy(CurrentGrid->vdata[i], Grid_dev->vdata[i], vsize*sizeof(float),
                cudaMemcpyDeviceToHost);
#if DEBUG_DEV
        cudaCheckErrors("cudaMemcpy3 error");
#endif
    }
    cudaMemcpy(CurrentGrid->FieldType, Grid_dev->FieldType,
            NumberOfBaryonFields*sizeof(int), cudaMemcpyDeviceToHost);
#if DEBUG_DEV
    cudaCheckErrors("cudaMemcpy4 error");
#endif
}

这是来自 valgrind 的信息,我试图从中找出内存泄漏发生的位置.

here is the information from valgrind, from which I'm trying to track down where the memory leak happened.

==19340== Warning: set address range perms: large range [0x800000000, 0xd00000000) (noaccess)
==19340== Warning: set address range perms: large range [0x200000000, 0x400000000) (noaccess)
==19340== Invalid read of size 8
==19340==    at 0x402C79: CopyDataToHost(param_t, grid*, grid*) (CheckDevice.cu:48)
==19340==    by 0x403646: CheckDevice(param_t, grid*, grid*) (CheckDevice.cu:186)
==19340==    by 0x40A6CD: main (Transport.cu:81)
==19340==  Address 0x2003000c0 is not stack'd, malloc'd or (recently) free'd
==19340== 
==19340== 
==19340== Process terminating with default action of signal 11 (SIGSEGV)
==19340==  Bad permissions for mapped region at address 0x2003000C0
==19340==    at 0x402C79: CopyDataToHost(param_t, grid*, grid*) (CheckDevice.cu:48)
==19340==    by 0x403646: CheckDevice(param_t, grid*, grid*) (CheckDevice.cu:186)
==19340==    by 0x40A6CD: main (Transport.cu:81)
==19340== 
==19340== HEAP SUMMARY:
==19340==     in use at exit: 2,611,365 bytes in 5,017 blocks
==19340==   total heap usage: 5,879 allocs, 862 frees, 4,332,278 bytes allocated
==19340== 
==19340== LEAK SUMMARY:
==19340==    definitely lost: 0 bytes in 0 blocks
==19340==    indirectly lost: 0 bytes in 0 blocks
==19340==      possibly lost: 37,416 bytes in 274 blocks
==19340==    still reachable: 2,573,949 bytes in 4,743 blocks
==19340==         suppressed: 0 bytes in 0 blocks
==19340== Rerun with --leak-check=full to see details of leaked memory
==19340== 
==19340== For counts of detected and suppressed errors, rerun with: -v
==19340== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)

推荐答案

我相信我知道问题出在哪里,但要确认它,查看您用于设置 的代码会很有用Grid_dev 设备上的类.

I believe I know what the problem is, but to confirm it, it would be useful to see the code that you are using to set up the Grid_dev classes on the device.

当要在设备上使用一个类或其他数据结构时,并且该类中有指针,这些指针指向内存中的其他对象或缓冲区(大概在设备内存中,对于将在设备上使用的类)),那么使这个顶级类在设备上可用的过程变得更加复杂.

When a class or other data structure is to be used on the device, and that class has pointers in it which refer to other objects or buffers in memory (presumably in device memory, for a class that will be used on the device), then the process of making this top-level class usable on the device becomes more complicated.

假设我有这样的课程:

class myclass{
  int myval;
  int *myptr;
  }

我可以在主机上实例化上述类,然后 malloc 一个 int 数组并将该指针分配给 myptr,然后一切会好的.要使此类仅在设备上可用且仅在设备上可用,过程可能相似.我可以:

I could instantiate the above class on the host, and then malloc an array of int and assign that pointer to myptr, and everything would be fine. To make this class usable on the device and the device only, the process could be similar. I could:

  1. cudaMalloc 指向将保存 myclass
  2. 的设备内存的指针
  3. (可选)使用 cudaMemcpy 将主机上 myclass 的实例化对象从步骤 1 复制到设备指针
  4. 在设备上,使用mallocnewmyptr
  5. 分配设备存储
  1. cudaMalloc a pointer to device memory that will hold myclass
  2. (optionally) copy an instantiated object of myclass on the host to the device pointer from step 1 using cudaMemcpy
  3. on the device, use malloc or new to allocate device storage for myptr

如果我不想访问主机上为 myptr 分配的存储空间,上述顺序很好.但如果我确实希望主机可以看到该存储,我需要一个不同的顺序:

The above sequence is fine if I never want to access the storage allocated for myptr on the host. But if I do want that storage to be visible from the host, I need a different sequence:

  1. cudaMalloc 一个指向设备内存的指针,它将保存 myclass,我们称之为 mydevobj
  2. (可选)使用 cudaMemcpy 将主机上 myclass 的实例化对象复制到步骤 1 中的设备指针 mydevobj
  3. 在宿主机上创建一个单独的int指针,我们称之为myhostptr
  4. cudaMalloc int 存储在设备上,用于 myhostptr
  5. cudaMemcpy 指针值 myhostptr 从主机到设备指针&(mydevobj->myptr)
  1. cudaMalloc a pointer to device memory that will hold myclass, let's call this mydevobj
  2. (optionally) copy an instantiated object of myclass on the host to the device pointer mydevobj from step 1 using cudaMemcpy
  3. Create a separate int pointer on the host, let's call it myhostptr
  4. cudaMalloc int storage on the device for myhostptr
  5. cudaMemcpy the pointer value of myhostptr from the host to the device pointer &(mydevobj->myptr)

之后,你可以cudaMemcpy将嵌入指针myptr指向的数据指向上分配(通过cudaMalloc)的区域>myhostptr

After that, you can cudaMemcpy the data pointed to by the embedded pointer myptr to the region allocated (via cudaMalloc) on myhostptr

注意,在第 5 步中,因为我取的是这个指针位置的地址,所以这个 cudaMemcpy 操作只需要主机上的 mydevobj 指针,这在 cudaMemcpy 操作中有效(仅).

Note that in step 5, because I am taking the address of this pointer location, this cudaMemcpy operation only requires the mydevobj pointer on the host, which is valid in a cudaMemcpy operation (only).

设备指针 myint 的值将被正确设置以执行您尝试执行的操作.如果您想在 myint 和主机之间传输 cudaMemcpy 数据,请在任何 cudaMemcpy 调用中使用指针 myhostptr,not mydevobj->myptr.如果我们尝试使用 mydevobj->myptr,则需要解引用 mydevobj,然后使用它来检索存储在 myptr 中的指针,然后使用该指针作为复制到/从位置.这在主机代码中是不可接受的.如果你尝试这样做,你会得到一个段错误.(请注意,通过类比,我的 mydevobj 就像你的 Grid_dev 而我的 myptr 就像你的 cdata)

The value of the device pointer myint will then be properly set up to do the operations you are trying to do. If you then want to cudaMemcpy data to and from myint to the host, you use the pointer myhostptr in any cudaMemcpy calls, not mydevobj->myptr. If we tried to use mydevobj->myptr, it would require dereferencing mydevobj and then using it to retrieve the pointer that is stored in myptr, and then using that pointer as the copy to/from location. This is not acceptable in host code. If you try to do it, you will get a seg fault. (Note that by way of analogy, my mydevobj is like your Grid_dev and my myptr is like your cdata)

总体而言,这是一个在您第一次遇到它时需要仔细考虑的概念,因此此类问题在 SO 中出现的频率很高.您可能想研究其中一些问题以查看代码示例(因为您尚未提供设置 Grid_dev 的代码):

Overall it is a concept that requires some careful thought the first time you run into it, and so questions like this come up with some frequency on SO. You may want to study some of these questions to see code examples (since you haven't provided your code that sets up Grid_dev):

  1. 示例 1
  2. 示例2
  3. 示例 3

相关文章