CUDA device-to-device transfers are expensive
I have written some code to swap the quadrants of a 2D matrix, stored in a flat array, for FFT purposes.
int leftover = W - dcW;
T *temp;
T *topHalf;
cudaMalloc((void **)&temp, dcW * sizeof(T));

// Swap every row, left and right: move the first dcW columns behind the remaining columns.
for (int i = 0; i < H; i++)
{
    cudaMemcpy(temp, &data[i*W], dcW * sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[i*W], &data[i*W + dcW], leftover * sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[i*W + leftover], temp, dcW * sizeof(T), cudaMemcpyDeviceToDevice);
}

// Swap top and bottom: move the first dcH rows below the remaining rows.
cudaMalloc((void **)&topHalf, dcH * W * sizeof(T));
leftover = H - dcH;
cudaMemcpy(topHalf, data, dcH * W * sizeof(T), cudaMemcpyDeviceToDevice);
cudaMemcpy(data, &data[dcH*W], leftover * W * sizeof(T), cudaMemcpyDeviceToDevice);
cudaMemcpy(&data[leftover*W], topHalf, dcH * W * sizeof(T), cudaMemcpyDeviceToDevice);

// Release the temporary device buffers.
cudaFree(temp);
cudaFree(topHalf);
Notice that this code takes device pointers and performs DeviceToDevice transfers.
Why does this run so slowly? Can it be optimized somehow? I timed it against the same operation on the host using regular memcpy, and the device version was about 2x slower.
Any ideas?
Answer
I ended up writing a kernel to do the swaps. This was indeed faster than the device-to-device memcpy operations.
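For reference, here is a minimal sketch of the kind of kernel that can replace the memcpy loop; it is not the poster's actual code. It assumes an out-of-place shift into a second device buffer, reuses W, H, dcW, and dcH from the question, and the names quadrantShift and shifted are made up for illustration.

// Performs both the left/right and top/bottom swaps in one launch by writing
// each element to its circularly shifted position in a separate output buffer.
template <typename T>
__global__ void quadrantShift(const T *in, T *out, int W, int H, int dcW, int dcH)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= W || row >= H)
        return;

    // Columns [0, dcW) move to the right end and columns [dcW, W) move to the
    // front; rows are shifted the same way using dcH, matching the memcpy version.
    int dstCol = (col + (W - dcW)) % W;
    int dstRow = (row + (H - dcH)) % H;
    out[dstRow * W + dstCol] = in[row * W + col];
}

// Example launch, assuming a second device buffer 'shifted' of W*H elements:
dim3 block(16, 16);
dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
quadrantShift<<<grid, block>>>(data, shifted, W, H, dcW, dcH);

A single kernel touches each element once, whereas the memcpy version issues roughly 3*H separate cudaMemcpy calls, each carrying its own per-call overhead for a fairly small row copy.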