CUDA device-to-host copy is very slow

2022-01-20 00:00:00 copy performance cuda c++ device

I'm running Windows 7 64-bit, CUDA 4.2, and Visual Studio 2010.

First, I run some code on the GPU, then download the data back to the host. I then do some processing on the host and move the data back to the device. After that I do the following device-to-host copy, which runs very fast, about 1 ms:

clock_t start, end;
int count=1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

It takes about 1 ms to finish.

Then I run some other code on the GPU, mainly atomic operations. When I copy the data from device to host after that, the copy takes a very long time, about 9 s.

__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

This copy takes about 9 s.

I ran the code several times, for example:

// Kernel, defined at file scope; body elided as in the original post.
__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

int i=0;
while (i<10)
{
clock_t start, end;
int count=1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); // first copy, ~1 ms
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

// dosomething is launched on d_bPtr here (the launch line is not shown in the original post)

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); // second copy, ~9 s
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
i++;
}

The results are pretty much the same on every iteration.
What could be the problem?

Thanks!

Recommended answer

The problem is one of timing, not of any change in copy performance. Kernel launches in CUDA are asynchronous, so what you are measuring is not just the time for thrust::copy but also the time for the previously launched kernel to complete. If you change your code for timing the copy operation to something like this:

cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

You should find that the transfer times are restored to their previous performance. So your real question isn't "why is thrust::copy slow", it is "why is my kernel slow". And based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialise kernel memory transactions".
