Why is the NVIDIA Pascal GPU slow on running CUDA kernels when using cudaMallocManaged
I was testing the new CUDA 8 along with the Pascal Titan X GPU and was expecting a speed-up for my code, but for some reason it ended up being slower. I am on Ubuntu 16.04.
Here is the minimal code that reproduces the result:
CUDASample.cuh
class CUDASample{
public:
void AddOneToVector(std::vector<int> &in);
};
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int mx = gridDim.x * blockDim.x;
data[y * mx + x] = data[y * mx + x] + 1.0f;
}
void CUDASample::AddOneToVector(std::vector<int> &in){
int *data;
cudaMallocManaged(reinterpret_cast<void **>(&data),
in.size() * sizeof(int),
cudaMemAttachGlobal);
for (std::size_t i = 0; i < in.size(); i++){
data[i] = in.at(i);
}
dim3 blks(in.size()/(16*32),1);
dim3 threads(32, 16);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
for (std::size_t i = 0; i < in.size(); i++){
in.at(i) = data[i];
}
cudaFree(data);
}
Main.cpp
std::vector<int> v;
for (int i = 0; i < 8192000; i++){
v.push_back(i);
}
CUDASample cudasample;
cudasample.AddOneToVector(v);
The only difference is the NVCC flag, which for the Pascal Titan X is:
-gencode arch=compute_61,code=sm_61 -std=c++11;
and for the old Maxwell Titan X is:
-gencode arch=compute_52,code=sm_52 -std=c++11;
EDIT: Here are the results from running the NVIDIA Visual Profiler.
For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.
For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.
I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int mx = gridDim.x * blockDim.x;
data[y * mx + x] = data[y * mx + x] + 1.0f;
}
void CUDASample::AddOneToVector(std::vector<int> &in){
int *data;
cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()),
in.size() * sizeof(int), cudaMemcpyHostToDevice);
dim3 blks(in.size()/(16*32),1);
dim3 threads(32, 16);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data),
in.size() * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(data);
}
For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.
For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see...
Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It would be a travesty if I had to revert all my existing code that uses cudaMallocManaged back to cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using it results in a slower run time even though the code is easier to write, that should be unacceptable, because the whole point of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this kind of result?
Solution
Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see the note at the end about updated CUDA 9 behavior on Windows.)
With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.
With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:
1. Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
2. Cause that page of memory to be migrated from the CPU to the GPU
This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be significantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).
Both approaches have their pros and cons, and I don't wish to debate the merits of various opinions or viewpoints here. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.
This particular code example, however, does not benefit. This was anticipated, and so the recommended way to bring the behavior in line with previous (e.g. Maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.
You would use CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.
With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
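For example, a minimal sketch in the context of the question's AddOneToVector (assuming the same data, blks, and threads variables from the question, and assuming the GPU in use is device 0) might look like this:
// Sketch only: prefetch the managed allocation to the device before launching.
// 0 is the assumed destination device ID; NULL selects the default stream,
// which is also the stream the kernel launch below runs on.
cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, NULL);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();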
As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.
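A hypothetical sketch of such a back-to-back test (reusing the question's kernel and variables) could be:
// First launch pays the demand-paging cost on Pascal.
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
// Second launch finds the data already resident on the GPU; ~full speed.
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();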
Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in it's latency-hiding architecture, if there is sufficient other available "work" to attend to.
As an additional note, in CUDA 9 the demand-paging regime for Pascal and beyond is only available on Linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On Windows, even for Pascal devices and beyond, as of CUDA 9 the UM regime is the same as on Maxwell and prior devices: data is migrated to the GPU en masse, at kernel launch.
(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.
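To make that distinction concrete, here is a minimal sketch of the two scenarios (N is an assumed element count; error checking omitted):
int *data;
cudaMallocManaged(reinterpret_cast<void **>(&data), N * sizeof(int));
// Case 1 (this question): CPU code touches the pages first, so they become
// resident in host memory, and a Pascal kernel must then fault them back in.
for (std::size_t i = 0; i < N; i++) data[i] = 0;
// Case 2 (the odd case): if CPU code never touches the allocation, the pages
// are effectively already resident in device memory when a kernel runs.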
Additional information is available in this blog article.