异常后重置 Cuda 上下文

2021-12-17 00:00:00 exception cuda reset c++ cuda-context

我有一个使用 Cuda/C++ 的工作应用程序,但有时,由于内存泄漏,会引发异常.我需要能够实时重置 GPU,我的应用是一个服务器,所以它必须保持可用.

I have a working app which uses Cuda / C++, but sometimes, because of memory leaks, throws exception. I need to be able to reset the GPU on live, my app is a server so it has to stay available.

我尝试过类似的方法,但似乎不起作用:

I tried something like this, but it doesnt seems to work:

try
{
    // do process using GPU
}
catch (std::exception &e)
{
    // catching exception from cuda only

    cudaSetDevice(0);
    CUDA_RETURN_(cudaDeviceReset());
}

我的想法是每次从 GPU 收到异常时重置设备,但我无法使其正常工作.:(顺便说一句,由于某些原因,我无法解决 Cuda 代码的所有问题,我需要一个临时解决方案.谢谢!

My idea is to reset the device each times I get an exception from the GPU, but I cannot manage to make it working. :( Btw, for some reasons, I cannot fix every problems of my Cuda code, I need a temporary solution. Thanks !

推荐答案

在发生不可恢复(粘性")CUDA 错误后恢复正确设备功能的唯一方法是终止主机进程启动(即发出导致错误的 CUDA 运行时 API 调用).

The only method to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that initiated (i.e. issued the CUDA runtime API calls that led to) the error.

因此,对于单进程应用程序,唯一的方法就是终止应用程序.

Therefore, for a single-process application, the only method is to terminate the application.

应该可以设计一个多进程应用程序,其中初始(父")进程不使用任何 CUDA,并生成一个使用 GPU 的子进程.当子进程遇到不可恢复的 CUDA 错误时,它必须终止.

It should be possible to design a multi-process application, where the initial ("parent") process makes no usage of CUDA whatsoever, and spawns a child process that uses the GPU. When the child process encounters an unrecoverable CUDA error, it must terminate.

父进程可以选择性地监视子进程.如果它确定子进程已终止,则可以重新生成该进程并恢复 CUDA 功能行为.

The parent process can, optionally, monitor the child process. If it determines that the child process has terminated, it can re-spawn the process and restore CUDA functional behavior.

粘性与非粘性错误在别处有介绍,例如 这里.

Sticky vs. non-sticky errors are covered elsewhere, such as here.

一个适当的多进程应用程序的例子,它使用例如fork() 生成使用 CUDA 的子进程可在 CUDA 示例代码 simpleIPC 中找到.这是从 simpleIPC 示例(适用于 linux)组装而成的粗略示例:

An example of a proper multi-process app that uses e.g. fork() to spawn a child process that uses CUDA is available in the CUDA sample code simpleIPC. Here is a rough example assembled from the simpleIPC example (for linux):

$ cat t477.cu
/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

// Includes
#include <stdio.h>
#include <assert.h>

// CUDA runtime includes
#include <cuda_runtime_api.h>

// CUDA utilities and system includes
#include <helper_cuda.h>

#define MAX_DEVICES          1
#define PROCESSES_PER_DEVICE 1
#define DATA_BUF_SIZE        4096

#ifdef __linux
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/version.h>

typedef struct ipcDevices_st
{
    int count;
    int results[MAX_DEVICES];
} ipcDevices_t;


// CUDA Kernel
__global__ void simpleKernel(int *dst, int *src, int num)
{
    // Dummy kernel
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dst[idx] = src[idx] / num;
}


void runTest(int index, ipcDevices_t* s_devices)
{
    if (s_devices->results[0] == 0){
        simpleKernel<<<1,1>>>(NULL, NULL, 1);  // make a fault
        cudaDeviceSynchronize();
        s_devices->results[0] = 1;}
    else {
        int *d, *s;
        int n = 1;
        cudaMalloc(&d, n*sizeof(int));
        cudaMalloc(&s, n*sizeof(int));
        simpleKernel<<<1,1>>>(d, s, n);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
          s_devices->results[0] = 0;
        else
          s_devices->results[0] = 2;}
    cudaDeviceReset();
}
#endif

int main(int argc, char **argv)
{

    ipcDevices_t *s_devices = (ipcDevices_t *) mmap(NULL, sizeof(*s_devices),
                                                    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
    assert(MAP_FAILED != s_devices);

    // We can't initialize CUDA before fork() so we need to spawn a new process
    s_devices->count = 1;
    s_devices->results[0] = 0;

    printf("
Spawning child process
");
    int index = 0;

    pid_t pid = fork();

    printf("> Process %3d
", pid);
    if (pid == 0) { // child process
    // launch our test
      runTest(index, s_devices);
    }
    // Cleanup and shutdown
    else { // parent process
            int status;
            waitpid(pid, &status, 0);
            if (s_devices->results[0] < 2) {
              printf("first process launch reported error: %d
", s_devices->results[0]);
              printf("respawn
");
              pid_t newpid = fork();
              if (newpid == 0) { // child process
                    // launch our test
                 runTest(index, s_devices);
                  }
    // Cleanup and shutdown
              else { // parent process
                int status;
                waitpid(newpid, &status, 0);
                if (s_devices->results[0] < 2)
                  printf("second process launch reported error: %d
", s_devices->results[0]);
                else
                  printf("second process launch successful
");
                }

            }

    }

    printf("
Shutting down...
");

    exit(EXIT_SUCCESS);

}
$ nvcc -I/usr/local/cuda/samples/common/inc t477.cu -o t477
$ ./t477

Spawning child process
> Process 10841
> Process   0

Shutting down...
first process launch reported error: 1
respawn

Shutting down...
second process launch successful

Shutting down...
$

对于 Windows,唯一需要更改的应该是使用 Windows IPC 机制进行主机进程间通信.

For windows, the only changes need should be to use a windows IPC mechanism for host interprocess communication.

相关文章