Multithreading degrades GPU performance

2022-02-27 00:00:00 python multithreading performance gpu

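Problem description

In my Python application, I use Detectron2 to run predictions on images and detect the keypoints of every person in the image.

I want to run predictions on frames that are streamed live to my application (using aiortc), but I found that prediction times are much worse because prediction now runs on a new thread (the main thread is occupied by the server).

Running a prediction on a thread takes anywhere from 1.5 to 4 seconds, which is a lot.

When I run predictions on the main thread (without the video streaming part), I get prediction times of less than one second.

My question is why this happens and how I can fix it. Why does GPU performance degrade so drastically when the GPU is used from a new thread?

Notes:

  1. The code was tested in Google Colab with a Tesla P100 GPU, and the video stream is emulated by reading frames from a video file.

  2. I measure the time it takes to run a prediction on a frame using the code from this question (see the CodeTimer class in the example code).

I tried switching to multiprocessing but could not get it to work with CUDA (I tried both import multiprocessing and import torch.multiprocessing, with set_start_method('spawn')); it just gets stuck when calling start on the process.

Example code: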

from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

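import random  # used by send_frames to simulate delays between frames
import threading
import time  # used by send_frames (time.sleep)
from typing import List
import numpy as np
import timeit
import cv2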

# Prepare the configuration file
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")

cfg.MODEL.DEVICE = "cuda"
predictor = DefaultPredictor(cfg)


def get_frames(video: cv2.VideoCapture):
    frames = list()
    while True:
        has_frame, frame = video.read()
        if not has_frame:
            break
        frames.append(frame)
    return frames

class CodeTimer:
    # Source: https://stackoverflow.com/a/52749808/9977758
    def __init__(self, name=None):
        self.name = " '" + name + "'" if name else ''

    def __enter__(self):
        self.start = timeit.default_timer()

    def __exit__(self, exc_type, exc_value, traceback):
        self.took = (timeit.default_timer() - self.start) * 1000.0
        print('Code block' + self.name + ' took: ' + str(self.took) + ' ms')

video = cv2.VideoCapture('DemoVideo.mp4')
num_frames = round(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames_buffer = list()
predictions = list()

def send_frames():
    # This function emulates the stream, so here we "get" a frame and add it to our buffer
    for frame in get_frames(video):
        frames_buffer.append(frame)
        # Simulate delays between frames
        time.sleep(random.uniform(0.3, 2.1))

def predict_frames():
    predicted_frames = 0  # The number of frames predicted so far
    while predicted_frames < num_frames:  # Stop after we predicted all frames
        buffer_length = len(frames_buffer)
        if buffer_length <= predicted_frames:
            continue  # Wait until we get a new frame

        # Read all the frames from the point we stopped
        for frame in frames_buffer[predicted_frames:]:
            # Measure the prediction time
            with CodeTimer('In stream prediction'):
                predictions.append(predictor(frame))
            predicted_frames += 1


t1 = threading.Thread(target=send_frames)
t1.start()
t2 = threading.Thread(target=predict_frames)
t2.start()
t1.join()
t2.join()

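Solution

The problem lies in your hardware, your libraries, or a difference between your example code and your actual code.

I ran your code on an NVIDIA Jetson Xavier. I installed all the required libraries with the following commands: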

# first create your virtual env
virtualenv -p python3 detectron_gpu
source detectron_gpu/bin/activate

#torch for jetson
wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev 
pip3 install Cython
pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl

# torchvision
pip install 'git+https://github.com/pytorch/vision.git@v0.9.0'

# detectron
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

# ipython bindings (optional)
pip install ipykernel cloudpickle 

# opencv
pip install opencv-python

After that, I ran your example script on a sample video and got the following output:

Code block 'In stream prediction' took: 2932.241764000537 ms
Code block 'In stream prediction' took: 409.69691300051636 ms
Code block 'In stream prediction' took: 410.03823099981673 ms
Code block 'In stream prediction' took: 409.4023269999525 ms

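After the first pass, the detector consistently takes about 400 ms to run a detection. That seems reasonable for the Jetson Xavier. I did not experience the slowdown you describe.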
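The first measurement in the output above (about 2.9 s) is far larger than the rest, a pattern commonly attributed to one-time initialization on the first call. When comparing threaded and non-threaded timings, it may help to discard that warm-up call and look at the remaining runs. The following is a minimal sketch rather than part of the original script: time_calls and summarize are hypothetical helper names, and the optional sync argument assumes a PyTorch backend, where torch.cuda.synchronize would be passed because CUDA kernels are launched asynchronously.

```python
import statistics
import timeit


def time_calls(fn, args_list, warmup=1, sync=None):
    """Time fn(arg) for each arg and drop the first `warmup` measurements.

    `sync` is an optional zero-argument callable invoked before each clock
    read. Assumption: for GPU inference with PyTorch one would pass
    torch.cuda.synchronize so pending kernels finish before the clock is read.
    """
    timings_ms = []
    for i, arg in enumerate(args_list):
        if sync is not None:
            sync()  # make sure earlier work is not billed to this call
        start = timeit.default_timer()
        fn(arg)
        if sync is not None:
            sync()  # wait for asynchronous work started by fn
        elapsed_ms = (timeit.default_timer() - start) * 1000.0
        if i >= warmup:  # skip the warm-up calls
            timings_ms.append(elapsed_ms)
    return timings_ms


def summarize(timings_ms):
    """Return the number of runs, the median and the worst case, in ms."""
    return {
        "runs": len(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }
```

With the objects from the question, a call might look like summarize(time_calls(predictor, frames, warmup=1, sync=torch.cuda.synchronize)), run once from the main thread and once from a worker thread so the two summaries can be compared directly.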

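I should point out that the Jetson is a particular kind of hardware: the CPU and GPU share the same RAM, so I do not have to transfer data from the CPU to the GPU. If your slowdown is caused by transfers between CPU and GPU memory, I would not run into it on my setup.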
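Independently of the hardware, the hand-off between the streaming thread and the prediction thread in the example can be written without polling a shared list. The sketch below is an illustration under assumptions, not the asker's code: producer, consumer and run_pipeline are hypothetical names, and predict stands in for the Detectron2 predictor. It does not by itself explain the slowdown, but it removes the busy-wait loop as a variable when timings are compared.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the emulated stream


def producer(frames, frame_queue):
    """Emulate the stream: push each frame, then signal completion."""
    for frame in frames:
        frame_queue.put(frame)
    frame_queue.put(_SENTINEL)


def consumer(frame_queue, predict, results):
    """Run predict on each frame as it arrives."""
    while True:
        frame = frame_queue.get()  # blocks until a frame is available
        if frame is _SENTINEL:
            break
        results.append(predict(frame))


def run_pipeline(frames, predict):
    """Start both threads, wait for them, and return the predictions."""
    frame_queue = queue.Queue()
    results = []
    sender = threading.Thread(target=producer, args=(frames, frame_queue))
    worker = threading.Thread(target=consumer,
                              args=(frame_queue, predict, results))
    sender.start()
    worker.start()
    sender.join()
    worker.join()
    return results
```

Because queue.Queue.get blocks while the queue is empty, the prediction thread sleeps between frames instead of spinning, and results keep the order in which frames were sent.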
