Why can't I get the correct sum of a 1D array with Numba (CUDA Python)?

2022-01-10 python numba cuda numba-pro

Problem description

I am trying to use CUDA Python with Numba. The code below is meant to calculate the sum of a 1D array, but I don't know how to get a single value as the result rather than three values.

Python 3.5 with Numba + CUDA 8.0

import os,sys,time
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = (1, 1)   # blocks per grid
tpb = (1, 3)   # threads per block

@cuda.jit
def calcu_sum(D, T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh
    print('su:', su)
    T[0,0] = su
    print('T:', T[0,0])


D = np.array([ 0.42487645,0.41607881,0.42027071,0.43751907,0.43512794,0.43656972,
               0.43940639,0.43864551,0.43447691,0.43120232], dtype=np.float32)
T = np.empty([1,1])
print('D: ',D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT = cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD, dT)

The output is:

D:  [ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
  0.43940639  0.43864551  0.43447691  0.43120232]
su:  1.733004
su:  1.289852
su:  1.291317
T: 1.733004
T: 1.289852
T: 1.291317

Why can't I get the output "4.31417383" rather than "1.733004 1.289852 1.291317"? After all, 1.733004 + 1.289852 + 1.291317 = 4.314173.

I'm new to Numba and have read the Numba documentation, but I don't know how to do this. Can someone give some advice?


Solution

The reason you don't get the sum you expect is that you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran, or Python as your language) is that you write kernel code which is executed by every thread. You have written code for each thread to read and sum part of the input array, but you have not written any code for those threads to share their individual partial sums and combine them into a final sum.

There is an extremely well-described algorithm for doing this: it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships with the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read a more modern version of the algorithm which uses newer features of CUDA (warp shuffle instructions and atomic transactions) here.
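If you want to experiment with the atomic route, a minimal sketch in Numba might look like the following. This is not the code from the linked material: the kernel name sum_atomic, the grid-stride loop, and the launch configuration of 32 threads are illustrative assumptions.

import numpy as np
from numba import cuda

@cuda.jit
def sum_atomic(D, out):
    # each thread walks the array with a grid-sized stride ...
    i = cuda.grid(1)
    stride = cuda.gridsize(1)
    s = 0.0
    while i < D.shape[0]:
        s += D[i]
        i += stride
    # ... and atomically folds its partial sum into the single output slot
    cuda.atomic.add(out, 0, s)

D = np.array([0.42487645, 0.41607881, 0.42027071, 0.43751907, 0.43512794,
              0.43656972, 0.43940639, 0.43864551, 0.43447691, 0.43120232],
             dtype=np.float32)
out = np.zeros(1, dtype=np.float32)
sum_atomic[1, 32](D, out)
print(out[0])   # expected to be close to 4.3141738

The atomic version is short, but every thread's final add to the same location is serialized, which is one reason the shared-memory tree reduction described in the linked material is usually preferred for larger arrays.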

After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:

tpb = (1, 3)

@cuda.jit
def calcu_sum(D, T):

    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    # per-block scratch space, one slot per thread in the block
    sbuf = cuda.shared.array(tpb, float32)

    # each thread accumulates a strided partial sum of the input
    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh

    print('su:', su)

    # publish the partial sum and wait until every thread has done so
    sbuf[0, ty] = su
    cuda.syncthreads()

    # thread 0 combines the per-thread partial sums into the final result
    if ty == 0:
        T[0,0] = 0
        for i in range(0, bh):
            T[0,0] += sbuf[0, i]
        print('T:', T[0,0])

will probably do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material I provided links to.
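For completeness, a host-side launch of the kernel above could look roughly like this. It is a sketch that assumes the same bpg, tpb, and stream setup as in the question; copy_to_host is the standard Numba call for reading the device result back.

bpg = (1, 1)                    # same launch configuration as the question
T = np.zeros((1, 1), dtype=np.float32)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT = cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD, dT)

result = dT.copy_to_host()      # blocking copy after the stream has synchronized
print(result[0, 0])             # expected to be close to 4.3141738

As a side note, if your Numba version provides it, the built-in cuda.reduce decorator turns a simple binary function into a full reduction kernel and is probably the easiest way to sum a 1D array:

@cuda.reduce
def sum_reduce(a, b):
    return a + b

print(sum_reduce(D))   # expected to be close to 4.3141738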
