How to generalize fast matrix multiplication on the GPU with Numba
Problem description
Lately I've been trying to get into GPU programming in Python using the Numba library. I have been working through the tutorial on their website, and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for fast matrix multiplication a bit (which is of the form A*B=C). When testing, I noticed that matrices whose dimensions are not evenly divisible by the number of threads per block (TPB) do not yield a correct answer.
I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work).
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.
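For a small case like this, one way to make that check concrete (a sketch, assuming the x_h, y_h, and z_d arrays defined above) is to compare the device result against NumPy on the host:

z_ref = x_h @ y_h                 # NumPy reference result on the host
z_gpu = z_d.copy_to_host()        # result computed by the kernel
assert np.allclose(z_gpu, z_ref)  # fails when the kernel produces a wrong answer (e.g. TPB = 3 above)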
Solution
Arguably, there are at least two issues in that posted code:
1. This can't possibly be a correct range check:

   if x >= C.shape[0] and y >= C.shape[1]:
In order for us to decide that a particular thread in the grid should not do any loading activity, we require either that x is out of range or that y is out of range. The "and" should have been an "or".
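For illustration, the corrected early-exit check (still not sufficient on its own, as item 2 below explains) would read:

if x >= C.shape[0] or y >= C.shape[1]:
    # Quit if (x, y) is outside of valid C boundary
    return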
2. It is illegal to use cuda.syncthreads() in conditional code, if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from "and" to "or") pretty much guarantees this illegal behavior for problem sizes not whole-number divisible by the threadblock size.
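A minimal sketch of the hazardous pattern, assuming a partial block at the matrix edge:

if x >= C.shape[0] or y >= C.shape[1]:
    return                  # edge threads exit the kernel here...
for i in range(bpg):
    sA[tx, ty] = A[x, ty + i * TPB]
    cuda.syncthreads()      # ...so this barrier is reached by only part of the block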
Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds, by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.
The following code has those items fixed. It seems to work correctly for your given test case:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
            sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
(Note that the use of "and" here in the bounds tests is correct. Testing whether a set of indices is in-bounds is different in a boolean sense from testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both indices to be in-bounds. In the out-of-bounds test, either index being out-of-bounds is disqualifying.)
I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems: a (TPB, TPB) block of, say, 33 x 33 = 1089 threads exceeds the CUDA limit of 1024 threads per block. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
#%%
x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
I've also reordered the sense of x and y (and the usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.
Again, no claims of being defect-free. Furthermore, I'm sure "more optimal" code could be arrived at. However, optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
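For instance, a minimal sketch of that route (assuming cupy is installed and x_h, y_h are the host arrays from above):

import cupy as cp
a = cp.asarray(x_h)     # copy the host arrays to the device
b = cp.asarray(y_h)
c = a @ b               # matrix multiply runs on the GPU (cuBLAS-backed for float types)
print(cp.asnumpy(c))    # copy the result back to the host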
As discussed here, the OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit float variable makes an important performance difference.
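The change is a single line in the kernel. In Numba, a bare Python float literal is treated as float64, which drags the whole accumulation into double precision:

tmp = 0.            # before: float64 accumulator, double-precision arithmetic
tmp = float32(0.)   # after: 32-bit accumulator, matching the float32 shared arrays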