Example of increasing the work per thread in CUDA
Algorithm:
I'm writing a program with CUDA and the problem is the following:
Two matrices A (n * 128) and B (m * 128)
I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.
I write the result of each distance on a row of a matrix C, so the element C(i,j) of C contains the distance between row i of A and row j of B.
and I proceed with the next row of A.
I've implemented it this way: I've got a grid made by ( n * m ) blocks, and 128 threads per block. ( 1 * 128 ).
QUESTION: The program runs successfully and produces the expected results, but its execution time is only around 5 to 10 times faster than the single-threaded CPU version. So I would like to know how to increase the work per thread before the reduction in order to increase performance.
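For reference, the computation described above can be sketched on the CPU in plain Python (this sketch is illustrative, not part of the original post):

```python
# CPU reference for the pairwise squared Euclidean distances:
# C[i][j] = sum_k (A[i][k] - B[j][k])^2, with A of shape n x SIZE and B of shape m x SIZE.
def pairwise_sq_distances(A, B):
    return [[sum((a - b) * (a - b) for a, b in zip(row_a, row_b))
             for row_b in B]
            for row_a in A]

A = [[0.0, 0.0], [1.0, 2.0]]
B = [[3.0, 4.0]]
print(pairwise_sq_distances(A, B))  # [[25.0], [8.0]]
```

Each of the n*m output elements is an independent dot-product-style loop over SIZE elements, which is what makes the problem map naturally onto a CUDA grid.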
Kernel code (original : Not optimized)
__global__ void EuclideanDistances( float *A, float *B , float *C , int n , int m)
{
// SIZE is equal to 128
__shared__ float accumResult[SIZE];
float sA;
float sB;
// MAPPING
int bx = blockIdx.x; // n
int by = blockIdx.y; // m
int ty = threadIdx.y; // 128
int tx = threadIdx.x; // 1
sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();
accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();
// Parallel tree-reduction
for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1)
if (ty < stride)
{
accumResult[ty] += accumResult [stride + ty];
__syncthreads();
}
// Writing results to output matrix
if ((threadIdx.y == 0))
C [bx * m + by] = accumResult[ty];
__syncthreads();
}
UPDATE
Now, I'm using another mapping: instead of taking a grid of n by m blocks and a block of 128 threads, I'm increasing the number of threads within a block in order to decrease the number of blocks.
New mapping:
Block of 128 by 8 threads (total of 1024 threads, which is the max size)
Grid of n/8 by m/8 blocks
Unfortunately, it's giving wrong results.
Optimized kernel code (to be updated)
__global__ void EuclideanDistances( float *A, float *B , float *C, int n , int m)
{
__shared__ float accumResult[SIZE][8];
__shared__ float sA[SIZE][8];
__shared__ float sB[SIZE][8];
int bx = blockIdx.x; // n / 8
int by = blockIdx.y; // m / 8
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = bx * tx * SIZE + ty;
int j = by * tx * SIZE + ty;
sA[ty][tx] = A [i];
sB[ty][tx] = B[j];
__syncthreads();
accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
__syncthreads();
// Reduction
for (int stride = SIZE/2 ; stride > 0 ; stride>>=1)
if (ty < stride)
{
accumResult[ty][tx] += accumResult [stride + ty][tx];
__syncthreads();
}
C[bx * m + by] = accumResult[0][tx];
}
HOST CODE (allocations + kernel calls)
int main()
{
int m = 20000; //MatrixB size : m * SIZE
int n = 4000; //MatrixA size : n * SIZE
srand((unsigned)time(0));
// Host Allocations
float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
for(int i=0; i < n * SIZE; i++)
matrixA[i] = (float) (rand()%100)+1;
float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
for(int i=0; i < m * SIZE; i++)
matrixB[i] = (float) (rand()%100)+1;
float *results_kernel1 = (float *) malloc (n * m * sizeof(float));
float *results_kernel2 = (float *) malloc (n * m * sizeof(float));
//Device Allocation
float *d_matrixA;
float *d_matrixB;
cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
float *d_results_kernel1;
float *d_results_kernel2;
cudaMalloc((void **)&d_results_kernel1 , n * m * sizeof(float));
cudaMalloc((void **)&d_results_kernel2 , n * m * sizeof(float));
dim3 threads1 (1 , 128);
dim3 blocks1 (n , m);
EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel1 , n , m);
cudaDeviceSynchronize();
cudaMemcpy(results_kernel1 , d_results_kernel1 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
cudaFree(d_results_kernel1);
dim3 threads2 (8 , 128); // 1024 threads per block (maximum)
dim3 blocks2 (ceil((float)n/8) , ceil((float)m/8));
EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel2 , n , m);
cudaDeviceSynchronize();
cudaMemcpy(results_kernel2 , d_results_kernel2 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
cudaFree(d_results_kernel2);
// Visualising and comparing results
for (int i = 0 ; i < 50 ; i++)
std::cout << "kernel1 : " << results_kernel1[i] << " | kernel2 : " << results_kernel2[i] << std::endl;
free(matrixA);
free(matrixB);
free(results_kernel1);
free(results_kernel2);
return 0;
}
PS: I have CUDA 6.0 with an NVIDIA GTX 650 (compute capability 3.0)
Solution
It seems your question has two components:
- why isn't my second kernel working?
- how do I make my code run faster?
Why isn't my second kernel working?
You had several issues:
- indexing problems in the initial calculation of i and j, as well as in the index for storing the C value
- violation of the usage of __syncthreads() inside a conditional block
Item 1 was the key element to get the code working.
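A quick way to see why item 1 matters: the original index computation `i = bx * tx * SIZE + ty` multiplies the block and thread indices, so different (bx, tx) pairs collide on the same row (and every pair with tx == 0 lands on row 0), while the corrected `i = ((bx * 8) + tx) * SIZE + ty` enumerates distinct rows. A small Python sketch (illustrative only):

```python
SIZE = 128

def broken_index(bx, tx, ty):
    # original (buggy) mapping: bx and tx are multiplied, so rows collide
    return bx * tx * SIZE + ty

def fixed_index(bx, tx, ty):
    # corrected mapping: each (bx, tx) pair addresses a unique row of A
    return ((bx * 8) + tx) * SIZE + ty

# Enumerate the base addresses (ty = 0) touched by 4 blocks of 8 threads each.
broken = {broken_index(bx, tx, 0) for bx in range(4) for tx in range(8)}
fixed = {fixed_index(bx, tx, 0) for bx in range(4) for tx in range(8)}
print(len(broken), len(fixed))  # 16 32 -- the buggy mapping collides; the fix gives 32 distinct rows
```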
How do I make my code run faster?
This is more involved. First of all, your attempt at "increasing work per thread" didn't do anything of the kind; it merely increased the number of threads per block (from 128 to 8*128). Each thread was doing approximately the same amount of work. Furthermore, in the process of going to a 2D threadblock for this attempt, I believe a couple of bad things happened:
- various coalescing and shared-memory-bank-conflict load and store patterns were broken.
- effective occupancy went down, due to the amount of shared memory required per block.
The net effect of the second kernel was to approximately double the execution time. So that is not what we want.
However, increasing work per thread may be a good idea, along with using shared memory, as well as trying to preserve good (global, shared) memory access patterns, as well as allowing for increased occupancy.
What follows is a work-in-progress along those lines. The following code has your second kernel fixed, along with timing infrastructure, full data verification, and 2 new kernels. The first new kernel (#3) is what I would call a "naive" kernel. It simply allocates one thread per output point, and each thread loops through the necessary vectors, computing its individual result. There is no usage of shared memory, or even much attention to coalescing or any other optimization. However, with a tweak to the threadblock configuration from (16,16) to (8,32) threads, which I observed from @talonmies' answer (now deleted), this kernel performs significantly (3x) faster than your "fast" kernel. After further thought about the (8,32) observation, I concluded that the next attempt at optimization should focus on:
- elimination of the usage of a parallel reduction to compute the vector distance (i.e. allow adjacent threads to use a straight for-loop to loop through the vectors)
- maximization of benefit from the cache
- efficient usage of shared memory
- insist on perfect global coalescing/perfect usage of shared memory for all reads and writes
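To make items 1 and 4 concrete, here is a hedged Python sketch (illustrative names, tiny sizes) of the access pattern kernel 4 uses: with A stored transposed (column-major), the "threads" tx = 0..n-1 all read A_T[i*n + tx] at the same loop step i, i.e. consecutive addresses, which corresponds to a coalesced global load, and a straight for-loop over i replaces the parallel reduction:

```python
SIZE, n = 4, 3  # tiny illustrative sizes (the post uses SIZE = 128)

# A stored transposed / column-major: A_T[(i * n) + tx] is element i of vector tx.
A_rows = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
A_T = [A_rows[tx][i] for i in range(SIZE) for tx in range(n)]
b = [1.0, 2.0, 3.0, 4.0]  # one B vector, staged in "shared memory"

# Each "thread" tx walks a straight for-loop over the vector elements;
# at step i, threads 0..n-1 touch A_T[i*n + 0] .. A_T[i*n + n-1]: consecutive addresses.
results = []
for tx in range(n):
    acc = 0.0
    for i in range(SIZE):
        temp = A_T[(i * n) + tx] - b[i]
        acc += temp * temp
    results.append(acc)
print(results)  # [0.0, 30.0, 14.0]: squared distances of the three A rows to b
```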
Item 4 prompted the question in the comments "may I transpose the matrices?" With this permission, it's possible to re-organize the data to facilitate item 4 above. Item 2 above is addressed in my "fast" kernel (#4) by loading the B vector into shared memory, while allowing the cache to mostly focus on caching the A vectors, hopefully reducing cache-thrashing (A is the smaller of the 2 vector arrays, at about 2MB; Fermi L2 is 768K, Kepler L2 is 1.5MB). By delivering A in transposed form, and effectively "transposing" B on-chip from shared memory, it's possible to use a straight for-loop to compute the vector distance, while allowing adjacent threads to have perfectly coalesced reads and writes, as well as "efficient" use of shared memory (i.e. non-bank-conflicted loads and broadcast reads).
For my particular timing, (Quadro5000 cc2.0 GPU, CUDA 6, RHEL 5.5) I see that your "fast" kernel requires about 2 seconds, my "naive" kernel requires about 0.7 seconds, and my "fast" kernel requires about 0.2 seconds, albeit with transposed (A,C) data.
EDIT: I've made one additional optimization, which is to have each block compute multiple (CHKSIZE) B vectors at one time. You can set CHKSIZE to 1 to see the previous result (~0.2 sec). I found a CHKSIZE of 4 gave a good improvement. This is an attempt to exploit the data re-use of A. With this additional optimization at a CHKSIZE of 4, the kernel time for kernel 4 drops to about 0.1 seconds.
Following is the code and a sample run:
$ cat t460.cu
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
// both M and N must be evenly divisible by SIZE, M must be evenly divisible by CHKSIZE
#define SIZE 128
#define N 4000
#define M 20000
#define CHKSIZE 4
__global__ void EuclideanDistances1( float *A, float *B , float *C , int n , int m)
{
// SIZE is equal to 128
__shared__ float accumResult[SIZE];
float sA;
float sB;
// MAPPING
int bx = blockIdx.x; // n
int by = blockIdx.y; // m
int ty = threadIdx.y; // 128
//int tx = threadIdx.x; // 1
sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();
accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();
// Parallel tree-reduction
for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1){
if (ty < stride)
{
accumResult[ty] += accumResult [stride + ty];
}
__syncthreads();
}
// Writing results to output matrix
if ((ty == 0))
C [bx * m + by] = accumResult[ty];
__syncthreads();
}
__global__ void EuclideanDistances2( float *A, float *B , float *C, int n , int m)
{
__shared__ float accumResult[SIZE][8];
__shared__ float sA[SIZE][8];
__shared__ float sB[SIZE][8];
int bx = blockIdx.x; // n / 8
int by = blockIdx.y; // m
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = ((bx*8) + tx) * SIZE + ty;
int j = by * SIZE + ty;
sA[ty][tx] = A[i];
sB[ty][tx] = B[j];
__syncthreads();
accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
__syncthreads();
// Reduction
for (int stride = SIZE/2 ; stride > 0 ; stride>>=1){
if (ty < stride)
{
accumResult[ty][tx] += accumResult [stride + ty][tx];
}
__syncthreads();
}
if (ty == 0)
C[((bx*8)+tx) * m + by] = accumResult[0][tx];
}
//naive kernel
__global__ void EuclideanDistances3( float *A, float *B , float *C, int n , int m){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
float result = 0.0f;
if ((idx < n) && (idy < m)){
for (int i = 0; i < SIZE; i++){
float temp = A[(idx*SIZE)+i] - B[(idy*SIZE)+i];
result += temp * temp;}
C[(idx*m) + idy] = result;
}
}
//optimized kernel
__global__ void EuclideanDistances4( const float *A, const float *B , float *C, const int n , const int m){
// n, A, 4000 this kernel assumes A is column-major A(SIZE, n)
// m, B, 20000 this kernel assumes B is row-major B(m, SIZE)
// this kernel assumes C is column-major C(m,n)
// this kernel assumes number of threads per threadblock == SIZE
// CHKSIZE is the number of B vectors that will be computed per block
__shared__ float my_sB[CHKSIZE*SIZE]; // enough shared storage for CHKSIZE vectors of B
int bx = blockIdx.x; // one block per CHKSIZE rows of B (the larger input matrix)
while ((bx*CHKSIZE) < m){ // not used, this while loop could be used to extend a block to multiple chunks
int tx = threadIdx.x;
for (int i = 0; i < CHKSIZE; i++) // load vectors of B into shared memory
my_sB[(i*SIZE)+tx] = B[(((bx*CHKSIZE)+i)*SIZE)+tx];
__syncthreads();
while (tx < n){ //loop across all vectors in A
float result[CHKSIZE];
for (int i = 0; i < CHKSIZE; i++)
result[i] = 0.0f;
for (int i = 0; i < SIZE; i++){
float Atemp = A[(n*i)+tx];
for (int j = 0; j < CHKSIZE; j++){ // compute all CHKSIZE B vectors with read of A
float temp = Atemp - my_sB[i + (j*SIZE)];
result[j] += temp * temp;}}
for (int i = 0; i < CHKSIZE; i++) // store CHKSIZE results
C[((i+(bx*CHKSIZE))*n)+ tx] = result[i];
tx += blockDim.x; } // continue looping across vectors in A
__syncthreads(); // necessary to prevent warps from racing ahead, if block looping is used
bx += gridDim.x;}
}
float comp_euclid_sq(const float *rA, const float *rB, const int size){
float result = 0.0f;
float temp;
for (int i = 0; i < size; i++){
temp = (rA[i] - rB[i]);
result += temp * temp;}
return result;
}
int main()
{
float et1=0.0f, et2=0.0f, et3=0.0f, et4=0.0f;
cudaEvent_t start1, start2, start3,start4, stop1, stop2, stop3, stop4;
cudaEventCreate(&start1);
cudaEventCreate(&start2);
cudaEventCreate(&start3);
cudaEventCreate(&start4);
cudaEventCreate(&stop1);
cudaEventCreate(&stop2);
cudaEventCreate(&stop3);
cudaEventCreate(&stop4);
int n = N; //MatrixA size : n * SIZE
int m = M; //MatrixB size : m * SIZE
srand((unsigned)time(0));
// Host Allocations
float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
for(int i=0; i < n * SIZE; i++)
matrixA[i] = (float) (rand()%100)+1;
float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
for(int i=0; i < m * SIZE; i++)
matrixB[i] = (float) (rand()%100)+1;
float *results_kernel = (float *) malloc (n * m * sizeof(float));
float *cpu_results_kernel = (float *) malloc (n * m * sizeof(float));
for (int i = 0; i< n*m; i++)
cpu_results_kernel[i] = comp_euclid_sq(matrixA + ((i/m)*SIZE), matrixB + (i%m)*SIZE, SIZE);
//Device Allocation
float *d_matrixA;
float *d_matrixB;
cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
float *d_results_kernel;
cudaMalloc((void **)&d_results_kernel , n * m * sizeof(float));
dim3 threads1 (1 , SIZE);
dim3 blocks1 (n , m);
cudaEventRecord(start1);
EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop1);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
for (int i = 0; i< n*m; i++) {
if (results_kernel[i] != cpu_results_kernel[i]) {printf("cpu/kernel1 mismatch at %d, cpu: %f, kernel1: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
cudaEventSynchronize(stop1);
cudaEventElapsedTime(&et1, start1, stop1);
dim3 threads2 (8 , SIZE); // 1024 threads per block (maximum)
dim3 blocks2 (n/8 , m); // assumes n evenly divisible by 8
cudaEventRecord(start2);
EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop2);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
for (int i = 0; i< n*m; i++) {
if (results_kernel[i] != cpu_results_kernel[i]) {printf("cpu/kernel2 mismatch at %d, cpu: %f, kernel2: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
cudaEventSynchronize(stop2);
cudaEventElapsedTime(&et2, start2, stop2);
cudaFuncSetCacheConfig(EuclideanDistances3, cudaFuncCachePreferL1);
dim3 threads3 (8, 32); // 1024 threads per block (maximum)
dim3 blocks3 (n/threads3.x , m/threads3.y); // assumes evenly divisible
cudaEventRecord(start3);
EuclideanDistances3 <<<blocks3 , threads3>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop3);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
for (int i = 0; i< n*m; i++) {
if (results_kernel[i] != cpu_results_kernel[i]) {printf("cpu/kernel3 mismatch at %d, cpu: %f, kernel3: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
cudaEventSynchronize(stop3);
cudaEventElapsedTime(&et3, start3, stop3);
// transpose matrix A
float *matrixA_T = (float *) malloc (n * SIZE * sizeof(float));
for (int i = 0; i < n; i++)
for (int j = 0; j < SIZE; j++)
matrixA_T[(j*n)+i] = matrixA[(i*SIZE)+j];
cudaMemcpy(d_matrixA , matrixA_T , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
cudaFuncSetCacheConfig(EuclideanDistances4, cudaFuncCachePreferL1);
dim3 threads4(SIZE); // one thread per vector element
dim3 blocks4(m/CHKSIZE);
cudaEventRecord(start4);
EuclideanDistances4 <<<blocks4 , threads4>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop4);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
// test for correct transposed result C(m,n)
for (int i = 0; i< n; i++)
for (int j = 0; j < m; j++)
if (results_kernel[(j*n)+i] != cpu_results_kernel[(i*m)+j]) {printf("cpu/kernel4 mismatch at %d,%d, cpu: %f, kernel4: %f\n", i,j, cpu_results_kernel[(i*m)+j], results_kernel[(j*n)+i]); return 1;}
cudaEventSynchronize(stop4);
cudaEventElapsedTime(&et4, start4, stop4);
cudaFree(d_results_kernel);
printf("Success!\n");
printf("kernel1 : %.fms, kernel2 : %.fms, kernel3 : %.fms, kernel4 : %.fms\n", et1, et2, et3, et4);
free(matrixA);
free(matrixB);
free(results_kernel);
return 0;
}
$ nvcc -O3 -arch=sm_20 -o t460 t460.cu
$ ./t460
Success!
kernel1 : 2213ms, kernel2 : 4660ms, kernel3 : 691ms, kernel4 : 99ms
$
Hopefully that will get you going with more ideas of things to work on. You may get different timings of course on your cc3.0 device.
Are further optimizations possible? Probably. The first target I would look at would be figuring out how to take advantage of the data-reuse opportunities on vector A. (Data re-use of vector B is already handled in kernel 4 by loading it into shared memory. There may be ways to use some shared memory to store portions of A to make the code run even faster.)
I guess I should also mention that following the lead of the code you provided, this code is computing the square of the euclidean distance. A trivial modification to the kernels can make it compute the actual euclidean distance instead (C[...] = sqrtf(...);
) The validation I have included, however, assumes the results are "in-range" for perfect storage of an integer quantity in a float
. Your test case satisfies this requirement, but otherwise the validation code would need to be modified (if sqrtf
were used).