Do CUDA C++ math functions such as exp have overloaded versions by default?

2022-01-10 · cuda · c++

The problem comes from the documentation, where I found two functions, exp and expf. It is said that exp means double exp(double) and expf means float expf(float). I wonder if exp can have default overloaded versions such as float exp(float) or fp16 exp(fp16), or must I use different functions when the inputs have different types?

Consider a scenario where I use a template:

template <typename T>
T compute(T in) { return exp(in); }

If there is no default float exp(float), I cannot use compute<float>(1.f) to call this template function. I know that I can call the function that way, but I do not know how the compiler deals with it. When I call exp(1.f), does the compiler first cast the input to double and then cast the return value back to float, or does it use the float value as input directly?

Answer

It is said that exp means double exp(double) and expf means float expf(float). I wonder if exp can have a default overloaded version such as float exp(float) ...

Yes, the CUDA compiler does what a normal C++ compiler does and will transparently select the correctly overloaded version of the function for the type. This works for float and double ...

... or fp16 exp(fp16).

... but it does not presently work for half precision floating point.

For example:

$ cat overlay.cu
#include <cuda_fp16.h>

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // one thread per element

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];  // unqualified exp resolves against T's overloads
}

// Explicit instantiations for the supported element types
template __global__ void kernel<float>(const float*, const float*, float*, int);
template __global__ void kernel<double>(const double*, const double*, double*, int);

This compiles correctly:

$ nvcc -arch=sm_70 -Xptxas="-v" -c overlay.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIdEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIdEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 380 bytes cmem[0], 88 bytes cmem[2]
ptxas info    : Compiling entry function '_Z6kernelIfEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 380 bytes cmem[0]

But adding

template __global__ void kernel<__half>(const __half*, const __half*, __half*, int);

fails:

$ nvcc -arch=sm_70 -Xptxas="-v" -c overlay.cu
overlay.cu(9): error: more than one instance of overloaded function "exp" matches the argument list:
            function "std::exp(long double)"
            function "std::exp(float)"
            argument types are: (const __half)
          detected during instantiation of "void kernel(const T *, const T *, T *, int) [with T=__half]"

As pointed out in the comments, C++14/C++17 do not define a standardized half precision type or standard library functions for one, so this error is very much in line with the expected behaviour.
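
One possible workaround, sketched here without testing (it is not part of the original answer), is to supply the missing overload yourself as a thin wrapper over the hexp intrinsic from cuda_fp16.h, so that the generic template compiles unchanged:

#include <cuda_fp16.h>

// Untested sketch: hand-written exp overload for __half, built on the
// hexp intrinsic (device-only, requires sm_53 or newer)
__device__ __half exp(__half x) { return hexp(x); }

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];  // now finds exp(__half) by exact match
}

template __global__ void kernel<__half>(const __half*, const __half*, __half*, int);

Declaring the overload before the template definition keeps name lookup straightforward.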

If you want a half precision version, then I suggest using an explicit template specialization for the fp16 case which exploits the (most performant) intrinsic for the type, for example:

#include <cuda_fp16.h>

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];
}

template __global__ void kernel<float>(const float*, const float*, float*, int);
template __global__ void kernel<double>(const double*, const double*, double*, int);

// Explicit specialization for __half built on the hexp intrinsic
template<> __global__ void kernel(const __half* x, const __half* y, __half* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = hexp(x[tid]) * y[tid];
}

This is probably the most efficient implementation available at this stage, and it compiles as expected:

$ nvcc -std=c++11 -arch=sm_70 -Xptxas="-v" -c overlay.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelI6__halfEvPKT_S3_PS1_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelI6__halfEvPKT_S3_PS1_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 380 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelIdEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIdEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 380 bytes cmem[0], 88 bytes cmem[2]
ptxas info    : Compiling entry function '_Z6kernelIfEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 380 bytes cmem[0]
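
For completeness, here is a hypothetical host-side launcher for the __half instantiation, appended to the same overlay.cu. This is my own sketch rather than part of the original answer, it omits error checking, and it assumes a CUDA toolkit where __float2half and __half2float are host-callable:

#include <cstdio>
#include <vector>

int main()
{
    const int N = 1024;

    // Fill host buffers: x = 0.5 and y = 2.0 for every element
    std::vector<__half> hx(N, __float2half(0.5f));
    std::vector<__half> hy(N, __float2half(2.0f));
    std::vector<__half> hout(N);

    __half *dx, *dy, *dout;
    cudaMalloc(&dx,   N * sizeof(__half));
    cudaMalloc(&dy,   N * sizeof(__half));
    cudaMalloc(&dout, N * sizeof(__half));
    cudaMemcpy(dx, hx.data(), N * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), N * sizeof(__half), cudaMemcpyHostToDevice);

    // One thread per element, matching the tid computation in the kernel
    kernel<__half><<<(N + 255) / 256, 256>>>(dx, dy, dout, N);
    cudaMemcpy(hout.data(), dout, N * sizeof(__half), cudaMemcpyDeviceToHost);

    // Expect exp(0.5) * 2.0, roughly 3.297 within half precision
    printf("out[0] = %f\n", __half2float(hout[0]));

    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    return 0;
}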

[Answer assembled from comments, with some editorialisation of my own, to get the question off the unanswered list for the CUDA tag. Please edit and improve as you see fit.]
