Is there a formula to compute the number of exponent or significand bits in a floating-point number?

2022-03-01 00:00:00 floating-point c bit-manipulation c++

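Recently, I have become interested in using bit shifts on floating-point numbers to do some fast calculations.

To make them work in a more generic way, I would like my functions to work with different floating-point types, probably through templates. These types are not limited to float and double, but also include "half-width" or "quadruple-width" floating-point numbers and so on.

Then I noticed: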

 - Half   ---  5 exponent bits  ---  10 significant bits
 - Float  ---  8 exponent bits  ---  23 significant bits
 - Double --- 11 exponent bits  ---  52 significant bits

So far I had assumed exponent bits = logbase2(total bytes) * 3 + 2,
which means a 128-bit float should have 14 exponent bits and a 256-bit float should have 17 exponent bits.

However, I then learned:

 - Quad   --- 15 exponent bits  ---  112 significant bits
 - Octuple--- 19 exponent bits  ---  236 significant bits
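
The mismatch can be checked at compile time. The following is only a minimal sketch: `ilog2` and `guessed_exponent_bits` are hypothetical helper names introduced here, and the guess is evaluated on the total size in bytes exactly as written in the question.

```cpp
// Integer base-2 logarithm; exact for the power-of-two sizes used here.
constexpr int ilog2(int n) { return n <= 1 ? 0 : 1 + ilog2(n / 2); }

// The guess from the question: 3 * log2(total bytes) + 2.
constexpr int guessed_exponent_bits(int total_bits) {
    return 3 * ilog2(total_bits / 8) + 2;
}

// The guess reproduces the three widths it was fitted to ...
static_assert(guessed_exponent_bits(16) == 5, "half");
static_assert(guessed_exponent_bits(32) == 8, "float");
static_assert(guessed_exponent_bits(64) == 11, "double");

// ... but it yields 14 and 17, whereas quad and octuple use 15 and 19.
static_assert(guessed_exponent_bits(128) == 14, "guess for quad");
static_assert(guessed_exponent_bits(256) == 17, "guess for octuple");
```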

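So, is there a formula to find these widths? Or is there a way to get them through some built-in function?
C or C++ is preferred, but I am open to other languages.

Thanks.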

Solution

Characteristics provided via built-in facilities

C++ provides this information via the std::numeric_limits template:

#include <iostream>
#include <limits>
#include <cmath>


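template<typename T> void ShowCharacteristics()
{
    // Base of the representation (2 for binary floating-point formats).
    int radix = std::numeric_limits<T>::radix;

    std::cout << "The floating-point radix is " << radix << ".\n";

    // digits counts the significand digits, including the leading one.
    std::cout << "There are " << std::numeric_limits<T>::digits
        << " base-" << radix << " digits in the significand.\n";

    // Exponent bounds in the C/C++ convention (significand in [1/2, 1)).
    int min = std::numeric_limits<T>::min_exponent;
    int max = std::numeric_limits<T>::max_exponent;

    std::cout << "Exponents range from " << min << " to " << max << ".\n";

    // Enough bits to encode every exponent value in that range.
    std::cout << "So there must be " << std::ceil(std::log2(max-min+1))
        << " bits in the exponent field.\n";
}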


int main()
{
    ShowCharacteristics<double>();
}

Sample output:

The floating-point radix is 2.
There are 53 base-2 digits in the significand.
Exponents range from -1021 to 1024.
So there must be 11 bits in the exponent field.
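
If the widths are needed as compile-time constants (for example inside a bit-manipulation template), the same derivation can be written without floating-point math. This is a sketch only: `bits_to_hold`, `exponent_field_bits` and `stored_fraction_bits` are hypothetical names, and the fraction-bit helper assumes a radix-2 format with an implicit leading bit, which is not true of the x87 80-bit extended format.

```cpp
#include <limits>

// Smallest b such that 2^b >= n (integer-only stand-in for ceil(log2(n))).
constexpr int bits_to_hold(int n, int b = 0) {
    return (1 << b) >= n ? b : bits_to_hold(n, b + 1);
}

// Width of the exponent field, derived from the exponent range.
template <typename T>
constexpr int exponent_field_bits() {
    return bits_to_hold(std::numeric_limits<T>::max_exponent
                        - std::numeric_limits<T>::min_exponent + 1);
}

// Stored fraction bits. Assumption: radix 2 with an implicit leading bit.
template <typename T>
constexpr int stored_fraction_bits() {
    static_assert(std::numeric_limits<T>::radix == 2, "binary formats only");
    return std::numeric_limits<T>::digits - 1;
}

// These hold where float and double are IEEE-754 binary32 and binary64.
static_assert(exponent_field_bits<float>() == 8, "float exponent field");
static_assert(stored_fraction_bits<float>() == 23, "float fraction field");
static_assert(exponent_field_bits<double>() == 11, "double exponent field");
static_assert(stored_fraction_bits<double>() == 52, "double fraction field");
```

Whether a given implementation offers std::numeric_limits specializations for half or quad types is implementation-specific, so the template arguments beyond float, double and long double should be checked per compiler.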

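C also provides this information through macros such as DBL_MANT_DIG defined in <float.h>, but the standard defines names only for the types float (prefix FLT), double (DBL), and long double (LDBL), so the names in a C implementation that supports other floating-point types are not predictable.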
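
The same macros are visible from C++ through <cfloat>, where they are specified to match the corresponding std::numeric_limits members. A minimal sketch of that correspondence, using only the three standard prefixes:

```cpp
#include <cfloat>
#include <limits>

// Radix and significand digits: macro on the left, C++ trait on the right.
static_assert(FLT_RADIX == std::numeric_limits<float>::radix, "radix");
static_assert(FLT_MANT_DIG == std::numeric_limits<float>::digits, "float");
static_assert(DBL_MANT_DIG == std::numeric_limits<double>::digits, "double");
static_assert(LDBL_MANT_DIG == std::numeric_limits<long double>::digits,
              "long double");

// Exponent bounds, in the same C/C++ convention as min_exponent/max_exponent.
static_assert(DBL_MIN_EXP == std::numeric_limits<double>::min_exponent, "min");
static_assert(DBL_MAX_EXP == std::numeric_limits<double>::max_exponent, "max");
```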

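Note that the exponent specified in the C and C++ standards is off by one from the usual exponent described in IEEE-754: it is adjusted for a significand scaled to [1/2, 1) instead of [1, 2), so it is one greater than the usual IEEE-754 exponent. (The example above shows the exponent ranging from -1021 to 1024, but the IEEE-754 exponent ranges from -1022 to 1023.)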
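
The off-by-one can be observed directly: std::frexp normalizes the significand to [1/2, 1), while std::ilogb reports the exponent for a significand in [1, 2). The sketch below uses hypothetical wrapper names (`c_style_exponent`, `ieee_style_exponent`, `check_exponent_offset`) and assumes a radix-2 double.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Exponent e with x = m * 2^e and m in [1/2, 1) (C/C++ convention).
inline int c_style_exponent(double x) {
    int e = 0;
    std::frexp(x, &e);
    return e;
}

// Exponent e with x = m * 2^e and m in [1, 2) (IEEE-754 convention).
inline int ieee_style_exponent(double x) {
    return std::ilogb(x);
}

// Hypothetical self-check; call it from main() to run the assertions.
inline void check_exponent_offset() {
    using lim = std::numeric_limits<double>;
    // 1.0 is 0.5 * 2^1 in one convention and 1.0 * 2^0 in the other.
    assert(c_style_exponent(1.0) == 1);
    assert(ieee_style_exponent(1.0) == 0);
    // The numeric_limits bounds follow the C/C++ convention.
    assert(c_style_exponent(lim::max()) == lim::max_exponent);
    assert(ieee_style_exponent(lim::max()) == lim::max_exponent - 1);
    assert(ieee_style_exponent(lim::min()) == lim::min_exponent - 1);
}
```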

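Formulas

IEEE-754 does provide formulas for recommended field widths, but it does not require IEEE-754 implementations to conform to them, and of course the C and C++ standards do not require C and C++ implementations to conform to IEEE-754. The interchange format parameters are specified in IEEE 754-2008 3.6, and the binary parameters are:

For a floating-point format of 16, 32, 64, or 128 bits, the significand width (including the leading bit) shall be 11, 24, 53, or 113 bits, and the exponent field width shall be 5, 8, 11, or 15 bits.
Otherwise, for a floating-point format of k bits, k shall be a multiple of 32, the significand width shall be k - round(4·log₂(k)) + 13, and the exponent field width shall be round(4·log₂(k)) - 13.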
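
These rules can be sketched as a small lookup-plus-formula function. The names `FieldWidths` and `interchange_widths` are hypothetical, and std::lround is used here as an assumed stand-in for the standard's round-to-nearest-integer.

```cpp
#include <cassert>
#include <cmath>

struct FieldWidths {
    int exponent_bits;     // width of the exponent field
    int significand_bits;  // significand width, including the leading bit
};

// Recommended binary interchange widths for a k-bit format.
inline FieldWidths interchange_widths(int k) {
    // The four small formats are tabulated rather than computed.
    switch (k) {
        case 16:  return {5, 11};
        case 32:  return {8, 24};
        case 64:  return {11, 53};
        case 128: return {15, 113};
    }
    // Larger formats: k must be a multiple of 32.
    assert(k > 128 && k % 32 == 0);
    const int r = static_cast<int>(
        std::lround(4.0 * std::log2(static_cast<double>(k))));
    return {r - 13, k - r + 13};
}
```

For k = 256 this gives round(4·8) = 32, so 19 exponent bits and a 237-bit significand, matching the octuple format. With the sign bit and the implicit leading bit taken into account, the widths add up to k: 1 + 19 + 236 = 256.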
