What is the difference between float and double?
I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Recommended answer
Huge difference.
As the name implies, a double has 2x the precision of float [1]. In general, a double has about 15 decimal digits of precision, while a float has about 7.
Here's how the number of digits is calculated:
double has 52 mantissa bits + 1 hidden bit: log(2^53) ÷ log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2^24) ÷ log(10) = 7.22 digits
This loss of precision can cause larger rounding errors to accumulate when a calculation is repeated, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3.4e38, but that of double is about 1.8e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes even double isn't accurate enough, hence we sometimes have long double [1] (the above example gives 9.000000000000000066 on Mac), but all floating-point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating-point numbers, as the errors accumulate quickly. If you're using Python, use math.fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, on most compilers and architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating-point number (binary32), and double is an IEEE double-precision floating-point number (binary64).