重新审视浮点比较

这个话题在 StackOverflow 上出现过很多次,但我相信这是一个新的尝试.是的,我已阅读 Bruce Dawson 的文章 和 每个计算机科学家都应该知道的浮点运算和这个不错的答案.

This topic has come up many times on StackOverflow, but I believe this is a new take. Yes, I have read Bruce Dawson's articles and What Every Computer Scientist Should Know About Floating-Point Arithmetic and this nice answer.

据我了解,在典型系统上比较浮点数是否相等时存在四个基本问题:

As I understand it, on a typical system there are four basic problems when comparing floating-point numbers for equality:

  1. 浮点计算不精确
  2. a-b是否小"取决于ab
  3. 的规模
  4. ab 是否小"取决于ab 的类型(例如float、double、long double)
  5. 浮点通常具有 +-infinity、NaN 和非规范化表示,其中任何一种都可能干扰朴素的公式
  1. Floating point calculations are not exact
  2. Whether a-b is "small" depends on the scale of a and b
  3. Whether a-b is "small" depends on the type of a and b (e.g. float, double, long double)
  4. Floating point typically has +-infinity, NaN, and denormalized representations, any of which can interfere with a naïve formulation

这个答案――又名.谷歌方法"――似乎很流行.它确实处理了所有棘手的情况.它确实非常精确地扩展了比较,检查两个值是否在 ULPs 的固定数量内彼此的.因此,例如,一个非常大的数字将几乎等于"比作无穷大.

This answer -- aka. "the Google approach" -- seems to be popular. It does handle all of the tricky cases. And it does scale the comparison very precisely, checking whether two values are within a fixed number of ULPs of each other. Thus, for example, a very large number compares "almost equal" to infinity.

但是:

  • 在我看来,这非常混乱.
  • 它不是特别便携,严重依赖内部表示,使用联合从浮点数中读取位等.
  • 它只处理单精度和双精度 IEEE 754(特别是没有 x86 long double)

我想要类似的东西,但使用标准 C++ 并处理长双精度数.标准"是指 C++03(如果可能)和 C++11(如果需要).

I want something similar, but using standard C++ and handling long doubles. By "standard", I mean C++03 if possible and C++11 if necessary.

这是我的尝试.

#include <cmath>
#include <limits>
#include <algorithm>

namespace {
// Local version of frexp() that handles infinities specially.
template<typename T>
T my_frexp(const T num, int *exp)
{
    typedef std::numeric_limits<T> limits;

    // Treat +-infinity as +-(2^max_exponent).
    if (std::abs(num) > limits::max())
    {
        *exp = limits::max_exponent + 1;
        return std::copysign(0.5, num);
    }
    else return std::frexp(num, exp);
}
}

template<typename T>
bool almostEqual(const T a, const T b, const unsigned ulps=4)
{
    // Handle NaN.
    if (std::isnan(a) || std::isnan(b))
        return false;

    typedef std::numeric_limits<T> limits;

    // Handle very small and exactly equal values.
    if (std::abs(a-b) <= ulps * limits::denorm_min())
        return true;

    // frexp() does the wrong thing for zero.  But if we get this far
    // and either number is zero, then the other is too big, so just
    // handle that now.
    if (a == 0 || b == 0)
        return false;

    // Break the numbers into significand and exponent, sorting them by
    // exponent.
    int min_exp, max_exp;
    T min_frac = my_frexp(a, &min_exp);
    T max_frac = my_frexp(b, &max_exp);
    if (min_exp > max_exp)
    {
        std::swap(min_frac, max_frac);
        std::swap(min_exp, max_exp);
    }

    // Convert the smaller to the scale of the larger by adjusting its
    // significand.
    const T scaled_min_frac = std::ldexp(min_frac, min_exp-max_exp);

    // Since the significands are now in the same scale, and the larger
    // is in the range [0.5, 1), 1 ulp is just epsilon/2.
    return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon() / 2;
}

我声称这段代码 (a) 处理所有相关情况,(b) 与 Google 实现的 IEEE-754 单精度和双精度实现相同,并且 (c) 是完全标准的 C++.

I claim that this code (a) handles all of the relevant cases, (b) does the same thing as the Google implementation for IEEE-754 single- and double-precision, and (c) is perfectly standard C++.

其中一项或多项声明几乎肯定是错误的.我会接受任何证明这一点的答案,最好是修复.一个好的答案应该包括以下一项或多项:

One or more of these claims is almost certainly wrong. I will accept any answer that demonstrates such, preferably with a fix. A good answer should include one or more of:

  • 特定输入的差异超过最后位置的 ulps 个单位,但此函数返回 true(差异越大越好)
  • 在 Last Place 中的特定输入最多相差 ulps 个单位,但此函数返回 false(差异越小越好)
  • 我错过的任何案例
  • 此代码依赖于未定义行为或因实现定义的行为而中断的任何方式.(如果可能,请引用相关规范.)
  • 解决您发现的任何问题
  • 在不破坏代码的情况下简化代码的任何方法
  • Specific inputs differing by more than ulps Units in Last Place, but for which this function returns true (the bigger the difference, the better)
  • Specific inputs differing by up to ulps Units in Last Place, but for which this function returns false (the smaller the difference, the better)
  • Any case(s) I have missed
  • Any way in which this code relies on undefined behavior or breaks depending on implementation-defined behavior. (If possible, please cite a relevant spec.)
  • Fixes for whatever problem(s) you identify
  • Any way to simplify the code without breaking it

我打算在这个问题上放一个不平凡的赏金.

I intend to place a non-trivial bounty on this question.

推荐答案

几乎等于"不是一个好函数

4 不是一个合适的值:您指出的答案是因此,4 应该足以满足日常使用",但没有任何依据.事实上,在普通情况下,通过不同方式以浮点计算的数字可能会因许多 ULP 不同而不同,即使如果通过精确数学计算它们会相等.因此,容差不应有默认值;应该要求每个用户提供自己的,希望基于对其代码的彻底分析.

"Almost Equals" Is Not a Good Function

4 is not an appropriate value: The answer you point to states "Therefore, 4 should be enough for ordinary use" but contains no basis for that claim. In fact, there are ordinary situations in which numbers calculated in floating-point by different means may differ by many ULP even though they would be equal if calculated by exact mathematics. Therefore, there should be no default value for the tolerance; each user should be required to supply their own, hopefully based on thorough analysis of their code.

作为默认值 4 ULP 不好的示例,请考虑 1./49*49-1.数学上精确的结果为 0,但计算结果(64 位 IEEE 754 二进制)为 2-53,误差超过精确结果的 10307 ULP,几乎1016 计算结果的 ULP.

As an example of why a default of 4 ULP is bad, consider 1./49*49-1. The mathematically exact result is 0, but the computed result (64-bit IEEE 754 binary) is 2?53, an error exceeding 10307 ULP of the exact result and almost 1016 ULP of the computed result.

有时,没有合适的值: 在某些情况下,容差不能与所比较的值相关,既不是数学上精确的相对容差,也不是量化的 ULP 容差.例如,FFT 中几乎每个输出值都受到几乎每个输入值的影响,并且任何一个元素的误差都与其他元素的大小有关.必须为几乎等于"例程提供额外的上下文以及有关潜在错误的信息.

Sometimes, no value is appropriate: In some cases, the tolerance cannot be relative to the values being compared, neither a mathematically exact relative tolerance nor a quantized ULP tolerance. For example, nearly every output value in an FFT is affected by nearly every input value, and the error in any one element is related to the magnitudes of other elements. An "almost equals" routine must be supplied additional context with information about the potential error.

几乎等于"的数学性质很差:这显示了几乎等于"的缺点之一:缩放会改变结果.下面的代码打印出 1 和 0.

"Almost Equals" has poor mathematical properties: This shows one of the shortcomings of "almost equals": Scaling changes the results. The code below prints 1 and 0.

double x0 = 1.1;
double x1 = 1.1 + 3*0x1p-52;
std::cout << almostEqual(x0, x1) << "
";
x0 *= .8;
x1 *= .8;
std::cout << almostEqual(x0, x1) << "
";

另一个失败是它不是传递的;almostEqual(a, b)almostEqual(b, c) 并不意味着 almostEqual(a, c).

Another failing is that it is not transitive; almostEqual(a, b) and almostEqual(b, c) does not imply almostEqual(a, c).

almostEqual(1.f, 1.f/11, 0x745d17) 错误地返回 1.

1.f/11 为 0x1.745d18p-4(十六进制浮点表示法,表示 1.745d1816?2?4).从 1(即 0x10p-4)中减去它,得到 0xe.8ba2e8p-4.由于 1 的 ULP 是 0x1p-23,即 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP(移位 20 位并除以 2,净 19 位)= 0x745d17.4 ULP.这超出了 0x745d17 的指定容差,因此正确答案为 0.

1.f/11 is 0x1.745d18p-4 (hexadecimal floating-point notation, meaning 1.745d1816?2?4). Subtracting this from 1 (which is 0x10p-4) yields 0xe.8ba2e8p-4. Since an ULP of 1 is 0x1p-23, that is 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP (shifted 20 bits and divided by 2, netting 19 bits) = 0x745d17.4 ULP. That exceeds the specified tolerance of 0x745d17, so the correct answer would be 0.

此错误是由 max_frac-scaled_min_frac 中的舍入引起的.

This error is caused by rounding in max_frac-scaled_min_frac.

解决这个问题的一个简单方法是指定 ulps 必须小于 .5/limits::epsilon.只有当差值(即使舍入后)超过 ulps 时,才会在 max_frac-scaled_min_frac 中进行舍入;如果差值小于这个值,则减法是精确的,根据 Sterbenz 引理.

An easy escape from this problem is to specify that ulps must be less than .5/limits::epsilon. Then rounding occurs in max_frac-scaled_min_frac only if the difference (even when rounded) exceeds ulps; if the difference is less than that, the subtraction is exact, by Sterbenz’ Lemma.

有人建议使用 long double 来纠正这个问题.但是,long double 不会更正此问题.考虑将 1 和 -0x1p-149f 与 ulps 设置为 1/limits::epsilon 进行比较.除非您的有效位有 149 位,否则减法结果会四舍五入为 1,即小于或等于 1/limits::epsilon ULP.然而,数学上的差异显然超过了 1.

There was a suggestion about using long double to correct this. However, long double would not correct this. Consider comparing 1 and -0x1p-149f with ulps set to 1/limits::epsilon. Unless your significand has 149 bits, the subtraction result rounds to 1, which is less than or equal to 1/limits::epsilon ULP. Yet the mathematical difference clearly exceeds 1.

表达式 factor * limits::epsilon/2factor 转换为浮点类型,这会导致 factor.很可能,例程不打算用于如此大的值(float 中的数百万 ULP),因此应该将其指定为例程的限制而不是错误.

The expression factor * limits::epsilon / 2 converts factor to the floating-point type, which causes rounding errors for large values of factor that are not exactly representable. Likely, the routine is not intended to be used with such large values (millions of ULPs in float), so this ought to be specified as a limit on the routine rather than a bug.

相关文章