Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:

unsigned div3(unsigned x) {
    return x / 3;
}

When compiled using clang 10 at -O3, this results in:

div3(unsigned int):
        mov     ecx, edi         # tmp = x
        mov     eax, 2863311531  # result = 3^-1
        imul    rax, rcx         # result *= tmp
        shr     rax, 33          # result >>= 33
        ret

What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3^-1 mod 2^32, which is 2863311531.

There are some things that I don't understand though:

  1. Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
  2. Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
  3. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
  4. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

Edit 1

For those who don't understand what I mean by 3^-1 mod 2^32, I am talking about the multiplicative inverse here. For example:

// multiplying with inverse of 3:
15 * 2863311531      = 42949672965
42949672965 mod 2^32 = 5

// using fixed-point multiplication
15 * 2863311531      = 42949672965
42949672965 >> 33    = 5

// simply dividing by 3
15 / 3               = 5

So multiplying with 2863311531 is actually equivalent to dividing by 3. I assumed clang's optimization was based on modular arithmetic, when it's really based on fixed-point arithmetic.

Edit 2

I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3^-1 is equal to 3^-1, not zero. Only fixed-point arithmetic has correct rounding.

Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.

unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}


(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)

Solution

  1. Can't we multiply rax with edi directly?

We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.

(If you can't avoid the zero-extension, doing it into a different register is better, because of limitations on mov-elimination.)

Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.

  2. Why do we multiply in 64-bit mode?

(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)

You could mul edi to multiply EAX with EDI and get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, whereas most modern x86-64 CPUs have a fast 64-bit imul. (imul r64, r64 is slower on AMD Bulldozer-family and on some low-power CPUs, though.) See https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF). (Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps that has something to do with not needing extra splitting of the integer multiply unit's output: mul edi has to split the 64-bit low-half multiplier output into EDX and EAX halves, whereas for a 64x64 => 128-bit mul that split happens naturally.)

Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)

GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/

# gcc8.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                 # 1 uop, stupid wasted instruction
        mov     edx, -1431655765         # 1 uop  (same 32-bit constant, just printed differently)
        mul     edx                      # 3 uops on Sandybridge-family
        mov     eax, edx                 # 1 uop
        shr     eax                      # 1 uop
        ret
                                  # total of 7 uops on SnB-family

That would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but that's still worse than:

# gcc9.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                # 1 uop
        mov     edi, 2863311531         # 1 uop
        imul    rax, rdi                # 1 uop
        shr     rax, 33                 # 1 uop
        ret
                      # total 4 uops, not counting ret

Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodable. With 64-bit operand-size, that immediate would be sign-extended and mean 0xFFFFFFFFAAAAAAAB.
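The encoding constraint can be mirrored in C. This is only a sketch: sign_extend_imm32 and fits_in_simm32 are hypothetical helper names, and the narrowing cast to int32_t assumes the usual two's complement behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* What happens to a 32-bit immediate when the operand size is 64 bits:
   it gets sign-extended. (The int32_t cast assumes two's complement wraparound.) */
uint64_t sign_extend_imm32(uint32_t imm) {
    return (uint64_t)(int64_t)(int32_t)imm;
}

/* A 64-bit constant is usable as an imm32 only if sign-extending its
   low 32 bits gives the same constant back. */
bool fits_in_simm32(uint64_t value) {
    return sign_extend_imm32((uint32_t)value) == value;
}
```

Here fits_in_simm32(0xAAAAAAAB) is false, because bit 31 is set and the extension fills the upper half with ones. That is why the compiler needs a separate mov to materialize the constant in a register.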

  3. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.

It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.

Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.

Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
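A small C sketch of that point, using hypothetical helpers full_mul_unsigned and full_mul_signed to model the widening one-operand mul and imul on the same 32-bit patterns. The cast to int32_t assumes two's complement conversion.

```c
#include <stdint.h>

/* 32x32 => 64-bit product with both inputs treated as unsigned
   (what one-operand mul computes). */
uint64_t full_mul_unsigned(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Same bit patterns, but both inputs reinterpreted as signed
   (what one-operand imul computes). Assumes two's complement casts. */
uint64_t full_mul_signed(uint32_t a, uint32_t b) {
    int64_t product = (int64_t)(int32_t)a * (int32_t)b;
    return (uint64_t)product;
}
```

For a = 0xAAAAAAAB and b = 15, both products have 5 in the low 32 bits. Only the high halves differ: 0x0000000A for the unsigned product and 0xFFFFFFFB for the signed one. A multiply that keeps only the low half therefore needs no signed/unsigned distinction.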

  4. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):

See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
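To make the difference between a 32-bit and a 33-bit shift concrete, here is a sketch comparing two candidates. div3_shift32 is a hypothetical attempt that uses ceil(2^32 / 3) = 0x55555556 and keeps only the high half. It is just one natural choice of constant, picked for illustration. div3_shift33 uses the constant and shift count from the compiler output above. Both function names are made up.

```c
#include <stdint.h>

/* Hypothetical attempt: multiplier ceil(2^32 / 3), keep only the high 32 bits.
   The multiplier is slightly too large, and with only 32 fractional bits the
   accumulated error reaches 1/3 once x >= 2^31, so some results round up. */
uint32_t div3_shift32(uint32_t x) {
    return (uint32_t)(((uint64_t)x * UINT64_C(0x55555556)) >> 32);
}

/* Constant and shift from the compiler output: ceil(2^33 / 3) = 0xAAAAAAAB, shift 33.
   The extra fractional bit keeps the error below 1/3 for every 32-bit x. */
uint32_t div3_shift33(uint32_t x) {
    return (uint32_t)(((uint64_t)x * UINT64_C(0xAAAAAAAB)) >> 33);
}
```

For x = 2147483648 (2^31), x / 3 is 715827882. div3_shift32 returns 715827883, while div3_shift33 returns the correct 715827882.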

For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed

// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}

Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
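A sketch for reproducing that number. div10 is repeated from the snippet above so the block compiles on its own; first_div10_mismatch is a hypothetical helper, intended for non-negative ranges only.

```c
#include <stdint.h>

/* The approximate division from the snippet above. */
int32_t div10(int32_t dividend) {
    int64_t invDivisor = 0x1999999A;
    return (int32_t)((invDivisor * dividend) >> 32);
}

/* Hypothetical helper: first x in [lo, hi] where div10 disagrees with x / 10,
   or -1 if there is none. Uses a 64-bit counter so hi == INT32_MAX is safe. */
int64_t first_div10_mismatch(int32_t lo, int32_t hi) {
    for (int64_t x = lo; x <= hi; x++) {
        if (div10((int32_t)x) != (int32_t)x / 10) {
            return x;
        }
    }
    return -1;
}
```

Scanning from 0 takes roughly a billion iterations. A narrow window such as first_div10_mismatch(1073741800, 1073741900) is enough to confirm the value quoted above.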


From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.

As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.

unsigned div3_exact_only(unsigned x) {
    __builtin_assume(x % 3 == 0);  // or an equivalent with if() __builtin_unreachable()
    return x / 3;
}

Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because exact divisibility is rarely a known fact. I don't know if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.

div3_exact_only:
    imul  eax, edi, 0xAAAAAAAB        # 1 uop, 3c latency
    ret

However, it is something you can do yourself in source code, at least for known type widths like uint32_t:

uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}
