x86 上两个 128 位整数的高效乘法/除法(非 64 位)

2022-01-06 00:00:00 algorithm c++ bignum x86

编译器:MinGW/GCC
问题:不允许使用 GPL/LGPL 代码(GMP 或任何与此相关的 bignum 库,对于这个问题来说太过分了,因为我已经实现了该类).

Compiler: MinGW/GCC
Issues: No GPL/LGPL code allowed (GMP or any bignum library for that matter, is overkill for this problem, as I already have the class implemented).

我已经构建了自己的 128 位 固定大小的大整数类(旨在用于游戏引擎,但可以推广到任何用例),并且我发现了当前乘法的性能并且除法运算非常糟糕(是的,我已经对它们进行了计时,见下文),并且我想改进(或更改)执行低级数字运算的算法.

I have constructed my own 128-bit fixed-size big integer class (intended for use in a game engine but may be generalized to any usage case) and I find the performance of the current multiply and divide operations to be quite abysmal (yes, I have timed them, see below), and I'd like to improve on (or change) the algorithms that do the low-level number crunching.

说到乘法和除法运算符,与类中的几乎所有其他操作相比,它们的速度慢得令人难以忍受.

When it comes to the multiply and divide operators, they are unbearably slow compared to just about everything else in the class.

这些是相对于我自己的计算机的近似测量值:

These are the approximate measurements relative to my own computer:

Raw times as defined by QueryPerformanceFrequency:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~546u
Division:           ~4760u (with maximum bit count)

如您所见,仅进行乘法运算比加法或减法慢很多很多倍.除法大约比乘法慢 10 倍.

As you can see, just doing the multiplication is many, many times slower than add or subtract. Division is about 10 times slower than multiplication.

我想提高这两个算子的速度,因为每帧可能要进行大量计算(点积、各种碰撞检测方法等).

I'd like to improve the speed of these two operators because there may be a very high amount of computations being done per frame (dot products, various collision detection methods, etc).

结构(省略方法)看起来有点像:

The structure (methods omitted) looks somewhat like:

class uint128_t
{
    public:
        unsigned long int dw3, dw2, dw1, dw0;
  //...
}

乘法目前使用典型的长乘方法(在汇编中以便我可以捕获EDX输出)同时忽略超出范围的单词(也就是说,我只做了 10 个 mull 与 16 个相比).

Multiplication is currently done using the typical long-multiplication method (in assembly so that I can catch the EDX output) while ignoring the words that go out of range (that is, I'm only doing 10 mull's compared to 16).

除法使用移位减算法(速度取决于操作数的位数).但是,它不是在组装中完成的.我发现这有点太难了,决定让编译器优化它.

Division uses the shift-subtract algorithm (speed depends on bit counts of the operands). However, it is not done in assembly. I found that a little too difficult to muster and decided to let the compiler optimize it.

我用 Google 搜索了好几天,查看描述算法的页面,例如 Karatsuba Multiplication、高基数划分和 牛顿-拉普森划分 但是数学符号对我来说有点太远了.我想使用其中一些高级方法来加速我的代码,但我必须首先将希腊语"翻译成易于理解的内容.

I have Google'd around for several days looking at pages describing algorithms such as Karatsuba Multiplication, high-radix division, and Newton-Rapson Division but the math symbols are a little too far over my head. I'd like to use some of these advanced methods to speed up my code, but I'd have to translate the "Greek" into something comprehensible first.

对于那些可能认为我的努力过早优化"的人;我认为这段代码是一个瓶颈,因为非常基本的数学运算本身变得很慢.我可以忽略对更高级别代码的此类优化,但此代码将被调用/使用到足够重要.

For those that may deem my efforts "premature optimization"; I consider this code to be a bottleneck because the very elementary-math operations themselves become slow. I can ignore such types of optimization on higher-level code, but this code will be called/used enough for it to matter.

我想要关于我应该使用哪种算法来改进乘法和除法(如果可能)的建议,以及对建议算法如何工作的基本(希望易于理解)解释高度 赞赏.

I'd like suggestions on which algorithm I should use to improve multiply and divide (if possible), and a basic (hopefully easy to understand) explanation on how the suggested algorithm works would be highly appreciated.

我能够通过将代码内联到 operator*= 中来改进乘法运算,而且它看起来尽可能快.

I was able to improve the multiply operation by inlining code into operator*= and it seems as fast as possible.

Updated raw times:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~100u (lowest ~86u, highest around ~256u)
Division:           ~4760u (with maximum bit count)

这里有一些简单的代码供您检查(请注意,我的类型名称实际上不同,为了简单起见对其进行了编辑):

Here's some bare-bones code for you to examine (note that my type names are actually different, this was edited for simplicity):

//File: "int128_t.h"
class int128_t
{
    uint32_t dw3, dw2, dw1, dw0;

    // Various constrctors, operators, etc...

    int128_t& operator*=(const int128_t&  rhs) __attribute__((always_inline))
    {
        int128_t Urhs(rhs);
        uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
        uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
        uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
        dw0 ^= lhs_xor_mask;
        dw1 ^= lhs_xor_mask;
        dw2 ^= lhs_xor_mask;
        dw3 ^= lhs_xor_mask;
        Urhs.dw0 ^= rhs_xor_mask;
        Urhs.dw1 ^= rhs_xor_mask;
        Urhs.dw2 ^= rhs_xor_mask;
        Urhs.dw3 ^= rhs_xor_mask;
        *this += (lhs_xor_mask & 1);
        Urhs += (rhs_xor_mask & 1);

        struct mul128_t
        {
            int128_t dqw1, dqw0;
            mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
        };

        mul128_t data(Urhs,*this);
        asm volatile(
        "push      %%ebp                            

        movl       %%eax,   %%ebp                   

        movl       $0x00,   %%ebx                   

        movl       $0x00,   %%ecx                   

        movl       $0x00,   %%esi                   

        movl       $0x00,   %%edi                   

        movl   28(%%ebp),   %%eax #Calc: (dw0*dw0)  

        mull             12(%%ebp)                  

        addl       %%eax,   %%ebx                   

        adcl       %%edx,   %%ecx                   

        adcl       $0x00,   %%esi                   

        adcl       $0x00,   %%edi                   

        movl   24(%%ebp),   %%eax #Calc: (dw1*dw0)  

        mull             12(%%ebp)                  

        addl       %%eax,   %%ecx                   

        adcl       %%edx,   %%esi                   

        adcl       $0x00,   %%edi                   

        movl   20(%%ebp),   %%eax #Calc: (dw2*dw0)  

        mull             12(%%ebp)                  

        addl       %%eax,   %%esi                   

        adcl       %%edx,   %%edi                   

        movl   16(%%ebp),   %%eax #Calc: (dw3*dw0)  

        mull             12(%%ebp)                  

        addl       %%eax,   %%edi                   

        movl   28(%%ebp),   %%eax #Calc: (dw0*dw1)  

        mull              8(%%ebp)                  

        addl       %%eax,   %%ecx                   

        adcl       %%edx,   %%esi                   

        adcl       $0x00,   %%edi                   

        movl   24(%%ebp),   %%eax #Calc: (dw1*dw1)  

        mull              8(%%ebp)                  

        addl       %%eax,   %%esi                   

        adcl       %%edx,   %%edi                   

        movl   20(%%ebp),   %%eax #Calc: (dw2*dw1)  

        mull              8(%%ebp)                  

        addl       %%eax,   %%edi                   

        movl   28(%%ebp),   %%eax #Calc: (dw0*dw2)  

        mull              4(%%ebp)                  

        addl       %%eax,   %%esi                   

        adcl       %%edx,   %%edi                   

        movl   24(%%ebp),  %%eax #Calc: (dw1*dw2)   

        mull              4(%%ebp)                  

        addl       %%eax,   %%edi                   

        movl   28(%%ebp),   %%eax #Calc: (dw0*dw3)  

        mull               (%%ebp)                  

        addl       %%eax,   %%edi                   

        pop        %%ebp                            
"
        :"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
        :"a"(&data):"%ebp");

        dw0 ^= result_xor_mask;
        dw1 ^= result_xor_mask;
        dw2 ^= result_xor_mask;
        dw3 ^= result_xor_mask;
        return (*this += (result_xor_mask & 1));
    }
};

至于除法,检查代码毫无意义,因为我需要更改数学算法才能看到任何实质性的好处.唯一可行的选择似乎是高基数除法,但我还没有确定(在我的脑海中)它如何工作.

As for division, examining the code is rather pointless, as I will need to change the mathematical algorithm to see any substantial benefits. The only feasible choice seems to be high-radix division, but I have yet to iron out (in my mind) just how it will work.

推荐答案

我不会太担心乘法.你在做什么似乎很有效率.在 Karatsuba 乘法中,我并没有真正遵循希腊语,但我的感觉是,只有处理比您处理的数字大得多的数字,它才会更有效率.

I wouldn't worry much about multiplication. What you're doing seems quite efficient. I didn't really follow the Greek on the Karatsuba Multiplication, but my feeling is that it would be more efficient only with much larger numbers than you're dealing with.

我的一个建议是尝试使用最小的内联汇编块,而不是在汇编中编码逻辑.你可以写一个函数:

One suggestion I do have is to try to use the smallest blocks of inline assembly, rather than coding your logic in assembly. You could write a function:

struct div_result { u_int x[2]; };
static inline void mul_add(int a, int b, struct div_result *res);

该函数将在内联汇编中实现,您将从 C++ 代码中调用它.它应该与纯汇编一样高效,并且更容易编码.

The function would be implemented in inline assembly, and you'll call it from C++ code. It should be as efficient as pure assembly, and much easier to code.

关于除法,我不知道.我看到的大多数算法都在谈论渐进效率,这可能意味着它们仅对非常多的位有效.

About division, I don't know. Most algorithms I saw talk about asymptotic efficiency, which may mean they're efficient only for very high numbers of bits.

相关文章