在计算中使用布尔值来避免分支

这是我想出的一点微优化好奇心:

Here's a little micro-optimization curiosity that I came up with:

struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};

这两种方法似乎实际上做同样的事情.第二个版本是否避免了分支(因此比第一个版本更快),或者任何编译器都能够使用 -O3 进行这种优化?

It seems the two methods practically do the same thing. Does the second version avoid a branch (and consequently, is faster than the first version), or is any compiler able to do this kind of optimization with -O3?

推荐答案

是的,你的技巧可以避免分支,而且它可以让它更快......有时.

Yes, your trick allows to avoid branch and it makes it faster... sometimes.

我编写了基准来比较各种情况下的这些解决方案,以及我自己的:

I wrote benchmark that compares these solutions in various situations, along with my own:

ticks += mStepSize & -static_cast<int>(running)

我的结果如下:

Off:
 branch: 399949150
 mul:    399940271
 andneg: 277546678
On:
 branch: 204035423
 mul:    399937142
 andneg: 277581853
Pattern:
 branch: 327724860
 mul:    400010363
 andneg: 277551446
Random:
 branch: 915235440
 mul:    399916440
 andneg: 277537411

Off 是关闭计时器的时间.在这种情况下,解决方案大约需要相同的时间.

Off is when timers are turned off. In this cases solutions take about the same time.

On 是它们打开的时间.分支解决方案快两倍.

On is when they are turned on. Branching solution two times faster.

Pattern 是当它们处于 100110 模式时.性能相似,但分支要快一些.

Pattern is when they are in 100110 pattern. Performance is similar, but branching is a bit faster.

Random 是指分支不可预测的情况.在这种情况下,乘法的速度要快 2 倍以上.

Random is when branch is unpredictable. In this case multiplications is more than 2 times faster.

在所有情况下,我的 bit-hacking 技巧都是最快的,除了 On 分支获胜.

In all cases my bit-hacking trick is fastest, except for On where branching wins.

请注意,此基准不一定代表所有编译器版本的处理器等.即使是基准的微小变化也会使结果颠倒(例如,如果编译器可以内联知道 mStepSize1 比乘法实际上最快).

Note that this benchmark is not necessarily representative for all compiler versions processors etc. Even small changes of benchmark can turn results upside down (for example if compiler can inline knowing mStepSize is 1 than multiplication can be actually fastest).

基准代码:

#include<array>
#include<iostream>
#include<chrono>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }

    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul:    " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;

    run(timers, step); // warm up
    std::cout << "Off:
";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:
";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:
";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand()&1;
    std::cout << "Random:
";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}

相关文章