Auto-vectorizing a comparison

2022-03-16 00:00:00 vectorization c++ avx2

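I'm having trouble getting g++ 5.4 to vectorize a comparison. Basically, I want to compare 4 unsigned integers using vectorization. My first approach was straightforward: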

bool compare(unsigned int const pX[4]) {
    bool c1 = (pX[0] < 1);
    bool c2 = (pX[1] < 2);
    bool c3 = (pX[2] < 3);
    bool c4 = (pX[3] < 4);
    return c1 && c2 && c3 && c4;
}

Compiling with g++ -std=c++11 -Wall -O3 -funroll-loops -march=native -mtune=native -ftree-vectorize -msse -msse2 -ffast-math -fopt-info-vec-missed tells me that it could not vectorize the comparison because the data is misaligned:

main.cpp:5:17: note: not vectorized: failed to find SLP opportunities in basic block.
main.cpp:5:17: note: misalign = 0 bytes of ref MEM[(const unsigned int *)&x]
main.cpp:5:17: note: misalign = 4 bytes of ref MEM[(const unsigned int *)&x + 4B]
main.cpp:5:17: note: misalign = 8 bytes of ref MEM[(const unsigned int *)&x + 8B]
main.cpp:5:17: note: misalign = 12 bytes of ref MEM[(const unsigned int *)&x + 12B]

So my second attempt was to tell g++ to align the data by using a temporary array:

bool compare(unsigned int const pX[4] ) {
    unsigned int temp[4] __attribute__ ((aligned(16)));
    temp[0] = pX[0];
    temp[1] = pX[1];
    temp[2] = pX[2];
    temp[3] = pX[3];

    bool c1 = (temp[0] < 1);
    bool c2 = (temp[1] < 2);
    bool c3 = (temp[2] < 3);
    bool c4 = (temp[3] < 4); 
    return c1 && c2 && c3 && c4;
}

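However, the output is the same. My CPU supports AVX2, and the Intel Intrinsics Guide tells me that, for example, _mm256_cmpgt_epi8/16/32/64 can do the comparison. Any idea how to tell g++ to use this?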
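
For reference, a hand-written version of this four-lane comparison might look like the sketch below. It is only an illustration, not something g++ was observed to generate. The name compare_sse2 is made up, and the sketch uses 128-bit SSE2 intrinsics rather than the 256-bit AVX2 ones, on the assumption that four 32-bit values fit in a single 128-bit register. SSE2 integer compares are signed, so the sign bit of both operands is flipped to get unsigned ordering. A scalar fallback is included for targets without SSE2.

```cpp
#include <climits>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: true when pX[i] < i + 1 holds for all four elements.
bool compare_sse2(unsigned int const pX[4]) {
#if defined(__SSE2__)
    // Unaligned load, so the caller's pointer needs no particular alignment.
    __m128i x = _mm_loadu_si128(reinterpret_cast<__m128i const*>(pX));
    __m128i limits = _mm_setr_epi32(1, 2, 3, 4);

    // XOR with 0x80000000 maps unsigned order onto signed order.
    __m128i bias = _mm_set1_epi32(INT_MIN);
    __m128i lt = _mm_cmplt_epi32(_mm_xor_si128(x, bias),
                                 _mm_xor_si128(limits, bias));

    // movemask collects the top bit of each of the 16 bytes;
    // 0xFFFF means every lane compared true.
    return _mm_movemask_epi8(lt) == 0xFFFF;
#else
    // Scalar fallback for targets without SSE2.
    return pX[0] < 1 && pX[1] < 2 && pX[2] < 3 && pX[3] < 4;
#endif
}
```

The unaligned load sidesteps the alignment question entirely, at the cost of writing the intrinsics by hand instead of relying on the auto-vectorizer.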

Solution

Well, apparently the compiler doesn't like "unrolled loops". This works for me:

bool compare(signed int const pX[8]) {
    signed int const w[] __attribute__((aligned(32))) = {1,2,3,4,5,6,7,8};
    signed int out[8] __attribute__((aligned(32)));

    for (unsigned int i = 0; i < 8; ++i) {
        out[i] = (pX[i] <= w[i]);
    }

    bool temp = true;
    for (unsigned int i = 0; i < 8; ++i) {
        temp = temp && out[i];
        if (!temp) {
            return false;
        }
    }
    return true;
}

Note that out is also of type signed int. Now I just need a fast way to combine the results stored in out.
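
One possible way to combine the results, sketched under the assumption that the early return is not needed: accumulate with a bitwise AND so the reduction loop has no branches. The name compare_all is hypothetical, and the comparison (pX[i] <= w[i]) is the one from the code above. Whether a given g++ version vectorizes this loop is not guaranteed. -fopt-info-vec will report what happened.

```cpp
// Hypothetical variant: same per-element test as above, but the results are
// folded with bitwise AND instead of && plus an early return.
bool compare_all(int const pX[8]) {
    int const w[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    int all = 1;
    for (int i = 0; i < 8; ++i) {
        // Each comparison yields 0 or 1; a single 0 clears the accumulator.
        all &= (pX[i] <= w[i]);
    }
    return all != 0;
}
```

Because there is no data-dependent exit, the compare and the reduction can live in a single loop, and the intermediate out array is no longer needed.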
