How can I vectorize my loop with g++?

2022-01-23 00:00:00 optimization vectorization g++ c++ loop-unrolling

The introductory links I found while searching:

6.59.14 Loop-Specific Pragmas
2.100 Pragma Loop_Optimize
How to give a hint to gcc about the loop count
Tell gcc to specifically unroll a loop
How to force vectorization in C++

As you can see, most of them are for C, but I thought they might work in C++ as well. Here is my code:

#include <chrono>
#include <iostream>
#include <vector>

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start, size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main() {
  size_t n;
  double x,y;
  n = 12800000;
  std::vector<double> v,u;
  for(size_t i=0; i<n; ++i) {
    x = i;
    y = i - 1;
    v.push_back(x);
    u.push_back(y);
  }
  using namespace std::chrono;

  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  foo(v,0,n,u);
  high_resolution_clock::time_point t2 = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

  std::cout << "It took me " << time_span.count() << " seconds.";
  std::cout << std::endl;
  return 0;
}

I tried all the hints that are commented out above, but I did not get any speedup, as the sample output shows (for the first run, the line #pragma GCC ivdep Unroll Vector was uncommented):

samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.

Is there any hope? Or does the -O3 optimization flag already do the trick? Any suggestions to speed up this code (the foo function) are welcome!

My version of g++:

samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

Note that the body of the loop is arbitrary. I am not interested in rewriting it in some other form.

EDIT

An answer saying that nothing more can be done is also acceptable!

Recommended answer

The -O3 flag turns on -ftree-vectorize automatically: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

So in both cases the compiler is trying to do loop vectorization.

Compiling with g++ 4.8.2 using:

# In newer versions of GCC use -fopt-info-vec-missed instead of -ftree-vectorizer-verbose
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

gives this:

Analyzing loop at test.cpp:16
Vectorizing loop at test.cpp:16
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.
test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29
test.cpp:22: note: vectorized 1 loops in function.
test.cpp:18: note: Unroll loop 7 times
test.cpp:16: note: Unroll loop 7 times
test.cpp:28: note: Unroll loop 1 times

Compiling without the -ftree-vectorize flag:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

returns only this:

test_old.cpp:16: note: Unroll loop 7 times
test_old.cpp:28: note: Unroll loop 1 times

Line 16 is the start of the loop in the function, so the compiler is definitely vectorizing it. Checking the generated assembly confirms this too.

I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to measure accurately how long the function takes to run.
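One way to reduce that kind of measurement noise is to repeat the measurement and keep the fastest run. The following is only a sketch under stated assumptions: best_of and kernel are hypothetical names that do not appear in the original post, and std::chrono::steady_clock is swapped in for high_resolution_clock because it is guaranteed to be monotonic.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper (not from the original post): calls `work` `repeats`
// times and returns the fastest wall-clock time in seconds.
template <typename F>
double best_of(int repeats, F work) {
    typedef std::chrono::steady_clock clock;
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const clock::time_point t1 = clock::now();
        work();
        const clock::time_point t2 = clock::now();
        const double seconds = std::chrono::duration<double>(t2 - t1).count();
        best = std::min(best, seconds);
    }
    return best;
}

// Same loop body as in the question, written against raw pointers.
void kernel(double* p1, const double* p2, std::size_t start, std::size_t end) {
    for (std::size_t i = start; i < end; ++i) {
        p1[i] = p1[i] - p2[i];
        p1[i] += 1;
    }
}
```

For example, best_of(10, [&] { kernel(&v[0], &u[0], 0, n); }) would time ten passes over the vectors from the question. Reading an element of v afterwards makes it less likely that the compiler discards the work.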

But here are a couple of other things you can try:

Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.

Tell the compiler that the arrays are aligned, using __builtin_assume_aligned (not portable).

Here's my resulting code (I removed the template, since you will want to use a different alignment for different data types):

#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

  for (size_t i = start; i < end; ++i)
  {
    pA1[i] = pA1[i] - pA2[i];
    pA1[i] += 1;
  }
}

int main()
{
  size_t n;
  double x, y;
  n = 12800000;
  std::vector<double> v,u;

  for(size_t i=0; i<n; ++i) {
    x = i;
    y = i - 1;
    v.push_back(x);
    u.push_back(y);
  }

  using namespace std::chrono;

  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  foo(&v[0], &u[0], 0, n );
  high_resolution_clock::time_point t2 = high_resolution_clock::now();

  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

  std::cout << "It took me " << time_span.count() << " seconds.";
  std::cout << std::endl;

  return 0;
}
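The 16-byte alignment promised to __builtin_assume_aligned is an assumption about the buffers: if a pointer is not actually aligned that way, the behavior is undefined. A possible way to guard the promise is sketched below. The names is_aligned and foo_checked are hypothetical and not part of the original answer, and the code still relies on the GCC-specific __restrict__ and __builtin_assume_aligned extensions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when the address in `p` is a multiple of
// `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Variant of the answer's foo that only makes the alignment promise to the
// compiler after checking it at run time.
void foo_checked(double* __restrict__ p1, const double* __restrict__ p2,
                 std::size_t start, std::size_t end) {
    if (is_aligned(p1, 16) && is_aligned(p2, 16)) {
        double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
        const double* pA2 =
            static_cast<const double*>(__builtin_assume_aligned(p2, 16));
        for (std::size_t i = start; i < end; ++i) {
            pA1[i] = pA1[i] - pA2[i];
            pA1[i] += 1;
        }
    } else {
        // Fallback: same loop without any alignment promise.
        for (std::size_t i = start; i < end; ++i) {
            p1[i] = p1[i] - p2[i];
            p1[i] += 1;
        }
    }
}
```

The check costs one branch per call, which should be negligible next to a loop over millions of elements.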

Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
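As a side note on the pragma tried in the question: GCC documents the loop pragma as plain #pragma GCC ivdep, and it is only available from GCC 4.9 onwards, so g++ 4.8.1 would not act on it. A minimal sketch of how it could be applied to the same loop body follows. The name foo_ivdep is hypothetical, and whether the pragma changes the generated code or the run time is not something the original post measured.

```cpp
#include <cstddef>

// Hypothetical variant of foo: the pragma asserts that the loop has no
// loop-carried dependencies that would prevent SIMD execution. It requires
// GCC 4.9 or newer; compilers that do not know the pragma ignore it.
void foo_ivdep(double* p1, const double* p2,
               std::size_t start, std::size_t end) {
#pragma GCC ivdep
    for (std::size_t i = start; i < end; ++i) {
        p1[i] = p1[i] - p2[i];
        p1[i] += 1;
    }
}
```

Like __restrict__, this is a promise from the programmer rather than something the compiler verifies, so it is only safe when the two ranges really do not overlap.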
