How can I vectorize my loop with g++?
The introductory links I found while searching:
- 6.59.14 Loop-Specific Pragmas
- 2.100 Pragma Loop_Optimize
- How to give hint to gcc about loop count
- Tell gcc to specifically unroll a loop
- How to Enforce Vectorization in C++
As you can see, most of them are for C, but I thought they might work in C++ as well. Here is my code:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
         size_t end, const std::vector<T> &p2) {
    typename std::vector<T>::const_iterator it2 = p2.begin();
    //#pragma simd
    //#pragma omp parallel for
    //#pragma GCC ivdep Unroll Vector
    for (size_t i = start; i < end; ++i, ++it2) {
        p1[i] = p1[i] - *it2;
        p1[i] += 1;
    }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v, u;
    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(v, 0, n, u);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}
I used all the hints you can see commented out above, but I did not get any speedup, as the sample output shows (for the first run, #pragma GCC ivdep Unroll Vector was uncommented):
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.
Is there any hope? Or does the -O3 optimization flag just do the trick? Any suggestions to speed up this code (the foo function) are welcome!
My version of g++:
samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
Notice that the body of the loop is random. I am not interested in rewriting it in some other form.
EDIT
An answer saying that there is nothing more that can be done is also acceptable!
Recommended answer
The -O3 flag turns on -ftree-vectorize automatically. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options
So in both cases the compiler is trying to do loop vectorization.
Using g++ 4.8.2 to compile with:
# In newer versions of GCC, use -fopt-info-vec (or -fopt-info-vec-missed to list loops that were not vectorized) instead of -ftree-vectorizer-verbose=1
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test
Gives this:
Analyzing loop at test.cpp:16
Vectorizing loop at test.cpp:16
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.
test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29
test.cpp:22: note: vectorized 1 loops in function.
test.cpp:18: note: Unroll loop 7 times
test.cpp:16: note: Unroll loop 7 times
test.cpp:28: note: Unroll loop 1 times
Compiling without the -ftree-vectorize flag:
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test
Returns only this:
test_old.cpp:16: note: Unroll loop 7 times
test_old.cpp:28: note: Unroll loop 1 times
Line 16 is where the loop in the function starts, so the compiler is definitely vectorizing it. Checking the generated assembly confirms this too.
I seem to be getting some aggressive caching on the laptop I'm currently using which is making it very hard to accurately measure how long the function takes to run.
But here are a couple of other things you can try too:
- Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
- Tell the compiler that the arrays are aligned with __builtin_assume_aligned (not portable).
Here's my resulting code (I removed the template, since you will want to use a different alignment for different data types):
#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
    // Promise the compiler that both arrays are 16-byte aligned (GCC builtin, not portable).
    double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
    double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

    for (size_t i = start; i < end; ++i)
    {
        pA1[i] = pA1[i] - pA2[i];
        pA1[i] += 1;
    }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v, u;
    for (size_t i = 0; i < n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    // Pass raw pointers to the vectors' contiguous storage.
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}
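If you would rather keep the vector-based signature from the question, a thin wrapper around a pointer kernel might look like the sketch below. The names foo_kernel and the wrapper layout are my own assumptions, not part of the code above. Note that the question's loop reads p2 from its beginning while indexing p1 from start, so this sketch passes p1.data() + start and p2.data(). It uses only __restrict__ and omits __builtin_assume_aligned, because p1.data() + start is not guaranteed to be 16-byte aligned for an arbitrary start.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: plain pointers qualified with __restrict__ (a GCC/Clang
// extension), so the compiler may assume the two ranges do not overlap.
template <typename T>
void foo_kernel(T* __restrict__ dst, const T* __restrict__ src, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = dst[i] - src[i];
        dst[i] += 1;
    }
}

// Hypothetical wrapper with the same signature as the question's foo().
// p1 is indexed from `start`, p2 from its first element, as in the question.
template <typename T>
void foo(std::vector<T>& p1, std::size_t start, std::size_t end,
         const std::vector<T>& p2)
{
    assert(start <= end && end <= p1.size());
    assert(end - start <= p2.size());
    assert(&p1 != &p2);  // passing the same vector twice would break the restrict promise
    foo_kernel(p1.data() + start, p2.data(), end - start);
}
```

Calling foo(v, 0, n, u) then works exactly as in the question's main(). Whether this is any faster than the plain vector version is something you would have to measure.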
Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
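One common way to get steadier numbers is to run the function several times and keep the fastest run, so that one-off effects such as page faults or a cold cache do not dominate. The sketch below is only an illustration of that idea: best_time_seconds is a hypothetical helper, not something from the code above, and the loop body is repeated (without the alignment hint) so that the example is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Same loop body as foo() above, repeated so this sketch compiles on its own.
void foo(double* __restrict__ p1, const double* __restrict__ p2,
         std::size_t start, std::size_t end)
{
    for (std::size_t i = start; i < end; ++i) {
        p1[i] = p1[i] - p2[i];
        p1[i] += 1;
    }
}

// Hypothetical helper: runs foo() `repeats` times, each time on a fresh copy
// of v0, and returns the fastest run in seconds. The last result is left in
// `work`, so the caller can inspect it and the computation cannot be discarded.
double best_time_seconds(const std::vector<double>& v0,
                         const std::vector<double>& u,
                         int repeats,
                         std::vector<double>& work)
{
    assert(repeats > 0 && u.size() >= v0.size());
    using clock = std::chrono::steady_clock;  // monotonic, suited to measuring intervals
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        work = v0;  // reset the input; this also touches the memory before timing starts
        clock::time_point t1 = clock::now();
        foo(work.data(), u.data(), 0, work.size());
        clock::time_point t2 = clock::now();
        best = std::min(best, std::chrono::duration<double>(t2 - t1).count());
    }
    return best;
}
```

In main() you could then print best_time_seconds(v, u, 10, work) instead of a single measurement; the repeat count of 10 is an arbitrary choice.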