How do I vectorize my loops with g++?
The introductory links I found while searching:
- 6.59.14 Loop-Specific Pragmas
- 2.100 Pragma Loop_Optimize
- How to give hint to gcc about loop count
- Tell gcc to specifically unroll a loop
- How to Force Vectorization in C++
As you can see, most of them are for C, but I thought they might work in C++ as well. Here is my code:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
         size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main()
{
  size_t n;
  double x,y;
  n = 12800000;
  std::vector<double> v,u;
  for(size_t i=0; i<n; ++i) {
    x = i;
    y = i - 1;
    v.push_back(x);
    u.push_back(y);
  }
  using namespace std::chrono;
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  foo(v,0,n,u);
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  std::cout << "It took me " << time_span.count() << " seconds.";
  std::cout << std::endl;
  return 0;
}
I used all the hints that appear commented out above, but I did not get any speedup, as the sample output shows (the first run was made with #pragma GCC ivdep Unroll Vector uncommented):
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.
Is there any hope? Or does the -O3 optimization flag already do the trick? Any suggestions to speed up this code (the foo function) are welcome!
My version of g++:
samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
Note that the body of the loop is arbitrary. I am not interested in rewriting it in some other form.
EDIT
An answer saying that there is nothing more that can be done is also acceptable!
Recommended answer
The -O3 flag turns on -ftree-vectorize automatically. From https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html:
-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options
So in both cases the compiler is trying to do loop vectorization.
Using g++ 4.8.2 to compile with:
# In newer versions of GCC use -fopt-info-vec-missed instead of -ftree-vectorizer-verbose=1
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test
gives this:
Analyzing loop at test.cpp:16
Vectorizing loop at test.cpp:16
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.
test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29
test.cpp:22: note: vectorized 1 loops in function.
test.cpp:18: note: Unroll loop 7 times
test.cpp:16: note: Unroll loop 7 times
test.cpp:28: note: Unroll loop 1 times
Compiling without the -ftree-vectorize flag:
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test
returns only this:
test_old.cpp:16: note: Unroll loop 7 times
test_old.cpp:28: note: Unroll loop 1 times
Line 16 is the start of the loop in the function, so the compiler is definitely vectorizing it. Checking the generated assembly confirms this too.
I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to measure accurately how long the function takes to run.
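One common way to reduce this kind of measurement noise is to run the function several times and keep the fastest run. The following is only a sketch under that assumption: best_of and kernel are hypothetical names that do not appear in the original post, and kernel simply repeats the loop body from the question on raw pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from the original post): calls `work` `repeats`
// times and returns the shortest wall-clock time of a single call, in seconds.
template <typename F>
double best_of(std::size_t repeats, F work) {
    assert(repeats > 0);
    typedef std::chrono::steady_clock clock;  // monotonic, suited to intervals
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t r = 0; r < repeats; ++r) {
        const clock::time_point t1 = clock::now();
        work();
        const clock::time_point t2 = clock::now();
        best = std::min(best, std::chrono::duration<double>(t2 - t1).count());
    }
    return best;
}

// Hypothetical stand-in for foo: the loop body from the question.
void kernel(double* p1, const double* p2, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        p1[i] = p1[i] - p2[i] + 1;
    }
}
```

It could be called as best_of(10, [&] { kernel(v.data(), u.data(), n); }) with the vectors from the question. The kernel updates v in place, so later repetitions see different values, but the amount of work per call stays the same.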
But here are a couple of other things you can try too:
- Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
- Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable).
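The alignment hint is a promise to the compiler, and the behavior is undefined if the pointer is not actually aligned as claimed. A small check like the one below can guard that promise in a debug build. This is a sketch only: is_aligned is a hypothetical helper that is not part of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when `p` sits on an `alignment`-byte
// boundary. `alignment` must be non-zero.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For example, assert(is_aligned(&v[0], 16)) could be placed before passing the buffer of a std::vector<double> to a function that assumes 16-byte alignment. Whether that buffer is 16-byte aligned depends on the allocator, so it should not be taken for granted.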
Here's my resulting code (I removed the template since you will want to use different alignment for different data types):
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// __restrict__ promises the compiler that p1 and p2 never overlap.
// Note: p2 is indexed by i here, which matches the original iterator-based
// loop only when start == 0.
void foo( double * __restrict__ p1,
          const double * __restrict__ p2,
          size_t start,
          size_t end )
{
  // Promise 16-byte alignment (GCC builtin, not portable).
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  const double* pA2 = static_cast<const double*>(__builtin_assume_aligned(p2, 16));
  for (size_t i = start; i < end; ++i)
  {
    pA1[i] = pA1[i] - pA2[i];
    pA1[i] += 1;
  }
}

int main()
{
  size_t n;
  double x, y;
  n = 12800000;
  std::vector<double> v,u;
  for(size_t i=0; i<n; ++i) {
    x = i;
    y = i - 1;
    v.push_back(x);
    u.push_back(y);
  }
  using namespace std::chrono;
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  foo(&v[0], &u[0], 0, n );
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  std::cout << "It took me " << time_span.count() << " seconds.";
  std::cout << std::endl;
  return 0;
}
Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
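If keeping the template matters, a variant based on the pragma route from the question might look like the sketch below. It assumes GCC 4.9 or later, where #pragma GCC ivdep is available (the 4.8 releases used above do not support it), and the pragma takes no extra words such as "Unroll Vector". foo_ivdep is a hypothetical name that does not come from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical templated variant: same arithmetic as the question's foo,
// with p2 read from its beginning while p1 is walked over [start, end).
template <typename T>
void foo_ivdep(std::vector<T>& p1, std::size_t start, std::size_t end,
               const std::vector<T>& p2) {
    assert(start <= end && end <= p1.size());
    assert(end - start <= p2.size());
    T* a = p1.data();
    const T* b = p2.data();
    // Asserts to GCC that the loop has no loop-carried dependencies.
    // The caller must make sure p1 and p2 are distinct vectors.
#pragma GCC ivdep
    for (std::size_t i = start; i < end; ++i) {
        a[i] = a[i] - b[i - start] + 1;
    }
}
```

A call such as foo_ivdep(v, 0, n, u) mirrors the call in the question. Since the compiler already vectorized the original loop by adding a runtime alias check, the pragma would at most remove that check, so any speedup is not guaranteed.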