Why is OpenMP under Ubuntu 12.04 slower than the serial version?
I've read other questions on this topic, but none of them solved my problem.
I wrote the code below, and both the pthread version and the omp version turned out slower than the serial version. I'm very confused.
Build environment:
Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Vendor ID: AuthenticAMD
CPU family: 18
Model: 1
Stepping: 0
CPU MHz: 800.000
BogoMIPS: 3593.36
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
Compile command:
g++ -std=c++11 ./eg001.cpp -fopenmp
#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5

const int sizen = 256000000;

struct Data {
    double * pSinTable;
    long tid;
};

void * compute(void * p) {
    Data * pDt = (Data *)p;
    const int start = sizen * pDt->tid/NUM_THREADS;
    const int end = sizen * (pDt->tid + 1)/NUM_THREADS;
    for(int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double * sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;
    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for(int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }//for
    pthread_attr_destroy(&attr);
    for(int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }//for
    finish = clock();
    printf("from pthread: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];
    start = clock();
#   pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];
    start = clock();
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    pthread_exit(nullptr);
    return 0;
}
Output:
from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000
I wondered whether my code was the problem, so I used pthread to do the same thing.
However, I was totally wrong, and now I wonder whether this might be a problem with OpenMP/pthread on Ubuntu.
A friend of mine who also has an AMD CPU and Ubuntu 12.04 ran into the same problem, so I have some reason to believe it is not limited to my machine.
If anyone has the same problem, or has a clue about its cause, thanks in advance.
In case the code is not good enough, I also ran a benchmark and pasted the result here:
http://pastebin.com/RquLPREc
Benchmark URL: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html
New information:
I ran the code on Windows (without the pthread version) with VS2012.
I used 1/10 of sizen because Windows would not let me allocate such a large chunk of memory. The results are:
from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this one is the improvement suggested by @FreeNickName)
Does this indicate that it could be a problem with the Ubuntu OS?
The problem was solved by using the omp_get_wtime function, which is portable across operating systems. See the answer by Hristo Iliev.
Some tests on the controversial topic raised by FreeNickName:
(Sorry, I had to test it on Ubuntu because the Windows machine belonged to a friend.)
--1-- Change from delete to delete [] (but without memset) (-std=c++11 -fopenmp):
from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501
--2-- With memset immediately after new (-std=c++11 -fopenmp):
from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723
--3-- With memset immediately after new (-std=c++11 -fopenmp -march=native -O2):
from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779
--4-- With memset immediately after new, and FreeNickName's version placed before the omp for version (-std=c++11 -fopenmp -march=native -O2):
from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979
--5-- With memset immediately after new, FreeNickName's version placed before the omp for version, and NUM_THREADS set to 5 instead of 2 (I'm dual core):
from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101
Recommended answer
There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.
Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake, since it does not return the wall-clock (real) time but the accumulated CPU time of all threads in the process (and on some Unix flavours even the accumulated CPU time of all child processes). Your parallel code shows better performance on Windows because clock() there returns the real time rather than the accumulated CPU time.
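To see the two quantities side by side, the following is a minimal sketch that runs a piece of work once and reports both the clock() figure and a wall-clock figure. The names Timing and measure are hypothetical, and std::chrono::steady_clock is used here only as a convenient wall-clock source for the comparison; it is not part of the original answer.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper type holding the two figures being compared.
struct Timing {
    double cpu_seconds;   // from clock(): CPU time charged to the process
    double wall_seconds;  // from steady_clock: real elapsed time
};

// Hypothetical helper: runs `work` once and measures it with both clocks.
template <typename Work>
Timing measure(Work work) {
    const std::clock_t cpu_start = std::clock();
    // clock() returns (clock_t)-1 when the CPU time is not available.
    assert(cpu_start != static_cast<std::clock_t>(-1));
    const auto wall_start = std::chrono::steady_clock::now();

    work();

    const auto wall_finish = std::chrono::steady_clock::now();
    const std::clock_t cpu_finish = std::clock();

    Timing t;
    t.cpu_seconds =
        static_cast<double>(cpu_finish - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds =
        std::chrono::duration<double>(wall_finish - wall_start).count();
    return t;
}
```

Passing a lambda that wraps the parallel loop from the question to measure() on Linux would be expected to give a cpu_seconds value close to the sum over all busy threads, while wall_seconds shrinks as threads are added. That would be consistent with the nearly identical clock() figures reported for the pthread, omp and serial versions.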
The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():
double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);
For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:
struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));
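The two timers above can be combined into one small helper. This is a sketch under stated assumptions, not a definitive implementation: WallTimer, elapsed_seconds and fill_sin_table are hypothetical names, and the helper picks omp_get_wtime() when the compiler defines _OPENMP (as g++ does with -fopenmp) and falls back to clock_gettime() with CLOCK_REALTIME otherwise. Note that with glibc older than 2.17, which probably includes Ubuntu 12.04, clock_gettime() also needs -lrt at link time.

```cpp
#include <cassert>
#include <cmath>
#include <time.h>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: seconds between two timespec readings.
// The integer fields are subtracted first so the large tv_sec values
// cancel before anything is converted to double.
inline double elapsed_seconds(const timespec& start, const timespec& finish) {
    return static_cast<double>(finish.tv_sec - start.tv_sec)
         + 1.0e-9 * static_cast<double>(finish.tv_nsec - start.tv_nsec);
}

// Hypothetical wall-clock stopwatch built on the two timers from the answer.
class WallTimer {
public:
    WallTimer() { reset(); }
    void reset() { start_ = read(); }
    // Wall-clock seconds since construction or the last reset().
    double seconds() const { return diff(start_, read()); }

private:
#ifdef _OPENMP
    typedef double Stamp;
    static Stamp read() { return omp_get_wtime(); }
    static double diff(Stamp a, Stamp b) { return b - a; }
#else
    typedef timespec Stamp;
    static Stamp read() {
        timespec ts;
        const int rc = clock_gettime(CLOCK_REALTIME, &ts);
        assert(rc == 0);  // clock_gettime returns 0 on success
        (void)rc;
        return ts;
    }
    static double diff(const Stamp& a, const Stamp& b) {
        return elapsed_seconds(a, b);
    }
#endif
    Stamp start_;
};

// Same computation as the loop in the question, on a caller-sized table.
// Without -fopenmp the pragma is ignored and the loop runs serially.
inline void fill_sin_table(std::vector<double>& table) {
    const double kPi = 3.14159265358979323846;
    const long size = static_cast<long>(table.size());
#pragma omp parallel for
    for (long n = 0; n < size; ++n)
        table[n] = std::sin(2 * kPi * n / size);
}
```

A caller would construct a WallTimer just before fill_sin_table() and print seconds() right after it. If system clock adjustments during the measurement are a concern, CLOCK_MONOTONIC could be substituted for CLOCK_REALTIME; that is a variation on the answer, not something it states.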