OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?
I'm currently implementing a dynamic programming algorithm for solving knapsack problems. Therefore my code has two for-loops, an outer and an inner loop.
From a logical point of view I can parallelize the inner for loop, since the calculations there are independent of each other. The outer for loop cannot be parallelized because of dependencies.
So this was my first approach:
for(int i=1; i < itemRows; i++){
int itemsIndex = i-1;
int itemWeight = integerItems[itemsIndex].weight;
int itemWorth = integerItems[itemsIndex].worth;
#pragma omp parallel for if(weightColumns > THRESHOLD)
for(int c=1; c < weightColumns; c++){
if(c < itemWeight){
table[i][c] = table[i-1][c];
}else{
int worthOfNotUsingItem = table[i-1][c];
int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
}
}
}
The code works well, and the algorithm solves the problems correctly. Then I was thinking about optimizing this, since I was not sure how OpenMP's thread management works. I wanted to prevent unnecessary initialization of the threads during each iteration, so I put an outer parallel block around the outer loop.
Second approach:
#pragma omp parallel if(weightColumns > THRESHOLD)
{
for(int i=1; i < itemRows; i++){
int itemsIndex = i-1;
int itemWeight = integerItems[itemsIndex].weight;
int itemWorth = integerItems[itemsIndex].worth;
#pragma omp for
for(int c=1; c < weightColumns; c++){
if(c < itemWeight){
table[i][c] = table[i-1][c];
}else{
int worthOfNotUsingItem = table[i-1][c];
int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
}
}
}
}
This has an unwanted side effect: everything inside the parallel block is now executed n times, where n is the number of available cores. I already tried to use the single and critical pragmas to force the outer for-loop to be executed by one thread, but then I cannot have the inner loop computed by multiple threads unless I open a new parallel block (and then there is no gain in speed). But never mind, because the good thing is: this does not affect the result. The problems are still solved correctly.
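(For the record, a conforming way to get the single-threaded setup while still sharing the inner loop across the team would be a single construct with a copyprivate clause. The following is only a sketch of that idea, using the same variable names, not code I actually benchmarked:)
#pragma omp parallel if(weightColumns > THRESHOLD)
{
    for(int i=1; i < itemRows; i++){
        int itemWeight, itemWorth;   // declared inside the region: private to each thread
        // one thread reads the item; copyprivate broadcasts the values
        // to all other threads at the implicit barrier ending the single
        #pragma omp single copyprivate(itemWeight, itemWorth)
        {
            itemWeight = integerItems[i-1].weight;
            itemWorth = integerItems[i-1].worth;
        }
        #pragma omp for
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
    }
}
Note that the single adds one more implicit barrier per row, so this would not remove any synchronization either.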
NOW THE STRANGE THING: The second approach is FASTER than the first one!
How can this be? I mean, although the outer for-loop is calculated n times (in parallel) and the inner for-loop is distributed n times among n cores, it is faster than the first approach, which calculates the outer loop only once and distributes the workload of the inner for-loop evenly.
At first I was thinking: "well yeah, it's probably because of thread management", but then I read that OpenMP pools the instantiated threads, which speaks against my assumption. Then I disabled compiler optimization (compiler flag -O0) to check whether it had something to do with that. But this did not affect the measurement.
Can anybody shed more light on this, please?
The measured times for solving the knapsack problem containing 7500 items with a max capacity of 45000 (creating a matrix of 7500x45000, which is well over the THRESHOLD variable used in the code):
- Approach 1: ~0.88s
- Approach 2: ~0.52s
Thanks in advance,
phineliner
EDIT:
Measurement of a more complex problem: added 2500 items to the problem (from 7500 to 10000). More complex problems currently can't be handled for memory reasons.
- Approach 1: ~1.19s
- Approach 2: ~0.71s
EDIT2: I was mistaken about the compiler optimization: it does not affect the measurement. At least I cannot reproduce the difference I measured before. I edited the question text accordingly.
Solution
Let's first consider what your code is doing. Essentially your code is transforming a matrix (2D array) where the values of the rows depend on the previous row but the values of the columns are independent of the other columns. Let me choose a simpler example of this:
for(int i=1; i<n; i++) {
for(int j=0; j<n; j++) {
a[i*n+j] += a[(i-1)*n+j];
}
}
One way to parallelize this is to swap the loops like this
Method 1:
#pragma omp parallel for
for(int j=0; j<n; j++) {
for(int i=1; i<n; i++) {
a[i*n+j] += a[(i-1)*n+j];
}
}
With this method each thread runs all n-1 iterations of i of the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly.
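To see where the cache unfriendliness comes from, here is roughly what method 1's static distribution looks like written out by hand (a sketch assuming the same a and n as above; omp_get_thread_num and omp_get_num_threads come from omp.h):
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    // each thread owns a contiguous strip of columns [j0, j1)
    int j0 = (int)((long long)n * tid / nth);
    int j1 = (int)((long long)n * (tid + 1) / nth);
    for(int j=j0; j<j1; j++)
        for(int i=1; i<n; i++)
            a[i*n+j] += a[(i-1)*n+j];  // stride-n access: a new cache line
                                       // on nearly every iteration of i
}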
Another possibility is to only parallelize the inner loop.
Method 2:
for(int i=1; i<n; i++) {
#pragma omp parallel for
for(int j=0; j<n; j++) {
a[i*n+j] += a[(i-1)*n+j];
}
}
This essentially processes the columns in a single row in parallel but each row sequentially. The values of i are only run by the master thread.
Another way to process the columns in parallel but each row sequentially is:
Method 3:
#pragma omp parallel
for(int i=1; i<n; i++) {
#pragma omp for
for(int j=0; j<n; j++) {
a[i*n+j] += a[(i-1)*n+j];
}
}
In this method, like method 1, each thread runs over all n-1 iterations of i. However, this method has an implicit barrier after the inner loop, which causes each thread to pause until all threads have finished a row, making this method sequential for each row like method 2.
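In other words, method 3 behaves as if the barrier were written out explicitly (a plain omp for without nowait implies a barrier at the end of the loop):
#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
    #pragma omp barrier  // this is what the plain omp for does implicitly
}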
The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause.
Method 4:
#pragma omp parallel
for(int i=1; i<n; i++) {
#pragma omp for nowait
for(int j=0; j<n; j++) {
a[i*n+j] += a[(i-1)*n+j];
}
}
In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
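One caveat worth spelling out (my addition): dropping the barrier is only correct because of static scheduling, and strictly speaking one should write schedule(static) explicitly to get the guarantee from the OpenMP spec:
#pragma omp parallel
for(int i=1; i<n; i++) {
    // schedule(static) guarantees that the same j-chunks go to the same
    // threads in every row, so with nowait a thread only ever reads
    // a[(i-1)*n+j] values that it wrote itself in the previous row
    #pragma omp for schedule(static) nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}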
Here are the times in seconds for n=3000 on my four-core IVB system with GCC 4.9.2:
method 1: 3.00
method 2: 0.26
method 3: 0.21
method 4: 0.21
This test is probably memory-bandwidth bound, so I could have chosen a better case using more computation, but nevertheless the differences are significant enough. To remove the bias due to creating the thread pool, I ran one of the methods once without timing it first.
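For reference, a minimal self-contained harness in the spirit of this test (a sketch, not the exact benchmark I ran; compile with something like gcc -O2 -fopenmp):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    int n = 3000;
    float *a = malloc(sizeof *a * (size_t)n * n);
    for (size_t k = 0; k < (size_t)n * n; k++) a[k] = 1.0f;

    #pragma omp parallel
    { }                                // warm-up: create the thread pool first

    double t0 = omp_get_wtime();
    for (int i = 1; i < n; i++) {      // method 2: one parallel region per row
        #pragma omp parallel for
        for (int j = 0; j < n; j++)
            a[i*n+j] += a[(i-1)*n+j];
    }
    printf("method 2: %.3f s\n", omp_get_wtime() - t0);
    free(a);
    return 0;
}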
It's clear from the timing how cache-unfriendly method 1 is. It's also clear that method 3 is faster than method 2 and that nowait has little effect in this case.
Since method 2 and method 3 both process the columns in a row in parallel but the rows sequentially, one might expect their timings to be the same. So why do they differ? Let me make some observations:
- Due to the thread pool, the threads are not created and destroyed for each iteration of the outer loop of method 2, so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool; this is something that each compiler implements.
- The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i, whereas in method 3 each thread processes a private i. But this seems too trivial to me to explain the significant difference between the methods, because the implicit barrier in method 3 causes them to sync anyway, and processing i is just an increment and a conditional test.
- The fact that method 3 is no slower than method 4, which processes whole strips of columns in parallel, says that the extra overhead in method 2 is all in leaving and entering a parallel region for each iteration of i.
So my conclusion is that explaining why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC, which uses pthreads, this could probably be explained by creating a toy model of a thread pool, but I don't have enough experience with that yet.
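A toy microbenchmark along these lines could at least quantify the per-region cost without reading the runtime's source (again a sketch, with the repetition count chosen to match one region per row above):
#include <stdio.h>
#include <omp.h>

int main(void) {
    int reps = 3000;

    #pragma omp parallel
    { }                                 // warm-up so pool creation is excluded

    double t0 = omp_get_wtime();
    for (int i = 0; i < reps; i++) {
        #pragma omp parallel
        { }                             // pay only the fork/join + barrier cost
    }
    double dt = omp_get_wtime() - t0;
    printf("per-region overhead: %.2f us\n", dt / reps * 1e6);
    return 0;
}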