OpenMP:嵌套并行化有什么好处?

据我了解,#pragma omp parallel 及其变体基本上是在多个并发线程中执行以下块,这与 CPU 的数量相对应.当有嵌套并行化时 - 并行内并行化,并行函数内并行化等 - 内部并行化会发生什么?

From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. When having nested parallelizations - parallel for within parallel for, parallel function within parallel function etc. - what happens on the inner parallelization?

我是 OpenMP 的新手,我想到的情况可能相当简单 - 将向量与矩阵相乘.这是在两个嵌套的 for 循环中完成的.假设 CPU 的数量小于向量中的元素数量,尝试并行运行内部循环有什么好处?总线程数会大于CPU数,还是内循环会顺序执行?

I'm new to OpenMP, and the case I have in mind is probably rather trivial - multiplying a vector with a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?

推荐答案

(1) OpenMP 中的嵌套并行:http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html

(1) Nested parallelism in OpenMP: http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html

您需要通过设置OMP_NESTEDomp_set_nested 来开启嵌套并行,因为很多实现默认关闭了这个功能,甚至有些实现并不完全支持嵌套并行.如果打开,每当您遇到 parallel for 时,OpenMP 将创建 OMP_NUM_THREADS 中定义的线程数.因此,如果是 2 级并行,则线程总数将为 N^2,其中 N = OMP_NUM_THREADS.

You need to turn on nested parallelism by setting OMP_NESTED or omp_set_nested because many implementations turn off this feature by default, even some implementations didn't support nested parallelism fully. If turned on, whenever you meet parallel for, OpenMP will create the number of threads as defined in OMP_NUM_THREADS. So, if 2-level parallelism, the total number of threads would be N^2, where N = OMP_NUM_THREADS.

这种嵌套的并行性会导致过度订阅(即繁忙线程的数量大于核心数),这可能会降低加速比.在极端情况下,嵌套并行被递归调用,线程可能会膨胀(例如,创建 1000 个线程),而计算机只会浪费时间进行上下文切换.在这种情况下,您可以通过设置omp_set_dynamic来动态控制线程数.

Such nested parallelism will cause oversubscription, (i.e., the number of busy threads is greater than the cores), which may degrade the speedup. In an extreme case, where nested parallelism is called recursively, threads could be bloated (e.g., creating 1000s threads), and computer just wastes time for context switching. In such case, you may control the number of threads dynamically by setting omp_set_dynamic.

(2) 矩阵向量乘法示例:代码如下:

(2) An example of matrix-vector multiplication: the code would look like:

// Input:  A(N by M), B(M by 1)
// Output: C(N by 1)
for (int i = 0; i < N; ++i)
  for (int j = 0; j < M; ++j)
     C[i] += A[i][j] * B[j];

通常,由于线程的分叉/连接开销,在可能存在外循环的情况下并行化内循环是不好的.(尽管许多 OpenMP 实现预创建线程,但它仍然需要一些将任务分派到线程并在 parallel-for 结束时调用隐式障碍)

In general, parallelizing inner loops while outer loops are possible is bad because of forking/joining overhead of threads. (though many OpenMP implementations pre-create threads, it still requires some to dispatch tasks to threads and to call implicit barrier at the end of parallel-for)

您担心的是 N <# CPU.对对对,这种情况下,加速会受限于N,让嵌套并行肯定会有好处.

Your concern is the case of where N < # of CPU. Yes, right, in this case, the speedup would be limited by N, and letting nested parallelism will definitely have benefits.

然而,如果 N 足够大,那么代码会导致超额订阅.我只是在考虑以下解决方案:

However, then the code would cause oversubscription if N is sufficiently large. I'm just thinking the following solutions:

  • 更改循环结构,以便仅存在 1 级循环.(看起来可行)
  • 专门化代码:如果 N 很小,则执行嵌套并行,否则不要这样做.
  • 使用 omp_set_dynamic 嵌套并行.但是,请确定omp_set_dynamic 是如何控制线程数和线程活动的.实现可能会有所不同.
  • Changing the loop structure so that only 1-level loop exists. (It looks doable)
  • Specializing the code: if N is small, then do nested parallelism, otherwise don't do that.
  • Nested parallelism with omp_set_dynamic. But, please make it sure how omp_set_dynamic controls the number of threads and the activity of threads. Implementations may vary.

相关文章