
How does OpenMP use atomic instructions inside the reduction construct? Doesn't it rely on atomic instructions at all?

For instance, is the variable sum in the code below accumulated with atomic '+' operator?

#include <omp.h>
#include <vector>

using namespace std;
int main()
{
  int m = 1000000; 
  vector<int> v(m);
  for (int i = 0; i < m; i++)
    v[i] = i;

  int sum = 0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < m; i++)
    sum += v[i];
}

Solution

How does OpenMP use atomic instructions inside a reduction? Doesn't it rely on atomics at all?

Since the OpenMP standard does not specify how the reduction clause should (or not) be implemented (e.g., based on atomic operations or not), its implementation may vary depending on each concrete implementation of the OpenMP standard.

For instance, is the variable sum in the code below accumulated with atomic + operator?

Nonetheless, from the OpenMP standard, one can read the following:

The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.

So based on that, one can infer that the variables used in the reduction clause will be private and, consequently, will not be updated atomically. Notwithstanding, even if that were not the case, it would be unlikely that a concrete implementation of the OpenMP standard would rely on atomic operations for the instruction sum += v[i];, since (in this case) that would not be the most efficient strategy. For more information on why that is the case, check the following SO threads:

  1. Why my parallel code using openMP atomic takes a longer time than serial code?;
  2. Why should I use a reduction rather than an atomic variable?.

Very informally, a more efficient approach than using atomic would be for each thread to have its own copy of the variable sum and, at the end of the parallel region, save that copy into a resource shared among threads -- now, depending on how the reduction is implemented, atomic operations might be used to update that shared resource. That resource would then be picked up by the master thread, which would reduce its content and update the original sum variable accordingly.

More formally, from OpenMP Reductions Under the Hood:

After having revisited parallel reductions in detail you might still have some open questions about how OpenMP actually transforms your sequential code into parallel code. In particular, you might wonder how OpenMP detects the portion in the body of the loop that performs the reduction. As an example, this or a similar code fragment can often be found in code samples:

 #pragma omp parallel for reduction(+:x)
 for (int i = 0; i < n; i++)
     x -= some_value;

You could also use - as the reduction operator (which is actually redundant to +). But how does OpenMP isolate the update step x -= some_value? The discomforting answer is that OpenMP does not detect the update at all! The compiler treats the body of the for-loop like this:

#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x = some_expression_involving_x_or_not(x);

As a result, the modification of x could also be hidden behind an opaque function call. This is a comprehensible decision from the point of view of a compiler developer. Unfortunately, this means that you have to ensure that all updates of x are compatible with the operation defined in the reduction clause.

The overall execution flow of a reduction can be summarized as follows:

  1. Spawn a team of threads and determine the set of iterations that each thread j has to perform.
  2. Each thread declares a privatized variant of the reduction variable x initialized with the neutral element e of the corresponding monoid.
  3. All threads perform their iterations no matter whether or how they involve an update of the privatized variable.
  4. The result is computed as sequential reduction over the (local) partial results and the global variable x. Finally, the result is written back to x.
