Parallelizing breadth-first search


I just taught myself some OpenMP and this could be a stupid question. Basically, I'm trying to parallelize a breadth-first search program in C++, where each node takes a long time to process. Here's some example code:

queue<node*> q;
q.push(head);
while (!q.empty()) {
  int qSize = q.size();
  for (int i = 0; i < qSize; i++) {
    node* currNode = q.front();
    q.pop();
    doStuff(currNode);
    q.push(currNode);
  }
}


The processing function doStuff() is quite expensive and I want to parallelize it. However, if I parallelize the for loop by putting #pragma omp parallel for right before the for line, all kinds of weird errors pop up at runtime. I'm guessing the reason is that this way q.front() and q.push() will also run concurrently, and multiple threads may get the same node from q.front() (because they all call it before any q.push has completed).


How can I get around this?

Accepted answer


The solution is to protect access to the queue with a critical section.

queue<node*> q;
q.push(head);
while (!q.empty()) {
  int qSize = q.size();
  #pragma omp parallel for
  for (int i = 0; i < qSize; i++) {
    node* currNode;
    #pragma omp critical
    {
      currNode = q.front();
      q.pop();
    }
    doStuff(currNode);
    #pragma omp critical
    q.push(currNode);
  }
}


This is similar to having a common mutex and locking it.
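To make the analogy concrete, here is a minimal sketch of the same two protected operations written with an explicit std::mutex instead of the unnamed critical sections (node is a hypothetical stand-in type, not from the original code):

```cpp
#include <mutex>
#include <queue>

// Hypothetical node type, just for illustration.
struct node { int value; };

std::mutex queueMutex;  // plays the role of the unnamed critical sections

// Pop the front node under the lock, mirroring the first critical section.
node* lockedPop(std::queue<node*>& q) {
    std::lock_guard<std::mutex> lock(queueMutex);
    node* n = q.front();
    q.pop();
    return n;
}

// Push a node under the same lock, mirroring the second critical section.
void lockedPush(std::queue<node*>& q, node* n) {
    std::lock_guard<std::mutex> lock(queueMutex);
    q.push(n);
}
```

Note that all unnamed OpenMP critical sections share one implicit lock, which is why a single mutex is the right analogy here.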


This version has some efficiency limits: at the end of each pass of the for loop, some threads may sit idle even though there is work in the queue. Making a version where threads work continuously whenever the queue is non-empty is tricky, mainly in handling the situation where the queue is empty but some threads are still computing (and may yet push new nodes).
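One way to sketch such a "continuous work" version is with OpenMP tasks: each node spawns a task per successor, so there is no barrier between BFS levels and the empty-queue-but-threads-still-computing case is handled by the runtime's task bookkeeping. This is only a sketch under assumptions not in the original: nodes carry an explicit children list, and doStuff is thread-safe (here a stand-in atomic accumulation).

```cpp
#include <atomic>
#include <vector>

// Hypothetical node with an explicit successor list (the original code
// does not show how children are produced).
struct node {
    int value;
    std::vector<node*> children;
};

std::atomic<long> total{0};  // stand-in for doStuff's real result

void doStuff(node* n) { total += n->value; }  // assumed thread-safe work

// Process one node, then spawn a task per child: there is no barrier
// between BFS "levels", so threads stay busy as long as any tasks exist.
void process(node* n) {
    doStuff(n);
    for (node* child : n->children) {
        #pragma omp task firstprivate(child)
        process(child);
    }
}

void traverse(node* head) {
    #pragma omp parallel
    #pragma omp single
    {
        process(head);
    }  // the barrier closing the parallel region waits for all tasks
}
```

Compiled without OpenMP the pragmas are ignored and this degenerates to a plain serial traversal, which is convenient for debugging.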


Depending on the data size involved in a node, cache effects and false sharing may also have a significant performance impact, but that can't be discussed without a specific example. The simple version will probably be efficient enough in many cases, while getting optimal performance may become arbitrarily complex.


In any case, you have to ensure that doStuff does not modify any global or shared state.
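For example, if doStuff accumulated into a plain global counter, concurrent iterations would race on it. One hypothetical fix (the counter and field names are illustrative, not from the question) is to make the shared update atomic:

```cpp
#include <atomic>

struct node { int weight; };  // hypothetical payload

// Shared result; the atomic update is the only shared state doStuff
// touches, so concurrent calls from the parallel loop are safe.
std::atomic<long> visitedWeight{0};

void doStuff(node* n) {
    visitedWeight.fetch_add(n->weight, std::memory_order_relaxed);
}
```

Anything more elaborate than a counter (e.g. writes to a shared container) would again need a critical section or per-thread storage combined afterwards.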
