How do I parallelize a for loop through a C++ std::list using OpenMP?
I would like to iterate through all elements of a std::list in parallel using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a random access iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0, as I don't have full control over which compilers are available to me.
If my container were a vector, I might use:
#pragma omp parallel for
for (auto it = v.begin(); it != v.end(); ++it) {
    it->process();
}
I understand that I could copy the list into a vector, run the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.
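For reference, the copy-based workaround described above might look roughly like the sketch below. `Item` and `process_via_copy` are hypothetical names standing in for the real element type and call site; the loop uses a signed integer index, which is the form OpenMP 2.x accepts.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Hypothetical element type; stands in for whatever the list stores.
struct Item {
    int value;
    void process() { value += 1; }
};

// Hypothetical helper: copy out, process in parallel, copy back.
void process_via_copy(std::list<Item>& items) {
    // Copy the elements into contiguous storage.
    std::vector<Item> buffer(items.begin(), items.end());

    // Signed loop index, as required by the OpenMP 2.x loop construct.
    const int n = static_cast<int>(buffer.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        buffer[i].process();
    }

    // Copy the processed elements back into the existing list nodes.
    std::copy(buffer.begin(), buffer.end(), items.begin());
}
```

This copies every element twice, which is exactly the overhead the question hopes to avoid.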
Recommended answer
If you decide to use OpenMP 3.0, you can use the task feature:
#pragma omp parallel
#pragma omp single
{
    for(auto it = l.begin(); it != l.end(); ++it)
        #pragma omp task firstprivate(it)
        it->process();
    #pragma omp taskwait
}
This executes the loop in one thread, but delegates the processing of the elements to the other threads.
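A self-contained version of the task approach could be sketched as follows. `Item` and `process_with_tasks` are hypothetical names used only for illustration, and the sketch assumes a compiler with OpenMP 3.0 or later; without OpenMP the pragmas are ignored and the loop runs serially.

```cpp
#include <list>

// Hypothetical element type; stands in for whatever the list stores.
struct Item {
    int value;
    void process() { value *= 2; }
};

// Hypothetical helper wrapping the task-based loop.
void process_with_tasks(std::list<Item>& items) {
    #pragma omp parallel
    #pragma omp single
    {
        // One thread walks the list and creates one task per element.
        for (std::list<Item>::iterator it = items.begin(); it != items.end(); ++it) {
            // firstprivate gives each task its own copy of the iterator.
            #pragma omp task firstprivate(it)
            it->process();
        }
        // Wait until all tasks created above have finished.
        #pragma omp taskwait
    }
}
```

One task per element is only worthwhile if `process()` does a reasonable amount of work; for very cheap operations the task overhead may dominate.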
Without OpenMP 3.0, the easiest way is to write pointers to all elements of the list (or iterators) into a vector and iterate over that. This way you don't have to copy anything back, and you avoid copying the elements themselves, so it shouldn't add too much overhead:
std::vector<my_element*> elements; // my_element is whatever is in the list
for(auto it = list.begin(); it != list.end(); ++it)
    elements.push_back(&(*it));
const int count = static_cast<int>(elements.size()); // OpenMP 2.x needs a signed loop index
#pragma omp parallel shared(elements)
{
    #pragma omp for
    for(int i = 0; i < count; ++i) // or use iterators in newer OpenMP
        elements[i]->process();
}
If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM), or split the range into roughly equal contiguous parts before iterating and then iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently being processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:
#pragma omp parallel
{
    // needs <omp.h> for the omp_get_* calls and <iterator> for std::advance
    int thread_count = omp_get_num_threads();
    int thread_num = omp_get_thread_num();
    size_t chunk_size = list.size() / thread_count;
    auto begin = list.begin();
    std::advance(begin, thread_num * chunk_size);
    auto end = begin;
    if(thread_num == thread_count - 1) // last thread iterates the remaining sequence
        end = list.end();
    else
        std::advance(end, chunk_size);
    #pragma omp barrier
    for(auto it = begin; it != end; ++it)
        it->process();
}
The barrier is not strictly needed. However, if process mutates the processed element (meaning it is not a const method), there might be some false sharing without it, because threads would iterate over a sequence that is already being mutated. This approach iterates over the sequence about 3*n times in total (where n is the number of threads), so scaling might be less than optimal for a high number of threads.
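For comparison, the interleaved variant mentioned earlier could be sketched as below. This is not KennyTM's original code, only an illustration of the idea; `Item` and `process_interleaved` are hypothetical names, and the serial fallbacks are an assumption added so the sketch also compiles without OpenMP.

```cpp
#include <list>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch still compiles without OpenMP.
static int omp_get_num_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical element type; stands in for whatever the list stores.
struct Item {
    int value;
    void process() { value *= 2; }
};

// Hypothetical helper: thread t processes elements t, t + n, t + 2n, ...
void process_interleaved(std::list<Item>& items) {
    #pragma omp parallel
    {
        const int thread_count = omp_get_num_threads();
        const int thread_num = omp_get_thread_num();
        int index = 0;
        // Every thread walks the whole list but only processes the
        // elements whose position maps to its own thread number.
        for (std::list<Item>::iterator it = items.begin(); it != items.end(); ++it, ++index) {
            if (index % thread_count == thread_num) {
                it->process();
            }
        }
    }
}
```

Each thread still traverses every node, and neighbouring nodes are handled by different threads, which is why the contiguous split is likely to behave better with respect to false sharing.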
To reduce the overhead, you could generate the ranges outside of the #pragma omp parallel region. However, you then need to know how many threads will form the parallel section. So you'd probably have to set num_threads manually, or use omp_get_max_threads() and handle the case where fewer threads are created than omp_get_max_threads() returns (it is only an upper bound). That case can be handled by assigning some threads several chunks (using #pragma omp for should do that):
int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads);
size_t chunk_size = list.size() / max_threads;
auto cur_iter = list.begin();
for(int i = 0; i < max_threads - 1; ++i)
{
    auto last_iter = cur_iter;
    std::advance(cur_iter, chunk_size);
    chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end())); // last chunk takes the remainder
#pragma omp parallel shared(chunks)
{
    #pragma omp for
    for(int i = 0; i < max_threads; ++i)
        for(auto it = chunks[i].first; it != chunks[i].second; ++it)
            it->process();
}
This takes only three iterations over list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).
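Since the snippet above leaves the list's element type elided, a compilable variant with a concrete type may help. The sketch below assumes a hypothetical `Item` type and hypothetical helpers `make_chunks` and `process_chunks`; the chunk count is passed in by the caller (for example the result of omp_get_max_threads()) and must be at least 1.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <utility>
#include <vector>

// Hypothetical element type; stands in for whatever the list stores.
struct Item {
    int value;
    void process() { value += 10; }
};

typedef std::list<Item>::iterator ItemIter;
typedef std::pair<ItemIter, ItemIter> Chunk;

// Split the list into chunk_count contiguous ranges (chunk_count >= 1).
// The last range also receives the remainder of the division.
std::vector<Chunk> make_chunks(std::list<Item>& items, int chunk_count) {
    std::vector<Chunk> chunks;
    chunks.reserve(chunk_count);
    const std::size_t chunk_size = items.size() / chunk_count;
    ItemIter cur = items.begin();
    for (int i = 0; i < chunk_count - 1; ++i) {
        ItemIter first = cur;
        std::advance(cur, chunk_size);
        chunks.push_back(std::make_pair(first, cur));
    }
    chunks.push_back(std::make_pair(cur, items.end()));
    return chunks;
}

// Process each chunk; OpenMP distributes the chunks over the threads.
void process_chunks(std::vector<Chunk>& chunks) {
    const int n = static_cast<int>(chunks.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (ItemIter it = chunks[i].first; it != chunks[i].second; ++it) {
            it->process();
        }
    }
}
```

Keeping the splitting separate from the processing means the chunk list can be reused for several passes, as long as no elements are inserted or erased in between.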