Synchronizing push_back and std::thread

My code:

void build(std::vector<RKD<DivisionSpace>>& roots, ...) {
  try {
    // A local lock_guard locking mtx guarantees unlocking on destruction / exception:
    std::lock_guard<std::mutex> lck(mtx);
    roots.push_back(RKD<DivisionSpace>(...));
  }
  catch (const std::bad_alloc&) {
    std::cout << "[exception caught when constructing tree]\n";
    return;
  }
}


Now, the actual work should be done serially, not in parallel.


The constructor of RKD can run in parallel with other RKD constructors. However, pushing the objects back into the std::vector is a critical section, right?


The number of objects I am going to build is known. In practice it will be in the range [2, 16]; in theory it could be any positive number.


Also, I am not interested in the order in which they are inserted into the container.

So I could do something like this:

RKD tree = RKD(...);
mutex_lock(...);
roots.push_back(tree);


However this would imply copying, wouldn't it?


What should I do to make my code parallel?


I decided to use a lock (instead of just a mutex) because of this answer.

Answer


The suggestion that Tomasz Lewowski has brought up in his comment and I have expanded upon is pretty simple and based upon the following observation: A push_back on a std::vector potentially needs to re-allocate the backing store and copy (or, preferably, move) the elements. This constitutes a critical section that needs to be synchronized.


For the next examples, assume we want to have a vector filled with the first 12 primes but we don't care about their ordering. (I have just hard-coded the numbers here but assume they are obtained via some expensive computation that makes sense to do in parallel.) There is a dangerous race condition in the following scenario.

std::vector<int> numbers {};  // an empty vector

// thread A             // thread B             // thread C

numbers.push_back( 2);  numbers.push_back(11);  numbers.push_back(23);
numbers.push_back( 3);  numbers.push_back(13);  numbers.push_back(29);
numbers.push_back( 5);  numbers.push_back(17);  numbers.push_back(31);
numbers.push_back( 7);  numbers.push_back(19);  numbers.push_back(37);


There is also another problem with push_back: if two threads call it simultaneously, they will both attempt to construct an object at the same index, with potentially disastrous consequences. So the problem is not solved by calling reserve(n) before forking the threads.


However, since you know the number of elements in advance, you can simply assign them to a specific location inside a std::vector without changing its size. If you don't change the size, there is no critical section. Therefore, there is no race in the following scenario.

std::vector<int> numbers(12);  // 12 elements initialized with 0

// thread A          // thread B          // thread C

numbers[ 0] =  2;    numbers[ 1] =  3;    numbers[ 2] =  5;
numbers[ 3] =  7;    numbers[ 4] = 11;    numbers[ 5] = 13;
numbers[ 6] = 17;    numbers[ 7] = 19;    numbers[ 8] = 23;
numbers[ 9] = 29;    numbers[10] = 31;    numbers[11] = 37;


Of course, if two threads attempt to write to the same index, the race is back. Fortunately, protecting against this is not difficult in practice. If your vector has n elements and you have p threads, thread i writes only to elements [i n / p, (i + 1) n / p). Note that this is preferable to having thread i write to element j only if j mod p = i, because contiguous ranges lead to fewer cache invalidations. So the access pattern in the above example is sub-optimal; it would better look like this.

std::vector<int> numbers(12);  // 12 elements initialized with 0

// thread A          // thread B          // thread C

numbers[ 0] =  2;    numbers[ 4] = 11;    numbers[ 8] = 23;
numbers[ 1] =  3;    numbers[ 5] = 13;    numbers[ 9] = 29;
numbers[ 2] =  5;    numbers[ 6] = 17;    numbers[10] = 31;
numbers[ 3] =  7;    numbers[ 7] = 19;    numbers[11] = 37;


So far so good. But what if you don't have a std::vector<int> but a std::vector<Foo>? If Foo does not have a default constructor, then

std::vector<Foo> numbers(10);


will be invalid. And even if it has one, it would be outrageous to create many expensive default-constructed objects just to re-assign them soon without ever retrieving the value.


Of course, most well-designed classes should have a very cheap default constructor. For example, a std::string is default constructed to an empty string that requires no memory allocation. A good implementation will reduce the cost of default-constructing a string to just

std::memset(this, 0, sizeof(std::string));


And if the compiler is smart enough to figure out that we are allocating and initializing an entire std::vector<std::string>(n), it might be able to optimize this further to a single call to

std::calloc(n, sizeof(std::string));


So if there is any chance you can make Foo cheaply default-constructible and assignable, you are done. However, if this turns out to be difficult, you can avoid the problem by moving it to the heap. A smart pointer is cheaply default-constructible, so

std::vector<std::unique_ptr<Foo>> foos(n);

will eventually reduce to a single

std::calloc(n, sizeof(std::unique_ptr<Foo>));


without you doing anything to Foo. Of course, this convenience comes at the price of a dynamic memory allocation for each element.

std::vector<std::unique_ptr<Foo>> foos(n);

// thread A                    // thread B                           // thread C

foos[0].reset(new Foo {...});  foos[n / 3 + 0].reset(new Foo {...});  foos[2 * n / 3 + 0].reset(new Foo {...});
foos[1].reset(new Foo {...});  foos[n / 3 + 1].reset(new Foo {...});  foos[2 * n / 3 + 1].reset(new Foo {...});
foos[2].reset(new Foo {...});  foos[n / 3 + 2].reset(new Foo {...});  foos[2 * n / 3 + 2].reset(new Foo {...});
...                            ...                                    ...


This might not be as bad as you might think because, while dynamic memory allocations are not free, the sizeof a std::unique_ptr is very small, so if sizeof(Foo) is large, you get the bonus of a more compact vector that is faster to iterate. It all depends, of course, on how you intend to use your data.


If you don't know the exact number of elements in advance or are afraid you'll mess up the indexing, there is yet another way to do it: Have each thread fill its own vector and merge them at the end. Continuing the primes example, we get this.

std::vector<int> numbersA {};  // private store for thread A
std::vector<int> numbersB {};  // private store for thread B
std::vector<int> numbersC {};  // private store for thread C

// thread A              // thread B              // thread C

numbersA.push_back( 2);  numbersB.push_back(11);  numbersC.push_back(23);
numbersA.push_back( 3);  numbersB.push_back(13);  numbersC.push_back(29);
numbersA.push_back( 5);  numbersB.push_back(17);  numbersC.push_back(31);
numbersA.push_back( 7);  numbersB.push_back(19);  numbersC.push_back(37);

// Back on the main thread after A, B and C are joined:

std::vector<int> numbers(
    numbersA.size() + numbersB.size() + numbersC.size());
auto pos = numbers.begin();
pos = std::move(numbersA.begin(), numbersA.end(), pos);
pos = std::move(numbersB.begin(), numbersB.end(), pos);
pos = std::move(numbersC.begin(), numbersC.end(), pos);
assert(pos == numbers.end());

// Now dispose of numbersA, numbersB and numbersC as soon as possible
// in order to release their no longer needed memory.


(The std::move used in the above code is the one from the algorithms library.)


This approach has the most desirable memory access pattern of all because numbersA, numbersB and numbersC write to completely independently allocated memory. Of course, we have to do the additional sequential work of joining the intermediate results. Note that the efficiency relies heavily on the fact that the cost of move-assigning an element is negligible compared to the cost of finding / creating it. At least as written above, the code also assumes that your type has a cheap default constructor. Of course, if that is not the case for your type, you can again use smart pointers.


I hope this provided you with enough ideas to optimize your problem.


If you have never used smart pointers before, have a look at "RAII and smart pointers in C++" and check out the standard library's dynamic memory management facilities. The techniques shown above would of course also work with a std::vector<Foo *>, but we no longer use resource-owning raw pointers like that in modern C++.
