由于标准容器中元素的默认初始化导致性能下降
(是的,我知道有一个问题 几乎是同一个标题,但答案并不令人满意,见下文)
(Yes, I know there is a question with almost the same title, but the answer was not satisfactory, see below)
编辑对不起,原来的问题没有使用编译器优化.现在已修复此问题,但为了避免琐碎的优化并更接近我的实际用例,测试已分为两个编译单元.
std::vector<>
的构造函数具有线性复杂性这一事实对于性能关键型应用程序来说是个麻烦事.考虑这个简单的代码
The fact that the constructor of std::vector<>
has linear complexity is a nuisance when it comes to performance-critical applications. Consider this simple code
// compilation unit 1:
void set_v0(type*x, size_t n)
{
for(size_t i=0; i<n; ++i)
x[i] = simple_function(i);
}
// compilation unit 2:
std::vector<type> x(n); // default initialisation is wasteful
set_v0(x.data(),n); // over-writes initial values
当大量时间被构造 x
所浪费时.正如 this question 所探讨的,解决此问题的传统方法似乎只是保留存储并使用push_back()
填写数据:
when a significant amount of time is wasted by constructing x
. The conventional way around this, as explored by this question, seems to be to merely reserve the storage and use push_back()
to fill in the data:
// compilation unit 1:
void set_v1(std::vector<type>&x, size_t n)
{
x.reserve(n);
for(size_t i=0; i<n; ++i)
x.push_back(simple_function(i));
}
// compilation unit 2:
std::vector<type> x(); x.reserve(n); // no initialisation
set_v1(x,n); // using push_back()
但是,正如我的评论所指出的,push_back()
本质上很慢,对于足够简单的可构造对象,这使得第二种方法实际上比第一种方法慢,例如作为size_t
s,当为
However, as indicated by my comment, the push_back()
is inherently slow, making this second approach actually slower than the first one for sufficiently simply constructible objects, such as size_t
s, when for
simple_function = [](size_t i) { return i; };
我得到以下时间(使用带有 -O3 的 gcc 4.8;clang 3.2 产生的代码慢了约 10%)
I get the following timings (using gcc 4.8 with -O3; clang 3.2 produced ~10% slower code)
timing vector::vector(n) + set_v0();
n=10000 time: 3.9e-05 sec
n=100000 time: 0.00037 sec
n=1000000 time: 0.003678 sec
n=10000000 time: 0.03565 sec
n=100000000 time: 0.373275 sec
timing vector::vector() + vector::reserve(n) + set_v1();
n=10000 time: 1.9e-05 sec
n=100000 time: 0.00018 sec
n=1000000 time: 0.00177 sec
n=10000000 time: 0.020829 sec
n=100000000 time: 0.435393 sec
如果可以省略元素的默认构造,实际上可能的加速可以通过以下作弊版本来估计
The speed-up actually possible if one could elide the default construction of elements can be estimated by the following cheating version
// compilation unit 2
std::vector<type> x; x.reserve(n); // no initialisation
set_v0(x,n); // error: write beyond end of vector
// note: vector::size() == 0
当我们得到
timing vector::vector + vector::reserve(n) + set_v0(); (CHEATING)
n=10000 time: 8e-06 sec
n=100000 time: 7.2e-05 sec
n=1000000 time: 0.000776 sec
n=10000000 time: 0.01119 sec
n=100000000 time: 0.298024 sec
那么,我的第一个问题:是否有任何合法的方式来使用标准库容器来提供这些后一个时间?还是我必须自己管理内存?
So, my first question: Is there any legal way to use a standard library container which would give these latter timings? Or do I have to resort to manage the memory myself?
现在,我真正想要的是使用多线程来填充容器.幼稚的代码(为了简单起见,本例中使用了 openMP,暂时不包括 clang)
Now, what I really want, is to use multi-threading to fill in the container. The naive code (using openMP in this example for simplicity, which excludes clang for the moment)
// compilation unit 1
void set_v0(type*x, size_t n)
{
#pragma omp for // only difference to set_v0() from above
for(size_t i=0; i<n; ++i)
x[i] = simple_function(i);
}
// compilation unit 2:
std::vector<type> x(n); // default initialisation not mutli-threaded
#pragma omp parallel
set_v0(x,n); // over-writes initial values in parallel
现在所有元素的默认初始化都不是多线程的,这会导致潜在的严重性能下降.以下是 set_omp_v0()
的时间安排和等效的作弊方法(使用我的 macbook 的 4 核 8 超线程的 intel i7 芯片):
now suffers from the fact that the default initialization of all elements is not multi-threaded, resulting in an potentially serious performance degradation. Here are the timings for set_omp_v0()
and a equivalent cheating method (using my macbook's intel i7 chip with 4 cores, 8 hyperthreads):
timing std::vector::vector(n) + omp parallel set_v0()
n=10000 time: 0.000389 sec
n=100000 time: 0.000226 sec
n=1000000 time: 0.001406 sec
n=10000000 time: 0.019833 sec
n=100000000 time: 0.35531 sec
timing vector::vector + vector::reserve(n) + omp parallel set_v0(); (CHEATING)
n=10000 time: 0.000222 sec
n=100000 time: 0.000243 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.008952 sec
n=100000000 time: 0.089619 sec
请注意,作弊版比串行作弊版快约 3.3 倍,大致符合预期,但标准版则不然.
Note that the cheat version is ~3.3 times faster than the serial cheat version, roughly as expected, but the standard version is not.
那么,我的第二个问题:是否有任何合法的方法可以使用标准库容器,从而在多线程情况下提供后面的计时?
So, my second question: Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?
PS.我发现 这个问题,其中 std::vector
通过向其提供 uninitialized_allocator
被欺骗以避免默认初始化.这不再符合标准,但对我的测试用例非常有效(请参阅下面我自己的答案和 这个问题了解详情).
PS. I found this question, where std::vector
is tricked into avoiding the default initialization by providing it with a uninitialized_allocator
.
This is no longer standard compliant, but works very well for my test case (see my own answer below and this question for details).
推荐答案
好的,这是我提出这个问题后学到的.
Okay, here is what I've learned since asking this question.
Q1(是否有任何合法的方法可以使用标准库容器来提供后面这些时间?)是的在某种程度上,如 Mark 和 Evgeny 的回答所示.为 std::vector
的构造函数提供生成器的方法省略了默认构造.
Q1 (Is there any legal way to use a standard library container which would give these latter timings?) Yes to some degree, as shown in the answers by Mark and Evgeny. The method of providing a generator to the constructor of std::vector
elides the default construction.
Q2(是否有任何合法的方法可以使用标准库容器,从而在多线程情况下提供后面这些时间?)没有,我不这么认为.原因是在构造时任何符合标准的容器都必须初始化其元素,以确保对元素析构函数的调用(在容器的销毁或调整大小时)是格式良好.由于std库容器不支持多线程构造元素,Q1的技巧在这里无法复制,所以不能省略默认构造.
Q2 (Is there any legal way to use a standard library container which would give these latter timings in multi-threaded situations?) No, I don't think so. The reason is that on construction any standard-compliant container must initialise its elements, already to ensure that the call to the element destructors (upon destruction or resizing of the container) is well-formed. As the std library containers do not support the usage of multi-threading in constructing their elements, the trick of Q1 cannot be replicated here, so we cannot elide the default construction.
因此,如果我们想使用 C++ 进行高性能计算,在管理大量数据时,我们的选择会受到一定限制.我们可以
Thus, if we want to use C++ for high-performance computing our options are somewhat limited when it comes to managing large amounts of data. We can
1声明一个容器对象,并在同一个编译单元中立即填充它(同时),当编译器希望在构造时优化初始化时;
1 declare a container object and, in the same compilation unit, immediately fill it (concurrently), when the compiler hopefully optimizes the initialization at construction away;
2 使用 new[]
和 delete[]
甚至 malloc()
和 free()
,当所有的内存管理,在后一种情况下,元素的构造是我们的责任,我们对 C++ 标准库的潜在使用非常有限.
2 resort to new[]
and delete[]
or even malloc()
and free()
, when all the memory management and, in the latter case, construction of elements is our responsibility and our potential usage of the C++ standard library very limited.
3 欺骗 std::vector
不使用自定义 unitialised_allocator
来初始化其元素默认构造.遵循 想法 Jared Hoberock 这样的分配器可能看起来像这样(另请参阅 这里):
3 trick a std::vector
to not initialise its elements using a custom unitialised_allocator
that elides the default construction. Following the ideas of Jared Hoberock such an allocator could look like this (see also here):
// based on a design by Jared Hoberock
// edited (Walter) 10-May-2013, 23-Apr-2014
template<typename T, typename base_allocator = std::allocator<T> >
struct uninitialised_allocator
: base_allocator
{
static_assert(std::is_same<T,typename base_allocator::value_type>::value,
"allocator::value_type mismatch");
template<typename U>
using base_t =
typename std::allocator_traits<base_allocator>::template rebind_alloc<U>;
// rebind to base_t<U> for all U!=T: we won't leave other types uninitialised!
template<typename U>
struct rebind
{
typedef typename
std::conditional<std::is_same<T,U>::value,
uninitialised_allocator, base_t<U> >::type other;
}
// elide trivial default construction of objects of type T only
template<typename U>
typename std::enable_if<std::is_same<T,U>::value &&
std::is_trivially_default_constructible<U>::value>::type
construct(U*) {}
// elide trivial default destruction of objects of type T only
template<typename U>
typename std::enable_if<std::is_same<T,U>::value &&
std::is_trivially_destructible<U>::value>::type
destroy(U*) {}
// forward everything else to the base
using base_allocator::construct;
using base_allocator::destroy;
};
然后一个 unitialised_vector<>
模板可以这样定义:
Then an unitialised_vector<>
template could be defined like this:
template<typename T, typename base_allocator = std::allocator<T>>
using uninitialised_vector = std::vector<T,uninitialised_allocator<T,base_allocator>>;
我们仍然可以使用几乎所有标准库的功能.虽然必须说 uninitialised_allocator
,因此暗示 unitialised_vector
不符合标准,因为它的元素不是默认构造的(例如 vector
构造后不会有所有的0
).
and we can still use almost all of the standard library's functionality. Though it must be said that the uninitialised_allocator
, and hence by implication the unitialised_vector
are not standard compliant, because its elements are not default constructed (e.g. a vector<int>
will not have all 0
after construction).
当我使用这个工具解决我的小测试问题时,我得到了很好的结果:
When using this tool for my little test problem, I get excellent results:
timing vector::vector(n) + set_v0();
n=10000 time: 3.7e-05 sec
n=100000 time: 0.000334 sec
n=1000000 time: 0.002926 sec
n=10000000 time: 0.028649 sec
n=100000000 time: 0.293433 sec
timing vector::vector() + vector::reserve() + set_v1();
n=10000 time: 2e-05 sec
n=100000 time: 0.000178 sec
n=1000000 time: 0.001781 sec
n=10000000 time: 0.020922 sec
n=100000000 time: 0.428243 sec
timing vector::vector() + vector::reserve() + set_v0();
n=10000 time: 9e-06 sec
n=100000 time: 7.3e-05 sec
n=1000000 time: 0.000821 sec
n=10000000 time: 0.011685 sec
n=100000000 time: 0.291055 sec
timing vector::vector(n) + omp parllel set_v0();
n=10000 time: 0.00044 sec
n=100000 time: 0.000183 sec
n=1000000 time: 0.000793 sec
n=10000000 time: 0.00892 sec
n=100000000 time: 0.088051 sec
timing vector::vector() + vector::reserve() + omp parallel set_v0();
n=10000 time: 0.000192 sec
n=100000 time: 0.000202 sec
n=1000000 time: 0.00067 sec
n=10000000 time: 0.008596 sec
n=100000000 time: 0.088045 sec
当作弊版本和合法"版本之间不再有区别时.
when there is no difference any more between the cheating and "legal" versions.
相关文章