In terms of performance, is it better to use std::memcpy() or std::copy()?

Is it better to use memcpy as shown below, or is it better to use std::copy(), in terms of performance? Why?
char *bits = NULL;
...
bits = new (std::nothrow) char[((int *) copyMe->bits)[0]];
if (bits == NULL)
{
    cout << "ERROR Not enough memory.\n";
    exit(1);
}
memcpy(bits, copyMe->bits, ((int *) copyMe->bits)[0]);
Accepted Answer
I'm going to go against the general wisdom here that std::copy will have a slight, almost imperceptible performance loss. I just did a test and found that to be untrue: I did notice a performance difference. However, the winner was std::copy.
I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.timer. That 300 loop counter is enough to completely stabilize my results. I ran the test 5 times each, alternating between the memcpy version and the std::copy version. My code takes advantage of grabbing data in as large chunks as possible (many other implementations operate with char / char *, whereas I operate with T / T *, where T is the largest type in the user's implementation that has correct overflow behavior), so fast memory access on the largest types I can is central to the performance of my algorithm. These are my results:
Time (in seconds) to complete run of SHA-2 tests
std::copy memcpy % increase
6.11 6.29 2.86%
6.09 6.28 3.03%
6.10 6.29 3.02%
6.08 6.27 3.03%
6.08 6.27 3.03%
Total average speedup of std::copy over memcpy: 2.99%
My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are -Ofast -march=native -funsafe-loop-optimizations.
My SHA-2 implementation code.
I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, I got results that varied wildly from one run to the next, so I'm guessing there was some sort of OS activity going on. I decided to start over.
Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.
These are my final 10 results:
Time (in seconds) to complete run of MD5 tests
std::copy memcpy % difference
5.52 5.56 +0.72%
5.56 5.55 -0.18%
5.57 5.53 -0.72%
5.57 5.52 -0.91%
5.56 5.57 +0.18%
5.56 5.57 +0.18%
5.56 5.53 -0.54%
5.53 5.57 +0.72%
5.59 5.57 -0.36%
5.57 5.56 -0.18%
Total average slowdown of std::copy over memcpy: 0.11%
My MD5 implementation code.
These results suggest that there is some optimization that std::copy used in my SHA-2 tests that it could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that called std::copy / memcpy. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.
I did a little bit more testing to see what I could do to make std::copy faster again. The answer turned out to be simple: turn on link-time optimization. These are my results with LTO turned on (option -flto in gcc):
Time (in seconds) to complete run of MD5 tests with -flto
std::copy memcpy % difference
5.54 5.57 +0.54%
5.50 5.53 +0.54%
5.54 5.58 +0.72%
5.50 5.57 +1.26%
5.54 5.58 +0.72%
5.54 5.57 +0.54%
5.54 5.56 +0.36%
5.54 5.58 +0.72%
5.51 5.58 +1.25%
5.54 5.57 +0.54%
Total average speedup of std::copy over memcpy: 0.72%
In summary, there does not appear to be a performance penalty for using std::copy. In fact, there appears to be a performance gain.
Explanation of the results
So why might std::copy give a performance boost?
First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. std::copy can (and I suspect all real-world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when memcpy is legal, std::copy should perform no worse. The trivial implementation of std::copy that defers to memcpy should meet your compiler's criteria of "always inline this when optimizing for speed or size".
However, std::copy also keeps more of its information. When you call std::copy, the function keeps the types intact. memcpy operates on void *, which discards almost all useful information. For instance, if I pass in an array of std::uint64_t, the compiler or library implementer may be able to take advantage of 64-bit alignment with std::copy, but it may be more difficult to do so with memcpy. Many implementations of algorithms like this work by first handling the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If the whole range is guaranteed to be aligned, the code becomes simpler and faster, and easier for the branch predictor in your processor to get right.
Premature optimization?
std::copy is in an interesting position. I expect it to never be slower than memcpy, and sometimes faster, with any modern optimizing compiler. Moreover, anything that you can memcpy, you can std::copy. memcpy does not allow any overlap in the buffers, whereas std::copy supports overlap in one direction (with std::copy_backward for the other direction of overlap). memcpy only works on pointers; std::copy works on any iterators (std::map, std::vector, std::deque, or my own custom type). In other words, you should just use std::copy when you need to copy chunks of data around.