性能的 C++ 模板?
我在网上多次看到有人提到使用模板可以使 C++ 更快.
I have seen online a few times it has been mentioned that C++ can be ever faster using templates.
有人可以解释一下,包括在低层次上为什么会这样吗?我一直认为这样一个不错"的功能会像大多数有用的概念一样产生开销.
Could someone explain, including at a low level why this is exactly? I always presumed such a "nice" feature would have overhead like most useful concepts.
从超低延迟的角度来看,我真的对此很感兴趣!
I am really intrigued by this from a ultra low latency perspective!
推荐答案
一个常见的例子是排序.
A common example is sorting.
在 C 中,qsort
接受一个指向比较函数的指针.一般情况下,qsort
代码都会有一份,不内联.它将通过指向比较例程的指针进行调用――这当然也不是内联的.
In C, qsort
takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort
code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.
在 C++ 中,std::sort
是一个模板,它可以将一个函子对象作为比较器.对于用作比较器的每种不同类型,都有不同的 std::sort
副本.假设您使用带有重载 operator()
的函子类,那么对比较器的调用 可以很容易地内联到 std::sort
的这个副本中>.
In C++, std::sort
is a template, and it can take a functor object as comparator. There is a different copy of std::sort
for each different type used as a comparator. Assuming you use a functor class with overloaded operator()
, then the call to the comparator can easily be inlined into this copy of std::sort
.
因此,模板为您提供了更多的内联,因为 sort
代码的副本更多,每个副本都可以内联不同的比较器.内联是一种很好的优化,排序例程会进行大量比较,因此您通常可以测量 std::sort
比等效的 qsort
运行得更快.这样做的代价是可能会产生更大的代码――如果您的程序使用很多不同的比较器,那么您会得到很多不同的排序例程副本,每个副本都有不同的比较器.
So, templates give you more inlining because there are more copies of the sort
code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort
running faster than an equivalent qsort
. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.
原则上,C 实现没有理由不能将 qsort
内联到它被调用的地方.然后,如果使用函数名调用它,优化器理论上可以观察到,在使用它时,函数指针必须仍然指向同一个函数.然后它可以内联对函数的调用,结果将类似于 std::sort
的结果.但在实践中,编译器往往不会采取第一步,即内联 qsort
.那是因为 (a) 它很大,并且 (b) 它位于不同的翻译单元中,通常编译到您的程序所链接的某个库中,并且 (c) 以这种方式执行此操作,您将拥有 qsort
用于每次调用它,而不仅仅是每个不同比较器的副本.所以它会比 C++ 更加臃肿,除非实现也可以找到一种方法来在 qsort
在不同地方使用相同的比较器调用的情况下通用代码.
In principle there's no reason why a C implementation can't inline qsort
into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort
. But in practice, compilers tend not to take the first step, inlining qsort
. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort
for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort
is called in different places with the same comparator.
因此,像 C 中的 qsort
这样的通用函数往往会因为通过函数指针或其他间接[*] 进行调用而产生一些开销.C++ 中的模板是保持源代码通用但确保它编译为专用函数(或多个此类函数)的常用方法.希望专用代码更快.
So, general-purpose functions like qsort
in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.
值得注意的是,模板绝不只是为了性能.在某些方面,std::sort
本身比 qsort
更通用.例如 qsort
只对数组进行排序,而 std::sort
可以对任何提供随机访问迭代器的东西进行排序.例如,它可以对 deque
进行排序,其背后是几个单独分配的不相交数组.因此,模板的使用不一定提供任何性能优势,可能是出于其他原因.碰巧模板确实会影响性能.
It's worth noting that templates are not by any means just about performance. std::sort
is itself more general-purpose than qsort
in some ways. For example qsort
only sorts arrays, whereas std::sort
can sort anything that provides a random-access iterator. It can for example sort a deque
, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.
[*] 另一个排序示例 - qsort
接受一个整数参数,表示数组的每个元素有多大,当它移动元素时,它必须调用 memcpy
或与此变量的值类似.std::sort
在编译时知道元素的确切类型,因此知道确切的大小.它可以内联一个复制构造函数调用,该调用又可能转换为复制该字节数的指令.与内联比较器一样,通常可以比调用复制可变字节数的例程更快地复制 4(或 8、16 或其他)字节,并将值 4(或 8,或 16,或其他).和以前一样,如果您使用大小的文字值调用 qsort
,并且对 qsort
的调用是内联的,那么编译器可以在 C 中执行完全相同的优化.但是在实践中你看不到这一点.
[*] another example with sorting - qsort
takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy
or similar with the value of this variable. std::sort
knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort
with a literal value for the size, and that call to qsort
was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.
相关文章