Floating point vs integer calculations on modern hardware
I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.
Now, I remember reading that floating point calculations were very slow around the days of the 386, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculation? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?
I know the correct answer is to benchmark on the target hardware, but what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?
The programs I used are as follows; they are not identical by any means:
Program 1:
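#include <cstdlib>
#include <ctime>
#include <iostream>

int main( )
{
    // The sum reaches roughly 1.8e10, which does not fit in 32 bits.
    // An unsigned accumulator keeps 32-bit integer adds but makes the
    // wrap-around well defined (signed overflow is undefined behaviour).
    unsigned int accum = 0;
    std::srand( static_cast<unsigned int>( std::time( nullptr ) ) );
    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += std::rand( ) % 365;
    }
    std::cout << accum << std::endl;
    return 0;
}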
Program 2:
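#include <cstdlib>
#include <ctime>
#include <iostream>

int main( )
{
    // A float has a 24-bit significand, so once accum passes about 1.6e7
    // each addition is rounded and the printed total is only approximate.
    float accum = 0;
    std::srand( static_cast<unsigned int>( std::time( nullptr ) ) );
    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += static_cast<float>( std::rand( ) % 365 );
    }
    std::cout << accum << std::endl;
    return 0;
}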
Thanks in advance!
The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.
Edit 2 (pasted from a comment below): We currently have an extensive code base. Really, I have come up against the generalization that we "must not use float since integer calculation is faster", and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.
Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).
Recommended answer
Alas, I can only give you an "it depends" answer...
From my experience, there are many, many variables to performance, especially between integer and floating point math. It varies strongly from processor to processor (even within the same family, such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, while others (such as division) take much, much longer.
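To see how much individual operations differ on a given machine, a small timing harness is usually enough. What follows is a minimal sketch, not a rigorous benchmark: `best_time_ms` is a hypothetical helper (it is not from the question) that applies an operation across a buffer several times and keeps the fastest run, which filters out some of the noise you get on a shared or virtual host. It assumes an optimised build, and the compiler may still vectorise or simplify the loop, so it is worth checking the generated assembly before trusting the numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>
#include <vector>

// Hypothetical helper: folds `op` over `data`, repeats the whole pass
// `repeats` times, and returns the fastest pass in milliseconds.
// The final accumulator is written to `sink` so the work has a visible result.
template <typename T, typename Op>
double best_time_ms(const std::vector<T>& data, Op op, int repeats, T& sink) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        T acc = T(1);
        for (T x : data) {
            acc = op(acc, x);
        }
        const auto stop = std::chrono::steady_clock::now();
        sink = acc;
        const std::chrono::duration<double, std::milli> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

Calling it with equally sized buffers of non-zero `std::uint32_t` and `float` values, once with `a + b` and once with `a / b + b`, gives a rough per-operation comparison on the target hardware. Unsigned integers are suggested here so that any wrap-around stays well defined.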
The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation whose data are already in cache will be many times faster than an integer operation whose operand has to be fetched from system memory.
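The effect of data location can be sketched with one routine run over two very different buffers. This is an illustrative assumption about how one might measure it, not a definitive method: `strided_sum` and `time_ms` are hypothetical names, and the buffer sizes and stride suggested below are guesses that would need adjusting to the cache sizes of the actual target CPU.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds up `touches` elements of `data`, advancing
// `stride` elements per step and wrapping around at the end of the buffer.
template <typename T>
T strided_sum(const std::vector<T>& data, std::size_t stride, std::size_t touches) {
    assert(!data.empty());
    assert(stride <= data.size());
    T acc = T(0);
    std::size_t idx = 0;
    for (std::size_t i = 0; i < touches; ++i) {
        acc += data[idx];
        idx += stride;
        if (idx >= data.size()) {
            idx -= data.size();  // safe because stride <= data.size()
        }
    }
    return acc;
}

// Hypothetical helper: wall-clock time of a single call to `f`, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

One way to use it: time `strided_sum` on a few thousand `float` values with a stride of 1 (small enough to stay in cache), then on tens of millions of `std::uint32_t` values with a large odd stride such as 4099, using the same number of touches in both runs. If the answer above holds for your machine, the float pass should come out well ahead despite doing floating point adds.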
I assume that you are asking this question because you are working on a performance-critical application. If you are developing for the x86 architecture and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed that in your second example you used "float" instead of "double", which makes me think you are using single-precision math.)
*Note: Using the old MMX instructions would actually slow down programs, because those instructions use the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.
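As a rough illustration of "the same operation on multiple data at once", the sketch below adds two float arrays four elements at a time with SSE intrinsics from `<xmmintrin.h>`. The function name `add_arrays_sse` is made up for this example, and it assumes an x86 or x86-64 target with SSE enabled (the default on x86-64; 32-bit builds may need a compiler flag). Modern compilers often generate similar code on their own from a plain loop, so treat this as a demonstration of the idea rather than a recommended rewrite.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics, x86 / x86-64 only

// Hypothetical example: a[i] += b[i] for i in [0, n), four floats per step.
void add_arrays_sse(float* a, const float* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));  // 4 additions at once
    }
    for (; i < n; ++i) {
        a[i] += b[i];  // scalar loop for the remaining 0 to 3 elements
    }
}
```

The unaligned load and store variants are used so the function works on any float buffer; aligned variants exist but require 16-byte aligned pointers.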