如何在 C++ 中快速解析空格分隔的浮点数?

2021-12-12 00:00:00 parsing c++ boost-spirit

我有一个包含数百万行的文件,每行有 3 个以空格分隔的浮点数.读取文件需要很多时间,所以我尝试使用内存映射文件读取它们才发现问题不是IO的速度而是解析的速度.

我目前的解析是取流(称为文件)并执行以下操作

float x,y,z;文件 >>x>>y>>z;

Stack Overflow 中有人推荐使用 Boost.Spirit,但我找不到任何简单的教程来解释如何使用它.

我试图找到一种简单有效的方法来解析如下所示的行:

"134.32 3545.87 3425"

我会非常感谢您的帮助.我想用 strtok 来分割它,但我不知道如何将字符串转换为浮点数,我不太确定这是最好的方法.

我不介意解决方案是否为 Boost.我不介意它是否不是有史以来最有效的解决方案,但我确信可以将速度提高一倍.

提前致谢.

解决方案

如果转换是瓶颈(这很有可能),你应该从使用不同的可能性开始标准.从逻辑上讲,人们会期望它们非常接近,但实际上,它们并不总是:

  • 您已经确定 std::ifstream 太慢了.

  • 将内存映射数据转换为 std::istringstream几乎肯定不是一个好的解决方案;你首先必须创建一个字符串,它将复制所有数据.

  • 自己写streambuf直接从内存中读取,不复制(或使用已弃用的 std::istrstream)可能是一个解决方案,尽管如果问题真的是转换...这仍然使用相同的转换例程.

  • 您可以随时在映射的内存上尝试 fscanfscanf溪流.根据实现,它们可能会更快比各种 istream 实现.

  • 使用 strtod 可能比其中任何一个都快.没必要为此进行标记:strtod 跳过前导空格(包括 ' '),并有一个 out 参数,它把未读取的第一个字符的地址.结束条件是有点棘手,你的循环应该看起来有点像:

<前>字符*开始;//设置为指向 mmap 的数据...//你还必须安排一个 ''//跟踪数据.这大概是//最难的问题.字符*结束;错误号 = 0;double tmp = strtod( begin, &end );while ( 错误号 == 0 && 结束 != 开始 ) {//用 tmp 做任何事...开始=结束;tmp = strtod( 开始, &end );}

如果这些都不够快,您将不得不考虑实际数据.它可能有某种额外的约束,这意味着您可以编写一个比更一般的转换程序更快的转换程序;例如strtod 必须同时处理固定的和科学的,并且它即使有 17 位有效数字,也必须 100% 准确.它也必须是特定于语言环境的.所有这些都是添加的复杂性,这意味着添加要执行的代码.但要注意:编写有效且正确的转换例程,即使对于一组受限制的输入,是非平凡的;你真的必须知道你在做什么.

出于好奇,我进行了一些测试.除了前面提到的解决方案,我写了一个简单的自定义转换器,只处理定点(不科学),最多小数点后五位,小数点前的值必须适合 int:

double转换(字符常量*源,字符常量** endPtr){字符*结束;int left = strtol( source, &end, 10 );双结果 = 左;if ( *end == '.' ) {字符 * 开始 = 结束 + 1;int right = strtol( start, &end, 10 );静态双常量 fracMult[]= { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };结果 += 右 * fracMult[ 结束 - 开始 ];}如果( endPtr != nullptr ){*endPtr = 结束;}返回结果;}

(如果你真的使用这个,你肯定应该添加一些错误处理.这只是为了实验而被迅速敲响目的,读取我生成的测试文件,什么都没有否则.)

界面与strtod完全相同,以简化编码.

我在两种环境(在不同的机器上,所以任何时间的绝对值都不相关).我拿到以下结果:

在 Windows 7 下,用 VC 11 (/O2) 编译:

直接使用 fstream 进行测试(5 次迭代)...每次迭代 6.3528e+006 微秒直接使用 fscan 进行测试(5 次迭代)...每次迭代 685800 微秒使用 strtod 进行测试(5 次迭代)...每次迭代 597000 微秒使用手动测试(5 次迭代)...每次迭代 269600 微秒

Linux 2.6.18 下,使用 g++ 4.4.2 (-O2, IIRC) 编译:

直接使用 fstream 进行测试(5 次迭代)...每次迭代 784000 微秒直接使用 fscanf 进行测试(5 次迭代)...每次迭代 526000 微秒使用 strtod 进行测试(5 次迭代)...每次迭代 382000 微秒使用 strtof 进行测试(5 次迭代)...每次迭代 360000 微秒使用手动测试(5 次迭代)...每次迭代 186000 微秒

在所有情况下,我正在阅读 554000 行,每行随机 3在 [0...10000) 范围内生成浮点数.

最引人注目的是两者之间的巨大差异fstreamfscan 在 Windows 下(以及相对较小的fscanstrtod 之间的区别).第二件事是简单的自定义转换函数有多少收益,在两个平台.必要的错误处理会减慢它的速度有一点,但差异仍然很大.我期望一些改进,因为它不能处理很多事情标准转换例程(如科学格式,非常非常小的数字,Inf 和 NaN,i18n 等),但不是这个很多.

I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.

My current parsing is to take the stream (called file) and do the following

float x,y,z;
file >> x >> y >> z;

Someone in Stack Overflow recommended to use Boost.Spirit but I couldn't find any simple tutorial to explain how to use it.

I'm trying to find a simple and efficient way to parse a line that looks like this:

"134.32 3545.87 3425"

I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.

I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.

Thanks in advance.

解决方案

If the conversion is the bottle neck (which is quite possible), you should start by using the different possiblities in the standard. Logically, one would expect them to be very close, but practically, they aren't always:

  • You've already determined that std::ifstream is too slow.

  • Converting your memory mapped data to an std::istringstream is almost certainly not a good solution; you'll first have to create a string, which will copy all of the data.

  • Writing your own streambuf to read directly from the memory, without copying (or using the deprecated std::istrstream) might be a solution, although if the problem really is the conversion... this still uses the same conversion routines.

  • You can always try fscanf, or scanf on your memory mapped stream. Depending on the implementation, they might be faster than the various istream implementations.

  • Probably faster than any of these is to use strtod. No need to tokenize for this: strtod skips leading white space (including ' '), and has an out parameter where it puts the address of the first character not read. The end condition is a bit tricky, your loop should probably look a bit like:

    char* begin;    //  Set to point to the mmap'ed data...
                    //  You'll also have to arrange for a ''
                    //  to follow the data.  This is probably
                    //  the most difficult issue.
    char* end;
    errno = 0;
    double tmp = strtod( begin, &end );
    while ( errno == 0 && end != begin ) {
        //  do whatever with tmp...
        begin = end;
        tmp = strtod( begin, &end );
    }

If none of these are fast enough, you'll have to consider the actual data. It probably has some sort of additional constraints, which means that you can potentially write a conversion routine which is faster than the more general ones; e.g. strtod has to handle both fixed and scientific, and it has to be 100% accurate even if there are 17 significant digits. It also has to be locale specific. All of this is added complexity, which means added code to execute. But beware: writing an efficient and correct conversion routine, even for a restricted set of input, is non-trivial; you really do have to know what you are doing.

EDIT:

Just out of curiosity, I've run some tests. In addition to the afore mentioned solutions, I wrote a simple custom converter, which only handles fixed point (no scientific), with at most five digits after the decimal, and the value before the decimal must fit in an int:

double
convert( char const* source, char const** endPtr )
{
    char* end;
    int left = strtol( source, &end, 10 );
    double results = left;
    if ( *end == '.' ) {
        char* start = end + 1;
        int right = strtol( start, &end, 10 );
        static double const fracMult[] 
            = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };
        results += right * fracMult[ end - start ];
    }
    if ( endPtr != nullptr ) {
        *endPtr = end;
    }
    return results;
}

(If you actually use this, you should definitely add some error handling. This was just knocked up quickly for experimental purposes, to read the test file I'd generated, and nothing else.)

The interface is exactly that of strtod, to simplify coding.

I ran the benchmarks in two environments (on different machines, so the absolute values of any times aren't relevant). I got the following results:

Under Windows 7, compiled with VC 11 (/O2):

Testing Using fstream directly (5 iterations)...
    6.3528e+006 microseconds per iteration
Testing Using fscan directly (5 iterations)...
    685800 microseconds per iteration
Testing Using strtod (5 iterations)...
    597000 microseconds per iteration
Testing Using manual (5 iterations)...
    269600 microseconds per iteration

Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):

Testing Using fstream directly (5 iterations)...
    784000 microseconds per iteration
Testing Using fscanf directly (5 iterations)...
    526000 microseconds per iteration
Testing Using strtod (5 iterations)...
    382000 microseconds per iteration
Testing Using strtof (5 iterations)...
    360000 microseconds per iteration
Testing Using manual (5 iterations)...
    186000 microseconds per iteration

In all cases, I'm reading 554000 lines, each with 3 randomly generated floating point in the range [0...10000).

The most striking thing is the enormous difference between fstream and fscan under Windows (and the relatively small difference between fscan and strtod). The second thing is just how much the simple custom conversion function gains, on both platforms. The necessary error handling would slow it down a little, but the difference is still significant. I expected some improvement, since it doesn't handle a lot of things the the standard conversion routines do (like scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not this much.

相关文章