为什么在 C++ 中拆分字符串比 Python 慢?

2021-12-28 00:00:00 python string split benchmarking c++

我正在尝试将一些代码从 Python 转换为 C++，以提高速度并提高我生疏的 C++ 技能.昨天，当 Python 中从标准输入读取行的幼稚实现比 C++ 快得多时，我感到震惊(参见这个).今天，我终于想出了如何在 C++ 中使用合并分隔符(类似于 python 的 split() 语义)拆分字符串，现在我正在体验似曾相识！我的 C++ 代码需要更长的时间来完成这项工作(尽管不像昨天的课程那样多一个数量级).

I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).

Python 代码:

#!/usr/bin/env python from __future__ import print_function import time import sys count = 0 start_time = time.time() dummy = None for line in sys.stdin: dummy = line.split() count += 1 delta_sec = int(time.time() - start_time) print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='') if delta_sec > 0: lps = int(count/delta_sec) print(" Crunch Speed: {0}".format(lps)) else: print('')

C++ 代码:

#include <iostream> #include <string> #include <sstream> #include <time.h> #include <vector> using namespace std; void split1(vector<string> &tokens, const string &str, const string &delimiters = " ") { // Skip delimiters at beginning string::size_type lastPos = str.find_first_not_of(delimiters, 0); // Find first non-delimiter string::size_type pos = str.find_first_of(delimiters, lastPos); while (string::npos != pos || string::npos != lastPos) { // Found a token, add it to the vector tokens.push_back(str.substr(lastPos, pos - lastPos)); // Skip delimiters lastPos = str.find_first_not_of(delimiters, pos); // Find next non-delimiter pos = str.find_first_of(delimiters, lastPos); } } void split2(vector<string> &tokens, const string &str, char delim=' ') { stringstream ss(str); //convert string to stream string item; while(getline(ss, item, delim)) { tokens.push_back(item); //add token to vector } } int main() { string input_line; vector<string> spline; long count = 0; int sec, lps; time_t start = time(NULL); cin.sync_with_stdio(false); //disable synchronous IO while(cin) { getline(cin, input_line); spline.clear(); //empty the vector for the next line to parse //I'm trying one of the two implementations, per compilation, obviously: // split1(spline, input_line); split2(spline, input_line); count++; }; count--; //subtract for final over-read sec = (int) time(NULL) - start; cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ; if (sec > 0) { lps = count / sec; cerr << " Crunch speed: " << lps << endl; } else cerr << endl; return 0; //compiled with: g++ -Wall -O3 -o split1 split_1.cpp

请注意，我尝试了两种不同的拆分实现.一个 (split1) 使用字符串方法来搜索令牌，并且能够合并多个令牌以及处理多个令牌(它来自此处).第二个 (split2) 使用 getline 将字符串作为流读取，不合并分隔符，并且仅支持单个分隔符(该字符由多个 StackOverflow 用户在回答字符串拆分问题时发布).

Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimeter character (that one was posted by several StackOverflow users in answers to string splitting questions).

我以各种顺序运行了多次.我的测试机是 Macbook Pro(2011，8GB，四核)，这并不重要.我正在测试一个 20M 行的文本文件，其中包含三个以空格分隔的列，每个列看起来都类似于:foo.bar 127.0.0.1 home.foo.bar"

I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"

结果:

$ /usr/bin/time cat test_lines_double | ./split.py 15.61 real 0.01 user 0.38 sys Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333 $ /usr/bin/time cat test_lines_double | ./split1 23.50 real 0.01 user 0.46 sys C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565 $ /usr/bin/time cat test_lines_double | ./split2 44.69 real 0.02 user 0.62 sys C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444

我做错了什么?有没有更好的方法在 C++ 中进行字符串拆分，它不依赖于外部库(即没有 boost)，支持合并分隔符序列(如 python 的拆分)，是线程安全的(所以没有 strtok)，并且其性能至少是与python相提并论?

What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?

编辑 1/部分解决方案?:

我尝试通过让 python 重置虚拟列表并每次附加到它来使它更公平，就像 C++ 一样.这仍然不是 C++ 代码所做的，但它更接近一些.基本上，循环现在是:

I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:

for line in sys.stdin: dummy = [] dummy += line.split() count += 1

python 的性能现在与 split1 C++ 实现大致相同.

The performance of python is now about the same as the split1 C++ implementation.

/usr/bin/time cat test_lines_double | ./split5.py 22.61 real 0.01 user 0.40 sys Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090

我仍然感到惊讶的是，即使 Python 针对字符串处理进行了如此优化(正如 Matt Joiner 建议的那样)，这些 C++ 实现也不会更快.如果有人对如何使用 C++ 以更优化的方式执行此操作有任何想法，请分享您的代码.(我认为我的下一步将尝试在纯 C 中实现这一点，尽管我不会牺牲程序员的生产力来重新实现我在 C 中的整个项目，所以这只是字符串拆分速度的实验.)

I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)

感谢大家的帮助.

最终编辑/解决方案:

请参阅 Alf 接受的答案.由于 python 严格通过引用处理字符串并且 STL 字符串经常被复制，因此使用 vanilla python 实现的性能更好.为了进行比较，我通过 Alf 的代码编译并运行了我的数据，这是与所有其他运行在同一台机器上的性能，基本上与朴素的 python 实现相同(尽管比重置/附加列表的 python 实现快，如显示在上面的编辑中):

Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):

$ /usr/bin/time cat test_lines_double | ./split6 15.09 real 0.01 user 0.45 sys C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333

我唯一的小抱怨是关于在这种情况下让 C++ 执行所需的代码量.

My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.

从这个问题和昨天的标准输入行阅读问题(上面链接)中得到的教训之一是，人们应该始终进行基准测试，而不是对语言的相对默认"性能做出幼稚的假设.我很欣赏这种教育.

One of the lessons here from this issue and yesterday's stdin line reading issue (linked above) are that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.

再次感谢大家的建议！

推荐答案

猜测，Python 字符串是引用计数的不可变字符串，因此在 Python 代码中不会复制字符串，而 C++ std::string 是可变值类型，在最小的机会被复制.

As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

如果目标是快速拆分，那么可以使用恒定时间子字符串操作，这意味着只引用原始字符串的一部分，就像在 Python(以及 Java 和 C#…)中一样.

If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

C++ std::string 类有一个补偿特性，不过:它是标准，因此它可以用于在效率高的地方安全且可移植地传递字符串不是主要考虑因素.但足够的聊天.代码――在我的机器上，这当然比 Python 快，因为 Python 的字符串处理是用 C 实现的，C 是 C++ 的一个子集(嘿嘿):

The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):

#include <iostream> #include <string> #include <sstream> #include <time.h> #include <vector> using namespace std; class StringRef { private: char const* begin_; int size_; public: int size() const { return size_; } char const* begin() const { return begin_; } char const* end() const { return begin_ + size_; } StringRef( char const* const begin, int const size ) : begin_( begin ) , size_( size ) {} }; vector<StringRef> split3( string const& str, char delimiter = ' ' ) { vector<StringRef> result; enum State { inSpace, inToken }; State state = inSpace; char const* pTokenBegin = 0; // Init to satisfy compiler. for( auto it = str.begin(); it != str.end(); ++it ) { State const newState = (*it == delimiter? inSpace : inToken); if( newState != state ) { switch( newState ) { case inSpace: result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) ); break; case inToken: pTokenBegin = &*it; } } state = newState; } if( state == inToken ) { result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) ); } return result; } int main() { string input_line; vector<string> spline; long count = 0; int sec, lps; time_t start = time(NULL); cin.sync_with_stdio(false); //disable synchronous IO while(cin) { getline(cin, input_line); //spline.clear(); //empty the vector for the next line to parse //I'm trying one of the two implementations, per compilation, obviously: // split1(spline, input_line); //split2(spline, input_line); vector<StringRef> const v = split3( input_line ); count++; }; count--; //subtract for final over-read sec = (int) time(NULL) - start; cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ; if (sec > 0) { lps = count / sec; cerr << " Crunch speed: " << lps << endl; } else cerr << endl; return 0; } //compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

免责声明:我希望没有任何错误.我还没有测试功能，但只检查了速度.但我认为，即使有一两个错误，纠正也不会显着影响速度.

Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.

相关文章