Using a regex_iterator on an istream

2022-01-10 00:00:00 regex iterator c++ istream istream-iterator

I want to be able to solve problems like this one: "Getting std::ifstream to handle LF, CR, and CRLF?", where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:

  1. Read the istream one character at a time
  2. Collect the characters
  3. When a delimiter is hit, return the collection as a token

Regexes are very good at tokenizing strings with complex delimiters:

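// Needs <algorithm>, <iterator>, <regex>, <string>, <vector>; assumes using namespace std
string foo{ "A\nB\rC\n\r" };
vector<string> bar;
const regex re{ "(.*)(?:\n\r?|\r)" }; // named object: regex_iterator keeps a pointer to it, so a temporary would dangle

// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), re), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });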

But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.

Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?

Recommended answer

This question is about code appearance:

  1. Since we know that a regex works 1 character at a time, this question is asking to use a library to parse the istream 1 character at a time, rather than reading and parsing the istream 1 character at a time in the program's own code
  2. Since parsing an istream 1 character at a time will still copy each character to a temporary variable (a buffer), this code seeks to avoid doing all of that buffering itself, depending on a library to abstract it instead


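C++11's regexes use a modified ECMA-262 grammar, which supports lookaheads but not lookbehinds: https://stackoverflow.com/a/14539500/2642059 Since a pattern never has to look behind the current position, a regex could in principle match using only an input_iterator_tag, but clearly those implemented in C++11 do not: std::regex_iterator requires bidirectional iterators.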
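The iterator-category mismatch can be checked at compile time. The sketch below is illustrative only; the alias names `stream_category` and `string_category` are made up for this example.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// Category of the iterators a stream buffer hands out.
using stream_category =
    std::iterator_traits<std::istreambuf_iterator<char>>::iterator_category;

// Category of the iterators std::sregex_iterator works with.
using string_category =
    std::iterator_traits<std::string::const_iterator>::iterator_category;

// Stream iterators are single-pass input iterators...
static_assert(std::is_same<stream_category, std::input_iterator_tag>::value,
              "istreambuf_iterator is only an input iterator");

// ...while string iterators are at least bidirectional.
static_assert(std::is_base_of<std::bidirectional_iterator_tag, string_category>::value,
              "string iterators are at least bidirectional");
```

Both assertions hold, which is why a stream iterator cannot be handed to std::regex_iterator directly.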

boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available in C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input, the regex will "hold its finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.

boost::match_partial has 4 drawbacks:

  1. In the worst case, like "ABC\n", this saves the user no buffer space and they must slurp the whole istream
  2. If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
  3. Any time the buffer size selected is too small, additional computations will be required compared to slurping the entire file, so this method excels on a delimiter-dense string
  4. The inclusion of Boost always causes bloat
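The buffer-size trade-offs in the list above can be seen in a standard-library-only imitation of chunked matching. To be clear, this is not boost::match_partial and makes no claim about how Boost works internally; the function name `chunked_tokenize` and the `chunk` parameter are hypothetical, and the sketch simply refuses to accept a match that touches the end of the buffer while more input may still arrive.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (NOT boost::match_partial): reads `chunk` characters at
// a time and only accepts a match that more input could not change.
// Precondition: chunk > 0.
std::vector<std::string> chunked_tokenize(std::istream& in, std::size_t chunk) {
    const std::regex delimited("(.*)(?:\n\r?|\r)");
    std::vector<std::string> tokens;
    std::string buffer;               // unconsumed input carried between reads
    std::string block(chunk, '\0');   // scratch space for one read
    bool more = true;
    while (more) {
        in.read(&block[0], static_cast<std::streamsize>(chunk));
        buffer.append(block, 0, static_cast<std::size_t>(in.gcount()));
        more = static_cast<bool>(in); // false once the stream is exhausted
        std::smatch m;
        // A match that ends exactly at the end of the buffer might be extended
        // by the next chunk (for example "\n" followed by "\r"), so it is only
        // accepted when something follows it or no more input is coming.
        while (std::regex_search(buffer, m, delimited) &&
               (!more || m.suffix().length() > 0)) {
            tokens.push_back(m[1].str());
            buffer.erase(0, static_cast<std::size_t>(m.position(0) + m.length(0)));
        }
    }
    return tokens;
}
```

With an input such as "ABC\n" the buffer grows until it holds the whole stream (drawback 1), and a small `chunk` repeats the search over the same unconsumed characters many times (drawback 3).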

Circling back to answer the question: a standard library regex_iterator cannot operate on an input_iterator_tag, so slurping the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance, though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.

For the best code appearance, slurping the whole file and running a standard regex_iterator over it is your best bet.
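A minimal sketch of that slurp-then-match approach, assuming the delimiter regex from the question; the function name `slurp_and_tokenize` is hypothetical.

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: copies the rest of the stream into a string, then
// tokenizes that copy with a standard regex_iterator.
std::vector<std::string> slurp_and_tokenize(std::istream& in) {
    // Slurp: read every remaining character into one string.
    const std::string text{std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>()};

    // Keep the regex in a named object; regex_iterator stores a pointer to it.
    const std::regex delimited("(.*)(?:\n\r?|\r)");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.cbegin(), text.cend(), delimited), end;
         it != end; ++it) {
        tokens.push_back((*it)[1].str()); // capture group 1 is the token
    }
    return tokens;
}
```

The same function works for a std::ifstream or a std::istringstream, since it only relies on the std::istream interface.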
