Using a regex_iterator on an istream
I want to be able to solve problems like this: Getting std::ifstream to handle LF, CR, and CRLF? where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
- Read the istream one character at a time
- Collect the characters
- When a delimiter is hit, return the collection as a token
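For reference, the three steps above might look like the following when written by hand. This is only a minimal sketch: the names tokenize_by_hand and demo_by_hand are hypothetical, and the delimiter rule (a "\n" optionally followed by "\r", or a lone "\r") is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): splits `in` on
// "\n", "\n\r", or a lone "\r", one character at a time.
std::vector<std::string> tokenize_by_hand(std::istream& in) {
    std::vector<std::string> tokens;
    std::string current;
    char c;
    while (in.get(c)) {                      // read one character at a time
        if (c == '\n' || c == '\r') {        // a delimiter was hit
            if (c == '\n' && in.peek() == '\r') {
                in.get();                    // swallow the optional '\r' after '\n'
            }
            tokens.push_back(current);       // return the collection as a token
            current.clear();
        } else {
            current += c;                    // collect the character
        }
    }
    return tokens;  // text after the last delimiter is dropped
}

// Example use on an in-memory stream.
void demo_by_hand() {
    std::istringstream in("A\nB\rC\n\r");
    const std::vector<std::string> tokens = tokenize_by_hand(in);
    assert((tokens == std::vector<std::string>{"A", "B", "C"}));
}
```

Even for this simple delimiter the loop needs a peek and a special case, which is the kind of bookkeeping a regex would hide.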
Regexes are very good at tokenizing strings with complex delimiters:
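const string foo{ "A\nB\rC\n\r" };
// Named so that it outlives the iterators; passing a temporary regex is rejected since C++14
const regex re("(.*)(?:\n\r?|\r)");
vector<string> bar;

// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), re), sregex_iterator(),
          back_inserter(bar), [](const smatch& i) { return i[1].str(); });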
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.
Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?
Recommended answer
This question is about code appearance:
- Since we know that a regex works 1 character at a time, this question asks to use a library to parse the istream 1 character at a time, rather than reading and parsing the istream 1 character at a time in our own code
- Since parsing an istream 1 character at a time still copies that character to a temporary variable (a buffer), this code seeks to avoid doing all of that buffering in our own code, depending on a library to abstract it instead
C++11's regexes use ECMA-262, which does not support lookbehinds (lookaheads are supported): https://stackoverflow.com/a/14539500/2642059 This suggests that a regex could, in principle, match using only an input_iterator_tag, but clearly those implemented in C++11 do not: regex_iterator requires bidirectional iterators.
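The iterator-category mismatch can be checked at compile time. The sketch below uses only standard type traits; is_bidirectional is a hypothetical helper name introduced here for illustration.

```cpp
#include <istream>
#include <iterator>
#include <string>
#include <type_traits>

// Hypothetical helper trait: true when Iter's category is at least bidirectional.
template <class Iter>
using is_bidirectional = std::is_base_of<
    std::bidirectional_iterator_tag,
    typename std::iterator_traits<Iter>::iterator_category>;

// String iterators satisfy what std::regex_iterator asks of its iterator type.
static_assert(is_bidirectional<std::string::const_iterator>::value,
              "string iterators are bidirectional (random access, in fact)");

// Stream iterators are single-pass input iterators, so they do not.
static_assert(!is_bidirectional<std::istreambuf_iterator<char>>::value,
              "istreambuf_iterator is only an input iterator");
static_assert(!is_bidirectional<std::istream_iterator<char>>::value,
              "istream_iterator is only an input iterator");
```

This is why the stream has to be copied into something like a std::string before a standard regex_iterator can walk it.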
boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available among the C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that. On a mismatch due to end of input, the regex reports a partial match, in effect "holding its finger" at that position until more is added to the buffer and the search is rerun. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.
boost::match_partial has 4 drawbacks:
- In the worst case, like "ABC\n", this saves the user no buffer size and they must slurp the whole istream
- If the programmer guesses a buffer size that is too large, that is, one that holds the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
- Any time the buffer size selected is too small, additional computation is required compared to slurping the entire file; this method therefore only excels on a delimiter-dense string
- The inclusion of boost brings bloat with it
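The buffer-size trade-off in these drawbacks can be seen without Boost. The sketch below is not boost::match_partial. It is a standard-library approximation of the same idea, assuming the delimiter is "\n", "\n\r", or "\r": read fixed-size chunks, and defer any match that touches the end of the buffer until more input arrives. The names tokenize_in_chunks, chunk_size, and demo_in_chunks are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `in` while holding only unconsumed text in memory.
std::vector<std::string> tokenize_in_chunks(std::istream& in, std::size_t chunk_size) {
    assert(chunk_size > 0);
    static const std::regex token_re("(.*)(?:\n\r?|\r)");
    std::vector<std::string> tokens;
    std::vector<char> chunk(chunk_size);
    std::string buffer;  // holds text that has been read but not yet tokenized
    bool exhausted = false;
    while (!exhausted) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        buffer.append(chunk.data(), static_cast<std::size_t>(in.gcount()));
        exhausted = !in;  // a short or failed read means no more input is coming
        std::size_t consumed = 0;
        for (std::sregex_iterator it(buffer.cbegin(), buffer.cend(), token_re), end;
             it != end; ++it) {
            const std::smatch& m = *it;
            const std::size_t match_end =
                static_cast<std::size_t>(m[0].second - buffer.cbegin());
            // A match that reaches the end of the buffer may be incomplete
            // ("\n" now, "\r" in the next chunk), so wait for more input.
            if (match_end == buffer.size() && !exhausted) break;
            tokens.push_back(m[1].str());
            consumed = match_end;
        }
        buffer.erase(0, consumed);  // keep only the unmatched tail
    }
    return tokens;  // text after the last delimiter is dropped
}

// Example use: every chunk size yields the same tokens.
void demo_in_chunks() {
    const std::vector<std::string> expected{"A", "B", "C"};
    for (std::size_t size : {std::size_t{1}, std::size_t{2}, std::size_t{64}}) {
        std::istringstream in("A\nB\rC\n\r");
        assert(tokenize_in_chunks(in, size) == expected);
    }
}
```

With a small chunk_size the buffer is rescanned many times, and with no delimiter before the end of the stream the buffer grows to hold everything, which mirrors the third and first drawbacks respectively.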
Circling back to answer the question: a standard library regex_iterator cannot operate on an input_iterator_tag, so slurping of the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.
For the best code appearance, slurping the whole file and running a standard regex_iterator over it is your best bet.
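A minimal sketch of that approach, assuming the same delimiter as in the question ("\n", "\n\r", or "\r"). The names tokenize_slurped and demo_slurped are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: slurps `in` into a string, then tokenizes it with a regex.
std::vector<std::string> tokenize_slurped(std::istream& in) {
    // Slurp: copy every remaining character of the stream into one string.
    const std::string contents(std::istreambuf_iterator<char>{in},
                               std::istreambuf_iterator<char>{});
    // Kept in a named variable so it outlives the iterators below.
    const std::regex token_re("(.*)(?:\n\r?|\r)");
    std::vector<std::string> tokens;
    std::transform(std::sregex_iterator(contents.cbegin(), contents.cend(), token_re),
                   std::sregex_iterator(), std::back_inserter(tokens),
                   [](const std::smatch& m) { return m[1].str(); });
    return tokens;
}

// Example use on an in-memory stream.
void demo_slurped() {
    std::istringstream in("A\nB\rC\n\r");
    const std::vector<std::string> tokens = tokenize_slurped(in);
    assert((tokens == std::vector<std::string>{"A", "B", "C"}));
}
```

The cost is one full copy of the stream's contents, but the tokenizing logic stays a single transform call.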