Simplest way to read a memory-mapped CSV file?
When I read from files in C++(11) I map them into memory using:
boost::interprocess::file_mapping* fm = new file_mapping(path, boost::interprocess::read_only);
boost::interprocess::mapped_region* region = new mapped_region(*fm, boost::interprocess::read_only);
char* bytes = static_cast<char*>(region->get_address());
Which is fine when I wish to read byte by byte extremely fast. However, I have created a csv file which I would like to map to memory, read each line and split each line on the comma.
Is there a way I can do this with a few modifications of my above code?
(I am mapping to memory because I have an awful lot of memory and I do not want any bottleneck with disk/IO streaming).
Accepted Answer
Here's my take on "fast enough". It zips through 116 MiB of CSV (2.5Mio lines[1]) in ~1 second.
The result is then randomly accessible at zero-copy, so no overhead (unless pages are swapped out).
For comparison:

- it's ~3x faster than a naive

wc csv.txt

takes on the same file

- it's about as fast as the following perl one-liner (which lists the distinct field counts on all lines):

perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields }' csv.txt

- it's only slower than LANG=C wc csv.txt, which avoids locale functionality (by about 1.5x)
Here's the parser in all its glory:
using CsvField = boost::string_ref;
using CsvLine = std::vector<CsvField>;
using CsvFile = std::vector<CsvLine>; // keep it simple :)
struct CsvParser : qi::grammar<char const*, CsvFile()> {
    CsvParser() : CsvParser::base_type(lines)
    {
        using namespace qi;
        using boost::phoenix::construct;
        using boost::phoenix::begin;
        using boost::phoenix::size;

        field = raw [*~char_(",\n")]
                [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
        line  = field % ',';
        lines = line  % eol;
    }

  private:
    qi::rule<char const*, CsvFile()>  lines;
    qi::rule<char const*, CsvLine()>  line;
    qi::rule<char const*, CsvField()> field;
};
The only tricky thing (and the only optimization there) is the semantic action to construct a CsvField from the source iterator with the matched number of characters.
Here's the main:
int main()
{
    boost::iostreams::mapped_file_source csv("csv.txt");

    CsvFile parsed;
    if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
    {
        std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size() << " lines of CSV field values\n";
    }
}
Which prints
116 MiB parsed into 2578421 lines of CSV values
You can use the fields just like std::string:
for (int i = 0; i < 10; ++i)
{
    auto l = rand() % parsed.size();
    auto& line = parsed[l];
    auto c = rand() % line.size();

    std::cout << "Random field at L:" << l << " C:" << c << " " << line[c] << "\n";
}
Printing e.g.:
Random field at L:1979500 C:2 sateen's
Random field at L:928192 C:1 sackcloth's
Random field at L:1570275 C:4 accompanist's
Random field at L:479916 C:2 apparel's
Random field at L:767709 C:0 pinks
Random field at L:1174430 C:4 axioms
Random field at L:1209371 C:4 wants
Random field at L:2183367 C:1 Klondikes
Random field at L:2142220 C:1 Anthony
Random field at L:1680066 C:2 pines
The fully working sample is here, Live On Coliru.
[1] I created the file by repeatedly appending the output of

while read a && read b && read c && read d && read e
do echo "$a,$b,$c,$d,$e"
done < /etc/dictionaries-common/words

to csv.txt, until it counted 2.5 million lines.