Simplest way to read a memory-mapped CSV file?

2021-12-24 00:00:00 csv io c++ boost memory-mapped-files

When I read from files in C++(11) I map them in to memory using:

boost::interprocess::file_mapping* fm = new file_mapping(path, boost::interprocess::read_only);
boost::interprocess::mapped_region* region = new mapped_region(*fm, boost::interprocess::read_only);
char* bytes = static_cast<char*>(region->get_address());

Which is fine when I wish to read byte by byte extremely fast. However, I have created a csv file which I would like to map to memory, read each line and split each line on the comma.

Is there a way I can do this with a few modifications of my above code?

(I am mapping to memory because I have an awful lot of memory and I do not want any bottleneck with disk/IO streaming).

Answer

Here's my take on "fast enough". It zips through 116 MiB of CSV (2.5Mio lines[1]) in ~1 second.

The result is then randomly accessible at zero-copy, so no overhead (unless pages are swapped out).

Comparison:

  • that's ~3x faster than a naive wc csv.txt on the same file
  • it's about as fast as the following perl one-liner (which lists the distinct field counts over all lines):

perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields }' csv.txt

  • it's only slower (by about 1.5x) than LANG=C wc csv.txt, which avoids locale functionality

    Here's the parser in all its glory:

    using CsvField = boost::string_ref;
    using CsvLine  = std::vector<CsvField>;
    using CsvFile  = std::vector<CsvLine>;  // keep it simple :)
    
    struct CsvParser : qi::grammar<char const*, CsvFile()> {
        CsvParser() : CsvParser::base_type(lines)
        {
            using namespace qi;
    
            field = raw [*~char_(",\n")]
                [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
            line  = field % ',';
            lines = line  % eol;
        }
        // member rules: field, line, lines
        qi::rule<char const*, CsvField()> field;
        qi::rule<char const*, CsvLine()>  line;
        qi::rule<char const*, CsvFile()>  lines;
    };
    

    The only tricky thing (and the only optimization there) is the semantic action that constructs a CsvField from the source iterator with the matched number of characters.

    Here's the main program:

    int main()
    {
        boost::iostreams::mapped_file_source csv("csv.txt");
    
        CsvFile parsed;
        if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
        {
            std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size() << " lines of CSV field values\n";
        }
    }
    

    Prints

    116 MiB parsed into 2578421 lines of CSV field values
    

    You can use the fields much like std::string:

    for (int i = 0; i < 10; ++i)
    {
        auto l     = rand() % parsed.size();
        auto& line = parsed[l];
        auto c     = rand() % line.size();
    
        std::cout << "Random field at L:" << l << "\t C:" << c << "\t" << line[c] << "\n";
    }
    

    Prints e.g.:

    Random field at L:1979500    C:2    sateen's
    Random field at L:928192     C:1    sackcloth's
    Random field at L:1570275    C:4    accompanist's
    Random field at L:479916     C:2    apparel's
    Random field at L:767709     C:0    pinks
    Random field at L:1174430    C:4    axioms
    Random field at L:1209371    C:4    wants
    Random field at L:2183367    C:1    Klondikes
    Random field at L:2142220    C:1    Anthony
    Random field at L:1680066    C:2    pines
    

    The fully working sample is here Live On Coliru

    [1] I created the file by repeatedly appending the output of

    while read a && read b && read c && read d && read e
    do echo "$a,$b,$c,$d,$e"
    done < /etc/dictionaries-common/words
    

    to csv.txt, until it counted 2.5 million lines.
