mmap() vs. reading blocks
I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
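For concreteness, here is a minimal sketch of that buffered fstream loop. The record format (a 4-byte length prefix followed by the payload), the 64 MiB block size, and the parse_record helper are all placeholders for illustration, not details from the question; the point is the handling of a record that straddles two blocks.

#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

// Hypothetical record format for this sketch: a 4-byte length prefix followed
// by that many payload bytes. Returns bytes consumed, or 0 if the record is
// not yet complete in the buffer.
std::size_t parse_record(const char* data, std::size_t avail)
{
    if (avail < 4)
        return 0;
    std::uint32_t len;
    std::memcpy(&len, data, 4);
    if (avail < 4 + static_cast<std::size_t>(len))
        return 0;
    // ...process the payload at data + 4, length len...
    return 4 + len;
}

void scan_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(64 * 1024 * 1024);   // placeholder block size
    std::size_t leftover = 0;                  // bytes of a partial record kept

    while (in) {
        in.read(buf.data() + leftover,
                static_cast<std::streamsize>(buf.size() - leftover));
        std::size_t avail = leftover + static_cast<std::size_t>(in.gcount());
        if (avail == 0)
            break;

        std::size_t pos = 0;
        while (pos < avail) {
            std::size_t used = parse_record(buf.data() + pos, avail - pos);
            if (used == 0)                     // record continues in the next block
                break;
            pos += used;
        }

        leftover = avail - pos;                // move the tail to the front
        std::memmove(buf.data(), buf.data() + pos, leftover);
    }
}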
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.
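For reference, the page-alignment constraint only applies to the mapping offset, not to where a record starts inside the mapping: you can round the record's byte offset down to a page boundary and index into the map with the remainder, and a record that crosses a page boundary is still contiguous in the mapped address space. A hedged sketch follows; map_window, the window size, and the error handling are illustrative, not part of the question.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Map a window of the file that covers a record starting at an arbitrary byte
// offset. Only the mmap offset must be page aligned; the returned pointer
// points at the record itself, which may begin mid-page.
const char* map_window(int fd, std::uint64_t record_offset, std::size_t window_size,
                       void** map_base, std::size_t* map_len)
{
    const long page = sysconf(_SC_PAGESIZE);
    const std::uint64_t aligned =
        record_offset & ~static_cast<std::uint64_t>(page - 1);
    const std::size_t skew = static_cast<std::size_t>(record_offset - aligned);

    *map_len = skew + window_size;
    *map_base = mmap(nullptr, *map_len, PROT_READ, MAP_PRIVATE, fd,
                     static_cast<off_t>(aligned));
    if (*map_base == MAP_FAILED)
        return nullptr;
    return static_cast<const char*>(*map_base) + skew;   // points at the record
}

// Usage sketch (file name and offsets are placeholders):
//   int fd = open("big.dat", O_RDONLY);
//   void* base; std::size_t len;
//   const char* rec = map_window(fd, some_offset, 1 << 20, &base, &len);
//   ...parse the record at rec...
//   munmap(base, len);
//   close(fd);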
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?
Answer
I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.
- A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors, for the same reasons that switching between different processes is expensive.
- The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.
However,
- Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable (see the sketch after this list).
- Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache, and this kind of foolery rarely helps system performance.)
- Reading a file directly is very simple and fast.
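A minimal sketch of the random-access case from the first point above, assuming a 64-bit address space so the whole file can be mapped at once; madvise(MADV_RANDOM) is only a hint to suppress readahead, and the file name is a placeholder.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd = open("big.dat", O_RDONLY);        // placeholder file name
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;

    // Map the whole file once; pages are faulted in lazily as they are touched.
    void* base = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ,
                      MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    // Tell the kernel the access pattern is random, so it skips readahead.
    madvise(base, static_cast<size_t>(st.st_size), MADV_RANDOM);

    const char* data = static_cast<const char*>(base);
    // ...jump around in data[0 .. st.st_size) as the record index dictates...
    (void)data;

    munmap(base, static_cast<size_t>(st.st_size));
    close(fd);
    return 0;
}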
The discussion of mmap/read reminds me of two other performance discussions:
Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
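As a hedged illustration of what "actual sharing" can look like (not from the original answer): a file mapped with MAP_SHARED before fork() gives parent and child the same pages, so a write by one is visible to the other and eventually reaches the file. The file name is a placeholder and is assumed to already be at least one page long.

#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("shared.dat", O_RDWR);       // placeholder, >= one page long
    if (fd < 0)
        return 1;

    char* data = static_cast<char*>(
        mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (data == MAP_FAILED)
        return 1;

    if (fork() == 0) {                         // child: modify the shared page
        data[0] = 'X';
        _exit(0);
    }
    wait(nullptr);                             // parent: sees the child's write
    std::printf("first byte is now: %c\n", data[0]);

    munmap(data, 4096);
    close(fd);
    return 0;
}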
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)