Fastest way to subset - data.table vs. MySQL

2022-01-13 r data.table mysql rmysql

I'm an R user, and I frequently find that I need to write functions that require subsetting large datasets (tens of millions of rows). When I apply such functions over a large number of observations, things can get very time-consuming if I'm not careful about how I implement them.

To do this, I have sometimes used the data.table package, which provides much faster subsetting than data frames. More recently, I've started experimenting with packages like RMySQL, pushing some tables to MySQL and using the package to run SQL queries and return the results.
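
As a rough sketch of what the two approaches look like (the table, column, and connection details below are just placeholders for illustration):

```r
library(data.table)
library(RMySQL)

## Example data: a few million rows keyed by an integer id (names are hypothetical)
dt <- data.table(id = sample(1e5L, 5e6, replace = TRUE), value = rnorm(5e6))

## Approach 1: data.table with a key on the subsetting column
setkey(dt, id)          # sorts by id so subsets on id can use a binary search
res_dt <- dt[J(42L)]    # all rows where id == 42

## Approach 2: push the table to MySQL once, then query it with RMySQL
con <- dbConnect(MySQL(), dbname = "test", user = "user", password = "pass")
dbWriteTable(con, "big_table", as.data.frame(dt), overwrite = TRUE, row.names = FALSE)
res_sql <- dbGetQuery(con, "SELECT * FROM big_table WHERE id = 42")
dbDisconnect(con)
```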

I have found mixed performance improvements. For smaller datasets (millions of rows), loading the data into a data.table and setting the right keys seems to make for faster subsetting. For larger datasets (tens to hundreds of millions of rows), sending a query out to MySQL appears to be faster.

I was wondering whether anyone has insight into which technique should return simple subsetting or aggregation queries faster, and whether the answer should depend on the size of the data. I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.
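
To make the key/index analogy concrete, this is roughly how the same grouped aggregation would look in each system (again just a sketch with made-up names):

```r
library(data.table)

dt <- data.table(id = sample(1e5L, 5e6, replace = TRUE), value = rnorm(5e6))
setkey(dt, id)    # broadly comparable to CREATE INDEX idx_id ON big_table (id)

## Grouped aggregation in data.table ...
agg_dt <- dt[, .(total = sum(value)), by = id]

## ... and the equivalent SQL sent through RMySQL (assuming 'big_table' already exists):
## agg_sql <- dbGetQuery(con, "SELECT id, SUM(value) AS total FROM big_table GROUP BY id")
```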

Recommended answer

If the data fits in RAM, data.table is faster. If you provide an example it will probably become evident, quickly, that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?

SQL has a lower bound because it is a row store. If the data fits in RAM (and 64-bit gives you quite a bit), then data.table is faster, not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 for column operations). Used correctly, data.table should be faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower results with data.table, then chances are very high that you're using it incorrectly (or there's a performance bug that we need to fix). So please post some tests, after reading the data.table wiki.
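
For instance, one common way to use data.table badly is to subset a keyed table with a vector scan instead of a keyed binary search; a minimal sketch of the difference (column names are hypothetical):

```r
library(data.table)

dt <- data.table(id = sample(1e5L, 1e7, replace = TRUE), value = rnorm(1e7))
setkey(dt, id)

## Vector scan: compares every value in the id column, even though the table is keyed
system.time(slow <- dt[id == 42L])

## Keyed subset: uses the key set above to do a binary search instead
system.time(fast <- dt[J(42L)])

identical(slow$value, fast$value)   # same rows, very different costs on large tables
```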
