Why were pandas merges in Python faster than data.table merges in R in 2012?

2022-01-13 00:00:00 python pandas join r data.table

Problem description

I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?

Here's the R code and the Python code used to benchmark the various packages.

Solution

It looks like Wes may have discovered a known issue in data.table that shows up when the number of unique strings (levels) is large: more than 10,000.

Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.

Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table(), but that code isn't hooked up yet to replace the levels-to-levels match.

Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.

Also, data.table has time series merge in mind. There are two aspects to that: i) multi-column ordered keys such as (id,datetime), and ii) fast prevailing join (roll=TRUE), a.k.a. last observation carried forward.
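To make the "prevailing join" idea concrete, here is a minimal sketch in plain Python. It is not data.table's implementation; the table, the column layout and the helper name rolling_lookup are all made up for illustration. It only shows why keeping rows sorted by (id, time) makes a last-observation-carried-forward lookup a simple binary search.

```python
from bisect import bisect_right

# Hypothetical table, already sorted by its two-column key (id, time).
prices = [
    ("A", 1, 10.0),
    ("A", 5, 11.0),
    ("B", 2, 20.0),
    ("B", 7, 21.0),
]
# The sort order itself plays the role of the key.
keys = [(row[0], row[1]) for row in prices]


def rolling_lookup(id_, time):
    """Return the last value for id_ observed at or before time, else None."""
    pos = bisect_right(keys, (id_, time))  # index of first key strictly greater
    if pos == 0:
        return None
    prev_id, _, value = prices[pos - 1]
    # Only carry an observation forward within the same id group.
    return value if prev_id == id_ else None


print(rolling_lookup("A", 3))  # no row at time 3, so the row at time 1 prevails
print(rolling_lookup("B", 7))  # exact match on the key
print(rolling_lookup("B", 1))  # nothing at or before time 1 for B
```

A hash of the key would find the exact match in the second call, but it could not answer the first call, because "the nearest earlier row" needs the ordering.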

I'll need some time to confirm as it's the first I've seen of the comparison to data.table as presented.

UPDATE from data.table v1.8.0 released July 2012

Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). The slowdown was exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s, for example.
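For readers unfamiliar with what "matching i levels to x levels" means, here is a rough sketch in plain Python. It illustrates only the semantics of the preliminary step, not how sortedmatch() or chmatch() do it. The function names and sample data are hypothetical, codes are 0-based here (R's are 1-based), and the Python dict used below is a hash table, which is an assumption of this sketch rather than a description of data.table.

```python
def match_levels(i_levels, x_levels):
    """For each level of i, give its position in x_levels, or None if absent."""
    position = {level: k for k, level in enumerate(x_levels)}
    return [position.get(level) for level in i_levels]


def recode(i_codes, i_levels, x_levels):
    """Translate i's integer codes into x's level numbering."""
    mapping = match_levels(i_levels, x_levels)
    return [mapping[code] for code in i_codes]


x_levels = ["apple", "banana", "cherry"]
i_levels = ["banana", "cherry", "durian"]
i_codes = [0, 2, 1, 0]  # banana, durian, cherry, banana

# Codes of i expressed in x's numbering; None marks a level x does not have.
print(recode(i_codes, i_levels, x_levels))
```

The work in match_levels() grows with the number of levels rather than the number of rows, which is why this step only became noticeable once a column had many thousands of unique strings.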

Also in that release:

character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.
New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.

As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.

But as I wrote originally, above:

data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys is on the list, for example.
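The contrast between the two approaches can be sketched in plain Python. Neither function is the actual pandas or data.table code; the names hash_join and sorted_key_join and the toy rows are assumptions for illustration only. The first hashes the combined two-column key; the second keeps the table ordered by that key and binary-searches it.

```python
from bisect import bisect_left

# Hypothetical rows: (key1, key2, value).
left = [("a", "x", 1), ("b", "y", 2), ("c", "z", 3)]
right = [("b", "y", 20), ("a", "x", 10), ("d", "w", 40)]


def hash_join(left, right):
    """Equi join: hash the combined two-column key of right, probe with left."""
    index = {}
    for k1, k2, val in right:
        index.setdefault((k1, k2), []).append(val)
    return [(k1, k2, lv, rv)
            for k1, k2, lv in left
            for rv in index.get((k1, k2), [])]


def sorted_key_join(left, right):
    """Equi join: keep right sorted by its key and binary-search each left key."""
    ordered = sorted(right, key=lambda row: (row[0], row[1]))
    keys = [(k1, k2) for k1, k2, _ in ordered]
    out = []
    for k1, k2, lv in left:
        pos = bisect_left(keys, (k1, k2))
        # Walk forward over every row that shares the key.
        while pos < len(keys) and keys[pos] == (k1, k2):
            out.append((k1, k2, lv, ordered[pos][2]))
            pos += 1
    return out


print(hash_join(left, right))
print(sorted_key_join(left, right))
```

Both return the same rows for an equi join. The sorted version pays a logarithmic search per lookup instead of a constant-time probe, but its ordering is what allows rolling joins, which a hash index cannot serve.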

In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.
