Java:优化哈希集以进行大规模重复检测
我正在处理一个处理大量推文的项目;目标是在我处理重复项时删除它们.我有推文 ID,它以 "166471306949304320"
I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"
为此,我一直在使用 HashSet<String>
,它可以正常工作一段时间.但是当我达到大约 1000 万个项目时,我彻底陷入困境并最终得到一个 GC 错误,大概来自重新散列.我尝试使用
I have been using a HashSet<String>
for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet
这让它走得更远一点,但仍然非常缓慢(大约 1000 万,它需要 3 倍的处理时间).我该如何优化呢?鉴于我大概知道最后应该有多少项目(在这种情况下,大约 20-22 百万)我应该创建一个只重新散列两次或三次的 HashSet,或者这样的开销设置招致太多的时间惩罚?如果我不使用字符串,或者如果我定义不同的 HashCode 函数(在这种情况下是字符串的特定实例,我不知道该怎么做),事情会更好吗?这部分实现代码如下.
and that lets it get a little farther, but is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million) should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time-penalties? Would things work better if I wasn't using a String, or if I define a different HashCode function (which, in this case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))){
duplicates++;
continue;
}
解决方案
感谢您的建议,我解决了.问题在于哈希表示所需的内存量;首先,HashSet<String>
非常庞大且不需要,因为 String.hashCode()
对于这种规模来说太高了.接下来我尝试了一个 Trie,但它在超过 100 万个条目时崩溃了;重新分配数组是有问题的.我使用 HashSet<Long>
来获得更好的效果,几乎成功了,但是速度下降了,最终在处理的最后一段(大约 1900 万)崩溃了.解决方案是脱离标准库并使用 Trove.它完成 2200 万条记录比根本不检查重复要快几分钟.最终实现很简单,看起来像这样:
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String>
was simply enormous and uncalled for because the String.hashCode()
is exorbitant for this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long>
to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. Final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
duplicates++;
continue;
}
推荐答案
您可能希望超越 Java 集合框架.我做了一些内存密集型处理,你会遇到几个问题
You may want to look beyond the Java collections framework. I've done some memory intensive processing and you will face several problems
- 大型哈希图和哈希集的桶数将导致大量开销(内存).您可以通过使用来影响这一点某种自定义哈希函数和模数,例如50000
- 字符串在 Java 中使用 16 位字符表示.您可以通过对大多数脚本使用 utf-8 编码的字节数组来减半.
- HashMap 通常是非常浪费的数据结构,而 HashSet 基本上只是这些数据结构的一个薄包装.
鉴于此,请查看 trove 或 guava 的替代品.此外,您的 id 看起来很长.这些是 64 位的,比字符串表示要小很多.
Given that, take a look at trove or guava for alternatives. Also, your ids look like longs. Those are 64 bit, quite a bit smaller than the string representation.
您可能要考虑的另一种选择是使用布隆过滤器(番石榴有一个不错的实现).布隆过滤器会告诉您某些东西是否绝对不在集合中,并且如果包含某些东西,则具有合理的确定性(小于 100%).结合一些基于磁盘的解决方案(例如数据库、mapdb、mecached ......)应该可以很好地工作.您可以缓冲传入的新 id,分批写入它们,并使用布隆过滤器检查您是否需要在数据库中查找,从而在大多数情况下避免昂贵的查找.
An alternative you might want to consider is using bloom filters (guava has a decent implementation). A bloom filter would tell you if something is definitely not in a set and with reasonable certainty (less than 100%) if something is contained. That combined with some disk based solution (e.g. database, mapdb, mecached, ...) should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the bloom filter to check if you need to look in the database and thus avoid expensive lookups most of the time.
相关文章