计算大型 Oracle 表中的 CLOB 重复项

2021-12-24 00:00:00 sql duplicates oracle plsql clob

我有一个 Oracle 数据库表 LOG_MESSAGES,其中有一个名为 MESSAGE 的 CLOB 列.某些行包含相同的 MESSAGE.

I have an Oracle database table LOG_MESSAGES with a CLOB column called MESSAGE. Some of the rows contain the same MESSAGE.

对于每个至少有一个重复的 MESSAGE,我想知道重复的数量.相当多的这些 CLOB 很大(> 100 kB),因此转换为 VARCHAR2 是不可能的.由于GROUP BY等很多传统方法不适用于CLOB,请大神指教一下?

For each MESSAGE which has at least a duplicate, I'd like to know the number of duplicates. Quite a number of these CLOBs are huge (> 100 kB), so converting to VARCHAR2 is out of question. Since many traditional methods such as GROUP BY do not work with CLOB, could someone please enlighten me?

有关信息,该表非常大(大约 1 TB).因此,非常感谢优化方法.

For information, the table is very large (around 1 TB). So an optimised method would very much be appreciated.

预先感谢您的帮助.

推荐答案

我认为这个问题经常被问到,但不幸的是,似乎没有完美的方法来做到这一点.不过有一些方法可以很好地工作.

I think this question gets asked a lot but unfortunately there doesn't seem to be a perfect way of doing this. There are ways that work just fine though.

搜索clob group by"或clob distinct",你会在这个网站上看到几个点击.

Search for "clob group by" or "clob distinct" and you will see several hits just on this website.

一种方法是编写一个 PL/SQL 脚本,在表中的所有 clob 之间执行 DBMS_LOB.COMPARE 但效率可能是 O(n^2) 的顺序,即会让你的目的变得很慢.

One way would be to write a PL/SQL script that does a DBMS_LOB.COMPARE between all clobs in the table but the efficiency would probably be in the order of O(n^2) which would make it really slow for your purpose.

另一种被广泛接受的方法是使用 dbms_crypto 获取 clob 的散列值(我认为这允许对 clob 进行散列),然后对散列值进行分组.存在哈希冲突的可能性,但概率很小.我读了大约 2^80(不过数字可能是错误的).这不会像第一种方法那么慢,但计算散列也需要不可忽略的时间.

Another approach that is well accepted is to take a hash value of the clob using dbms_crypto (i think that allows hashing on clobs) and then group by on the hash values. There is a possibility of hash collision, but the probability is minute. I read somewhere around 2^80 (number might be wrong though). This won't be as slow as the first approach but calculating a hash would also take non-negligible time.

我建议先尝试散列方法,如果这看起来太慢,请寻找替代方法.

I would suggest try the hash approach first and if that seems too slow, look for alternatives.

相关文章