MySQL: quickly deleting duplicates from a large database

2021-11-20 00:00:00 sql duplicates mysql

I've got a big (over a million rows) MySQL database messed up by duplicates. I think they could make up from 1/4 to 1/2 of the whole db. I need to get rid of them fast (I mean query execution time). Here's how it looks:
id (index) | text1 | text2 | text3
The text1 & text2 combination should be unique; if there are duplicates, only one row should remain, and it should be one where text3 is NOT NULL if such a row exists. Example:

1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL  

...should turn into:

1 | abc | def | ghi   #(doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)
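
To get a sense of how many duplicates there really are, a grouped count works (a sketch, using the my_tbl name from the attempt below; it needs a full grouping pass itself, so it will also be slow on a table this size):

SELECT COUNT(*) AS dup_groups,
       SUM(cnt) - COUNT(*) AS extra_rows
FROM (SELECT COUNT(*) AS cnt      -- rows per (text1, text2) pair
      FROM my_tbl
      GROUP BY text1, text2
      HAVING COUNT(*) > 1) AS d;  -- only pairs that occur more than once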

New ids could be anything; they do not depend on the old table's ids.
I've tried things like:

CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, the query execution time on mine is just huge (it never actually finished; > 20 min).

Is there any faster way to do that? Please help me solve this problem.

Recommended answer

I believe this will do it, using on duplicate key + ifnull():

-- empty copy with the same columns and indexes as the original table
create table tmp like yourtable;

-- the unique key is what triggers the on-duplicate-key handling for repeated (text1, text2) pairs
alter table tmp add unique (text1, text2);

-- the first row of each (text1, text2) pair is inserted as-is; later duplicates
-- only fill in text3 when the row already kept still has NULL there
insert into tmp select * from yourtable
    on duplicate key update text3=ifnull(text3, values(text3));

-- swap the tables, then drop the old data
rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

Should be much faster than anything that requires group by or distinct or a subquery, or even order by. This doesn't even require a filesort, which is going to kill performance on a large temporary table. Will still require a full scan over the original table, but there's no avoiding that.
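
To sanity-check the result afterwards (a sketch; it assumes the deduplicated table kept the name yourtable after the rename), this should return no rows once the duplicates are gone:

select text1, text2, count(*) as cnt
from yourtable
group by text1, text2
having count(*) > 1;   -- any row returned is a (text1, text2) pair that still repeats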
