SQL 查询 - 如果重复超过 3 个,则删除重复项?

2022-01-10 00:00:00 sql duplicates sql-server

有没有人有一个优雅的 sql 语句从表中删除重复记录,但前提是重复次数超过 x ?所以它最多允许 2 或 3 个重复项,但仅此而已?

Does anyone have an elegant sql statement to delete duplicate records from a table, but only if there are more than x number of duplicates? So it allows up to 2 or 3 duplicates, but that's it?

目前我有一个执行以下操作的 select 语句:

Currently I have a select statement that does the following:

delete table
from table t
left outer join (
 select max(id) as rowid, dupcol1, dupcol2
 from table
 group by dupcol1, dupcol2
) as keeprows on t.id=keeprows.rowid
where keeprows.rowid is null

这很好用.但现在我想做的只是删除这些行,如果它们有超过 2 个重复项.

This works great. But now what I'd like to do is only delete those rows if they have more than say 2 duplicates.

谢谢

推荐答案

with cte as (
  select row_number() over (partition by dupcol1, dupcol2 order by ID) as rn
     from table)
delete from cte
   where rn > 2; -- or >3 etc

查询为每条记录生成一个行号",按 (dupcol1, dupcol2) 分组并按 ID 排序.实际上,此行号计算具有相同 dupcol1 和 dupcol2 的重复项",然后分配编号 1、2、3..N,按 ID 排序.如果您只想保留 2 个重复",那么您需要删除那些分配了数字 3,4,.. N 的那些,这是由 DELLETE 处理的部分.. 哪里 rn >2;

The query is manufacturing a 'row number' for each record, grouped by the (dupcol1, dupcol2) and ordered by ID. In effect this row number counts 'duplicates' that have the same dupcol1 and dupcol2 and assigns then the number 1, 2, 3.. N, order by ID. If you want to keep just 2 'duplicates', then you need to delete those that were assigned the numbers 3,4,.. N and that is the part taken care of by the DELLETE.. WHERE rn > 2;

使用此方法,您可以更改 ORDER BY 以适合您的首选顺序(例如.ORDER BY ID DESC),以便 LATESTrn=1,那么最新的下一个是 rn=2,依此类推.其余的保持不变,DELETE 将仅删除最旧的,因为它们具有最高的行号.

Using this method you can change the ORDER BY to suit your preferred order (eg.ORDER BY ID DESC), so that the LATEST has rn=1, then the next to latest is rn=2 and so on. The rest stays the same, the DELETE will remove only the oldest ones as they have the highest row numbers.

不同于 这个密切相关的问题,随着条件变得更加复杂,使用 CTE 和 row_number() 变得更加简单.如果不存在适当的访问索引,性能可能仍然存在问题.

Unlike this closely related question, as the condition becomes more complex, using CTEs and row_number() becomes simpler. Performance may be problematic still if no proper access index exists.

相关文章