删除重复的行,只留下最旧的行?

2022-01-10 00:00:00 duplicate-removal mysql delete-row

我有一张数据表,其中有很多来自用户提交的重复条目.

I have a table of data and there are many duplicate entries from user submissions.

我想根据字段 subscriberEmail 删除所有重复的行,只留下原始提交.

I want to delete all duplicates rows based on the field subscriberEmail, leaving only the original submission.

换句话说,我想搜索所有重复的电子邮件,并删除那些行,只留下原来的.

In other words, I want to search for all duplicate emails, and delete those rows, leaving only the original.

如何在不交换表的情况下做到这一点?
我的表包含每一行的唯一 ID.

How can I do this without swapping tables?
My table contains unique IDs for each row.

推荐答案

由于您使用 id 列作为记录是原始"的指示符:

Since you're using the id column as an indicator of which record is 'original':

delete x 
from myTable x
 join myTable z on x.subscriberEmail = z.subscriberEmail
where x.id > z.id

这将为每个电子邮件地址留下一条记录.

This will leave one record per email address.

编辑添加:

解释上面的查询...

To explain the query above...

这里的想法是把表连接起来.假设您有两个表格副本,每个副本命名不同.然后您可以将它们相互比较,并找到每个电子邮件地址的最低 id 或.然后,您会看到稍后创建的重复记录,并且可以将其删除.(考虑到这一点时,我正在可视化 Excel.)

The idea here is to join the table against itself. Pretend that you have two copies of the table, each named something different. Then you could compare them to each other, and find the lowest id or for each email address. You'd then see the duplicate records that were created later on and could delete them. (I was visualizing Excel when thinking about this.)

为了对表执行该操作,将其与自身进行比较并能够识别每一侧,您使用表别名.x 是表别名.它在 from 子句中分配,如下所示:from <table><别名>.x 现在可以在同一查询中的其他地方使用,以将该表作为快捷方式引用.

In order to do that operation on a table, compare it to itself and be able to identify each side, you use table aliases. x is a table alias. It is assigned in the from clause like so: from <table> <alias>. x can now be used elsewhere in the same query to refer to that table as a shortcut.

delete x 以我们的操作和目标开始查询.我们将执行查询以从多个表中选择记录,并且我们要删除出现在 x 中的记录.

delete x starts the query off with our action and target. We're going to perform a query to select records from multiple tables, and we want to delete records that appear in x.

别名用于指代表的两个实例".from myTable x join myTable z on x.subscriberEmail = z.subscriberEmail 在电子邮件匹配的地方将表格与自身碰撞.如果没有后面的 where 子句,将选择每条记录,因为它可以连接到自身.

Aliases are used to refer to both 'instances' of the table. from myTable x join myTable z on x.subscriberEmail = z.subscriberEmail bumps the table up against itself where the emails match. Without the where clause that follows, every record would be selected as it could be joined up against itself.

where 子句限制选择的记录.其中 x.id >z.id 允许 'instance' 别名 x 仅包含匹配电子邮件但具有更高 id 值的记录.您真正想要的数据,唯一的电子邮件地址(具有最低 id)不会成为 x 的一部分,也不会被删除.x 中的唯一记录将是重复记录(电子邮件地址),其 id 高于该电子邮件地址的原始记录.

The where clause limits the records that are selected. where x.id > z.id allows the 'instance' aliased x to contain only the records that match emails but have a higher id value. The data that you really want in the table, unique email addresses (with the lowest id) will not be part of x and will not be deleted. The only records in x will be duplicate records (email addresses) that have a higher id than the original record for that email address.

在这种情况下可以合并join和where子句:

The join and where clauses could be combined in this case:

delete x 
  from myTable x 
  join myTable z
    on x.subscriberEmail = z.subscriberEmail
      and x.id > z.id

为防止重复,请考虑将subscriberEmail 列设为唯一索引列.

For preventing duplicates, consider making the subscriberEmail column a UNIQUE indexed column.

相关文章