SQL 查找重复条目(组内)

2021-12-17 00:00:00 join sql oracle

SQL 查找重复条目(组内)

SQL to find duplicate entries (within a group)

我有一个小问题,我不确定解决它的最佳方法是什么,因为我对数据库 (Oracle) 本身的访问权限有限.在我们的表EVENT"中,我们有大约 160k 个条目,每个 EVENT 都有一个 GROUPID,而一个普通条目正好有 5 行具有相同的 GROUPID.由于一个错误,我们目前得到了几个重复的条目(重复,所以 10 行而不是 5 行,只是一个不同的 EVENTID.这可能会改变,所以它只是 <> 5).我们需要过滤这些组的所有条目.

I have a small problem and I'm not sure what would be the best way to fix it, as I only have limited access to the database (Oracle) itself. In our Table "EVENT" we have about 160k entries, each EVENT has a GROUPID and a normal entry has exactly 5 rows with the same GROUPID. Due to a bug we currently get a couple of duplicate entries (duplicate, so 10 rows instead of 5, just a different EVENTID. This may change, so it's just <> 5). We need to filter all the entries of these groups.

由于对数据库的访问受限,我们不能使用临时表,也不能向 GROUPID 列添加索引以使其更快.

Due to limited access to the database we can not use a temporary table, nor can we add an index to the GROUPID column to make it faster.

我们可以通过这个查询获取 GROUPID,但我们需要第二个查询来获取所需的数据

We can get the GROUPIDs with this query, but we would need a second query to get the needed data

select A."GROUPID"
from "EVENT" A
group by A."GROUPID"
having count(A."GROUPID") <> 5

一种解决方案是子选择:

One solution would be a subselect:

select *
from "EVENT" A
where A."GROUPID" IN (
  select B."GROUPID"
  from "EVENT" B
  group by B."GROUPID"
  having count(B."GROUPID") <> 5
)

如果没有 GROUPID 的索引和 160k 条目,这会花费太长时间.尝试考虑可以处理此问题的连接,但目前找不到好的解决方案.

Without an index on GROUPID and 160k entries, this takes much too long. Tried thinking about a join that can handle this, but can't find a good solution so far.

有人可以为此找到一个好的解决方案吗?

Anybody can find a good solution for this maybe?

小我们这里没有 100% 重复,因为每个条目仍然有一个唯一的 ID,而 GROUPID 也不是唯一的(这就是我们需要使用分组依据"的原因) - 或者我只是错过了一个简单的解决方案:)

Small edit: We don't have 100% duplicates here, as each entry still has a unique ID and the GROUPID is not unique either (that's why we need to use "group by") - or maybe I just miss an easy solution for it :)

关于数据的小例子(我不想删除它,只是找到它)

Small example about the data (I don't want to delete it, just find it)

<代码>事件ID |GROUPID |类型ID
123456   123     12
123457   123     145
123458   123     2612
123459   123     41
123460   123     238

234567   123     12
234568   123     145
234569   123     2612
234570   123     41
234571   123     238

它还有一些列,例如时间戳等,但正如您已经看到的,除了 EVENTID 之外,所有内容都相同.

EVENTID | GROUPID | TYPEID
123456    123       12
123457    123       145
123458    123       2612
123459    123       41
123460    123       238

234567    123       12
234568    123       145
234569    123       2612
234570    123       41
234571    123       238

It has some more columns, like timestamp etc, but as you can see already, everything is identical, besides the EVENTID.

我们将更频繁地运行它进行测试,以发现错误并检查是否再次发生.

We will run it more often for testing, to find the bug and check if it happens again.

推荐答案

您可以通过连接而不是子查询获得答案

You can get the answer with a join instead of a subquery

select
    a.*
from
    event as a
inner join
    (select groupid
     from event
     group by groupid
     having count(*) <> 5) as b
  on a.groupid = b.groupid

这是从组中的行中获取所有信息的一种相当常见的方法.

This is a fairly common way of obtaining the all the information out of the rows in a group.

就像您建议的答案和其他回复一样,使用 groupid 上的索引会运行得更快.由 DBA 来平衡使查询运行得更快的好处与维护另一个索引的成本.

Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.

如果 DBA 决定不使用索引,请确保相关人员了解是索引策略而不是您编写查询的方式会减慢速度.

If the DBA decides against the index, make sure the appropriate people understand that its the index strategy and not the way you wrote the query that is slowing things down.

相关文章