相同的查询给出不同的结果

2021-09-10 00:00:00 tsql sql-server

我对数据库工作还很陌生,所以请耐心等待.我已经阅读了许多类似的问题,但似乎没有一个在谈论我面临的同一问题.

I am still new to working in databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to be talking about the same issue I am facing.

只是一些关于我在做什么的信息,我有一个填满联系信息的表格,一些联系人是重复的,但大多数重复的行都有一个截断的电话号码,这使得这些数据毫无用处.

Just a bit of info on what I am doing, I have a table filled with contact information, and some of the contacts are duplicated, but most of the duplicated rows have a truncated phone number, which makes that data useless.

我编写了以下查询来搜索重复项:

I wrote the following query to search for the duplicates:

WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
   SELECT 
       CID, Firstname, lastname, phone, email, LEN(phone) AS length,
       ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email 
                          ORDER BY Firstname) AS dupcnt
   FROM 
       [data.com_raw]
)
SELECT * 
FROM CTE
WHERE dupcnt > 1
  AND length <= 10

我假设此查询会根据我指定的三列查找所有具有重复项的记录,并选择 dupcnt 大于 1 的任何记录,以及具有长度的电话列小于或等于 10.但是当我多次运行查询时,每次执行都会得到不同的结果集.一定有一些我在这里遗漏的逻辑,但我对此完全感到困惑.所有列都是 varchar 数据类型,除了 CID,它是 int.

I assumed that this query would find all records that have duplicates based on the three columns that I have specified, and select any that have the dupcnt greater than 1, and a phone column with a length less than or equal to 10. But when I run the query more than once I get different result sets each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar datatype, except for CID, which is int.

推荐答案

代替 ROW_NUMBER() 使用 COUNT(*),并删除 ORDER BY 因为那不是必须使用 COUNT(*).

Instead of ROW_NUMBER() use COUNT(*), and remove the ORDER BY since that's not necessary with COUNT(*).

按照您现在的方式,您正在通过 firstname/lastname/email 将记录分成相似的记录组/分区.然后您按 firstname 对每个组/分区进行排序.Firstname 是分区的一部分,这意味着该组/分区中的每个名字都是相同的.您将获得不同的结果,具体取决于 SQL Server 从存储中获取结果的方式(它首先找到的记录是 1,第二个找到的是 2).每次获取记录时(每次运行此 sql 时),它都可能以不同的顺序从磁盘或缓存中获取每条记录.

The way you have it now, you are chunking up records into similar groups/partitions of records by firstname/lastname/email. Then you are ORDERING each group/partition by firstname. Firstname is part of the partition, meaning every firstname in that group/partition is identical. You will get different results depending on how SQL Server fetches the results from storage (which record it found first is 1, what it found second is 2). Every time it fetches records (every time you run this sql) it may fetch each record from disk or cache at a different order.

Count(*) 将返回所有重复的行

改为:

 COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt

这将返回共享相同名字、姓氏和电子邮件的记录数.然后您保留任何大于 1 的记录.

Which will return the number of records that share the same firstname, lastname, and email. You then keep any record that is greater than 1.

相关文章