使用带有 CTE 的 NEWID() 生成随机的行子集会产生奇怪的结果
我正在存储过程中编写一些 SQL,以将数据集减少到我想要报告的有限随机行数.
I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.
报告以 Users
的 Group
开头,并应用过滤器来指定所需的随机行总数 (@SampleLimit
).
The report starts with a Group
of Users
and a filter is applied to specify the total number of random rows required (@SampleLimit
).
为了达到预期的结果,我首先创建一个 CTE(临时表):
To achieve the desired result, I start by creating a CTE (temp table) with:
- 应用了
top(@SampleLimit)
按 UserId 分组
(因为 UserID 出现多次)order by NEWID()
将结果随机排列
- The
top(@SampleLimit)
applied group by UserId
(as the UserID appears multiple times)order by NEWID()
to put the results in a random order
SQL:
; with cte_temp as
(select top(@SampleLimit) UserId from QueryResults
where (GroupId = @GroupId)
group by UserId order by NEWID())
获得此结果集后,我会删除 UserId 为 NOT IN
上一步中创建的 CTE 的所有结果.
Once I have this result set, I then delete any results where the UserId is NOT IN
the CTE created in the previous step.
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))
我遇到的问题是,有时我得到的结果多于 @SampleLimit
中指定的结果,而其他时候它完全按预期工作.
The issue that I'm having is that from time to time, I get more results than specified in the @SampleLimit
and other times it works exactly as expected.
我已尝试分解 SQL 并在应用程序外部执行它,但无法重现该问题.
I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.
我所做的事情是否存在根本性的错误,可以解释为什么我偶尔会得到我要求的更多结果?
Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results that I request?
为了完整性 - 我根据以下答案重构了解决方案:
select top(@SampleLimit) UserId into #T1
from QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID()
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))
推荐答案
涉及NEWID()
的SELECT
语句会被执行多少次是不确定的.
It is undeterministic how many times the SELECT
statement involving NEWID()
will be executed.
如果您在 QueryResults
和 cte_temp
之间得到一个嵌套循环反半连接,并且计划中没有假脱机,它可能会被重新评估多次是 QueryResults
中的行,这意味着对于每个外部行,与 NOT IN
进行比较的集合可能完全不同.
If you get a nested loops anti semi join between QueryResults
and cte_temp
and there is no spool in the plan it will likely be re-evaluated as many times as there are rows in QueryResults
this means that for each outer row the set that is being compared against with NOT IN
may be entirely different.
您可以将结果具体化到一个临时表中来避免这种情况,而不是使用 CTE.
Instead of using a CTE you can materialize the results into a temporary table to avoid this.
INSERT INTO #T
SELECT TOP(@SampleLimit) UserId
FROM QueryResults
WHERE ( GroupId = @GroupId )
GROUP BY UserId
ORDER BY NEWID()
然后在 DELETE
相关文章