Get records with the highest/smallest <whatever> per group
How do I do that?
The former title of this question was "Using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for a solution using ranks, but now I see that the solution posted by Bill is much, much better.
Original question:
I'm trying to compose a query that takes the last record from each group, given some defined order:
SET @Rank=0;
select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from Table
            order by OrderField
           ) as t
      group by GroupId) as t
join (
      select *, @Rank := @Rank + 1 AS Rank
      from Table
      order by OrderField
     ) as s
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField
The expression @Rank := @Rank + 1 is normally used for ranking, but it looks suspicious to me when it is used in two subqueries yet initialized only once. Will it work this way?
Second, will it work with a single subquery that is evaluated multiple times, such as a subquery in a WHERE (or HAVING) clause? Here is another way to write the query above:
SET @Rank=0;
select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
               from (select GroupId, @Rank := @Rank + 1 AS Rank
                     from Table as t0
                     order by OrderField
                    ) as t
               where t.GroupId = Table.GroupId
              )
order by OrderField
Thanks in advance!
Recommended answer
So you want to get the row with the highest OrderField per group? I'd do it this way:
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; -- not needed! (note by Tomas)
(EDIT by Tomas: If several records within the same group share the same OrderField and you need exactly one of them, you may want to extend the condition:
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
ON t1.GroupId = t2.GroupId
AND (t1.OrderField < t2.OrderField
OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL
End of edit.)
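To see what the extended condition changes, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than MySQL (an assumption made only so the example is self-contained), and the Id column and the sample rows are hypothetical. Both groups contain two rows tied on the highest OrderField.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: Id is assumed to be a unique key used as the tie-breaker.
conn.execute(
    "CREATE TABLE `Table` (Id INTEGER PRIMARY KEY, GroupId INTEGER, OrderField INTEGER)"
)
conn.executemany(
    "INSERT INTO `Table` VALUES (?, ?, ?)",
    [(1, 10, 30), (2, 10, 30), (3, 10, 20), (4, 20, 60), (5, 20, 60)],
)

# Basic condition: every row tied for the maximum survives.
basic = conn.execute("""
    SELECT t1.Id, t1.GroupId, t1.OrderField
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
    WHERE t2.GroupId IS NULL
    ORDER BY t1.Id
""").fetchall()

# Extended condition: among tied rows, only the one with the highest Id survives.
extended = conn.execute("""
    SELECT t1.Id, t1.GroupId, t1.OrderField
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId
     AND (t1.OrderField < t2.OrderField
          OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
    WHERE t2.GroupId IS NULL
    ORDER BY t1.Id
""").fetchall()

print(len(basic))  # 4 rows: both tied rows of each group
print(extended)    # [(2, 10, 30), (5, 20, 60)]
```

With t1.Id < t2.Id the row with the largest Id wins the tie; flipping that comparison would keep the smallest Id instead.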
In other words, return the row t1 for which no other row t2 exists with the same GroupId and a greater OrderField. When t2.* is NULL, it means the left outer join found no such match, and therefore t1 has the greatest value of OrderField in the group.
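The reasoning above can be checked with a small runnable sketch. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL (an assumption made only to keep the example self-contained) and loads six sample rows, three per group, where foo is set only on the row that should win.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE `Table` (GroupId INTEGER, OrderField INTEGER, foo TEXT)")
conn.executemany(
    "INSERT INTO `Table` VALUES (?, ?, ?)",
    [(10, 10, None), (10, 20, None), (10, 30, "foo"),
     (20, 40, None), (20, 50, None), (20, 60, "foo")],
)

# Keep only the t1 rows for which the join found no t2 row
# in the same group with a greater OrderField.
rows = conn.execute("""
    SELECT t1.GroupId, t1.OrderField, t1.foo
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
    WHERE t2.GroupId IS NULL
    ORDER BY t1.GroupId
""").fetchall()

print(rows)  # [(10, 30, 'foo'), (20, 60, 'foo')]
```

To get the smallest value per group instead, the same pattern should work with the comparison reversed (t1.OrderField > t2.OrderField).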
No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField).
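As a rough illustration of the compound index, the sketch below creates one and asks the engine for its plan. It runs on SQLite through Python's sqlite3 module, not MySQL, so the plan text is SQLite's EXPLAIN QUERY PLAN output and will not look like MySQL's EXPLAIN; the index name idx_group_order is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE `Table` (GroupId INTEGER, OrderField INTEGER, foo TEXT)")
# Compound index on the two columns the join condition uses (name is hypothetical).
conn.execute("CREATE INDEX idx_group_order ON `Table` (GroupId, OrderField)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT t1.*
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
    WHERE t2.GroupId IS NULL
""").fetchall()

# The human-readable step description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)
```

The lookup on t2 only needs GroupId and OrderField, both of which are in the index, which is what allows an engine to answer that side of the join from the index alone.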
Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.
It's important that you have the right index to get the best results!
Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.
I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:
select * from `Table`;
+---------+------------+------+
| GroupId | OrderField | foo |
+---------+------------+------+
| 10 | 10 | NULL |
| 10 | 20 | NULL |
| 10 | 30 | foo |
| 20 | 40 | NULL |
| 20 | 50 | NULL |
| 20 | 60 | foo |
+---------+------------+------+
We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:
select GroupId, max(Rank) AS MaxRank
from (
      select GroupId, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField) as t
group by GroupId
+---------+---------+
| GroupId | MaxRank |
+---------+---------+
| 10 | 3 |
| 20 | 6 |
+---------+---------+
Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:
select s.*, t.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from `Table`
            order by OrderField
           ) as t
      group by GroupId) as t
join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
     ) as s
  -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;
+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo | Rank |
+---------+---------+---------+------------+------+------+
| 10 | 3 | 10 | 10 | NULL | 7 |
| 20 | 6 | 10 | 10 | NULL | 7 |
| 10 | 3 | 10 | 20 | NULL | 8 |
| 20 | 6 | 10 | 20 | NULL | 8 |
| 20 | 6 | 10 | 30 | foo | 9 |
| 10 | 3 | 10 | 30 | foo | 9 |
| 10 | 3 | 20 | 40 | NULL | 10 |
| 20 | 6 | 20 | 40 | NULL | 10 |
| 10 | 3 | 20 | 50 | NULL | 11 |
| 20 | 6 | 20 | 50 | NULL | 11 |
| 20 | 6 | 20 | 60 | foo | 12 |
| 10 | 3 | 20 | 60 | foo | 12 |
+---------+---------+---------+------------+------+------+
We can see from the above that the max rank per group is correct, but @Rank then continues to increase as the second derived table is processed, to 7 and higher. So the ranks from the second derived table never overlap with the ranks from the first derived table at all.
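The effect can be mimicked outside the database. The sketch below is a plain Python model, not MySQL: a Session object stands in for the session variable @Rank, and ranked() stands in for a derived table that numbers its rows in OrderField order (it assumes the rows are numbered in sorted order, as in the output above). All names here are hypothetical.

```python
# Sample rows as (GroupId, OrderField), matching the dummy data.
table = [(10, 10), (10, 20), (10, 30), (20, 40), (20, 50), (20, 60)]


class Session:
    """Holds one counter shared by every derived table, like @Rank."""

    def __init__(self):
        self.rank = 0

    def next_rank(self):
        self.rank += 1
        return self.rank


def ranked(session, rows):
    """Number the rows in OrderField order using the shared counter."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [(group, order, session.next_rank()) for group, order in ordered]


session = Session()

# First derived table: ranks 1..6, then max rank per group.
first = ranked(session, table)
max_rank = {}
for group, _order, rank in first:
    max_rank[group] = max(max_rank.get(group, 0), rank)

# Second derived table: the counter was never reset, so ranks run 7..12.
second = ranked(session, table)

# The join condition t.GroupId = s.GroupId and t.MaxRank = s.Rank.
matches = [(g, o, r) for g, o, r in second if max_rank[g] == r]

print(max_rank)                  # {10: 3, 20: 6}
print([r for _, _, r in second])  # [7, 8, 9, 10, 11, 12]
print(matches)                   # []
```

Because the second pass starts at 7, no row can ever equal a max rank of 3 or 6, so the join comes back empty.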
You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):
select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from `Table`
            order by OrderField
           ) as t
      group by GroupId) as t
join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
     ) as s
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;
+---------+------------+------+------+
| GroupId | OrderField | foo | Rank |
+---------+------------+------+------+
| 10 | 30 | foo | 3 |
| 20 | 60 | foo | 6 |
+---------+------------+------+------+
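Why the reset helps can be shown with a small Python model. This is not MySQL: the rank variable stands in for @Rank, ranked() stands in for a derived table that numbers rows in OrderField order, and the evaluation order (first table, reset, second table) is assumed rather than guaranteed, which is the same caveat that applies to the optimizer.

```python
# Sample rows as (GroupId, OrderField), matching the dummy data.
table = [(10, 10), (10, 20), (10, 30), (20, 40), (20, 50), (20, 60)]

rank = 0  # stands in for the session variable @Rank


def ranked(rows):
    """Number the rows in OrderField order using the shared counter."""
    global rank
    result = []
    for group, order in sorted(rows, key=lambda r: r[1]):
        rank += 1
        result.append((group, order, rank))
    return result


# First derived table: ranks 1..6, then max rank per group.
max_rank = {}
for group, _order, r in ranked(table):
    max_rank[group] = max(max_rank.get(group, 0), r)

# The extra derived table: reset the counter between the two passes.
rank = 0

# Second derived table: ranks run 1..6 again and line up with the first pass.
second = ranked(table)

# The join condition t.GroupId = s.GroupId and t.MaxRank = s.Rank.
matches = [(g, o, r) for g, o, r in second if max_rank[g] == r]

print(matches)  # [(10, 30, 3), (20, 60, 6)]
```

If the two passes ran in a different order, or the reset ran at a different point, the ranks would no longer line up, which is why the approach is fragile.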
But this query optimizes terribly. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining the temp tables either. This is example output from EXPLAIN:
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| 1 | PRIMARY | <derived4> | system | NULL | NULL | NULL | NULL | 1 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | |
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 5 | DERIVED | Table | ALL | NULL | NULL | NULL | NULL | 6 | Using filesort |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 6 | Using temporary; Using filesort |
| 3 | DERIVED | Table | ALL | NULL | NULL | NULL | NULL | 6 | Using filesort |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index", which means it can resolve the join using only the index, without touching the data.
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 6 | Using filesort |
| 1 | SIMPLE | t2 | ref | GroupId | GroupId | 5 | test.t1.GroupId | 1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.