SQL 性能:使用 order by 时使用 OR 比 IN 慢
我正在使用 MariaDB 10.0.21 并针对 1200 万行运行类似于以下查询的查询:
I am using MariaDB 10.0.21 and running a query similar to the following query against 12 Million Rows:
SELECT
`primary_key`
FROM
`texas_parcels`
WHERE
`zip_code`
IN ('28461', '48227', '60411', '65802', '75215', '75440', '75773', '75783', '76501', '76502', '76504', '76511', '76513', '76518', '76519', '76520', '76522', '76525', '76527', '76528', '76530', '76537', '76539', '76541', '76542', '76543', '76548', '76549', '76550', '76556', '76567', '76571', '76574', '76577', '76578', '76642', '76704', '76853', '77418', '77434', '77474', '77833', '77835', '77836', '77845', '77853', '77879', '77964', '77975', '78002', '78003', '78006', '78013', '78028', '78056', '78064', '78070', '78114', '78123', '78130', '78132', '78133', '78154', '78155', '78359', '78382', '78602', '78605', '78606', '78607', '78608', '78609', '78610', '78611', '78612', '78613', '78614', '78615', '78616', '78617', '78619', '78620', '78621', '78623', '78624', '78626', '78628', '78629', '78632', '78633', '78634', '78636', '78638', '78639', '78640', '78641', '78642', '78643', '78644', '78645', '78648', '78650', '78652', '78653', '78654', '78655', '78656', '78657', '78659', '78660', '78662', '78663', '78664', '78665', '78666', '78669', '78672', '78676', '78681', '78699', '78701', '78702', '78703', '78704', '78705', '78717', '78719', '78721', '78722', '78723', '78724', '78725', '78726', '78727', '78728', '78729', '78730', '78731', '78732', '78733', '78734', '78735', '78736', '78737', '78738', '78739', '78741', '78744', '78745', '78746', '78747', '78748', '78749', '78750', '78751', '78752', '78753', '78754', '78756', '78757', '78758', '78759', '78828', '78934', '78940', '78941', '78942', '78945', '78946', '78947', '78948', '78953', '78954', '78956', '78957', '78963', '92536')
ORDER BY
`timestamp_updated` ASC
LIMIT 1000;
我在 (zip_code,timestamp_updated)
上有一个复合索引,我在 ~1.6 秒 内获得结果.在下一个查询中,我仍在搜索相同的邮政编码,但我使用的是 OR 而不是 IN().
I have a compound index on (zip_code,timestamp_updated)
, I obtain results in ~1.6 Seconds. In the next query I am still seraching the same zipcodes but I am using OR instead of IN().
SELECT
`primary_key`
FROM
`texas_parcels`
WHERE
(`zip_code` = '28461' OR `zip_code` = '48227' OR `zip_code` = '60411' OR `zip_code` = '65802' OR `zip_code` = '75215' OR `zip_code` = '75440' OR `zip_code` = '75773' OR `zip_code` = '75783' OR `zip_code` = '76501' OR `zip_code` = '76502' OR `zip_code` = '76504' OR `zip_code` = '76511' OR `zip_code` = '78957' OR `zip_code` = '78963' OR `zip_code` = '92536' OR `zip_code` = '76513' OR `zip_code` = '76518' OR `zip_code` = '76519' OR `zip_code` = '76520' OR `zip_code` = '76522' OR `zip_code` = '76525' OR `zip_code` = '76527' OR `zip_code` = '76528' OR `zip_code` = '76530' OR `zip_code` = '76537' OR `zip_code` = '76539' OR `zip_code` = '76541' OR `zip_code` = '76542' OR `zip_code` = '76543' OR `zip_code` = '76548' OR `zip_code` = '76549' OR `zip_code` = '76550' OR `zip_code` = '76556' OR `zip_code` = '76567' OR `zip_code` = '76571' OR `zip_code` = '76574' OR `zip_code` = '76577' OR `zip_code` = '76578' OR `zip_code` = '76642' OR `zip_code` = '76704' OR `zip_code` = '76853' OR `zip_code` = '77418' OR `zip_code` = '77434' OR `zip_code` = '77474' OR `zip_code` = '77833' OR `zip_code` = '77835' OR `zip_code` = '77836' OR `zip_code` = '77845' OR `zip_code` = '77853' OR `zip_code` = '77879' OR `zip_code` = '77964' OR `zip_code` = '77975' OR `zip_code` = '78002' OR `zip_code` = '78003' OR `zip_code` = '78006' OR `zip_code` = '78013' OR `zip_code` = '78028' OR `zip_code` = '78056' OR `zip_code` = '78064' OR `zip_code` = '78070' OR `zip_code` = '78114' OR `zip_code` = '78123' OR `zip_code` = '78130' OR `zip_code` = '78132' OR `zip_code` = '78133' OR `zip_code` = '78154' OR `zip_code` = '78155' OR `zip_code` = '78359' OR `zip_code` = '78382' OR `zip_code` = '78602' OR `zip_code` = '78605' OR `zip_code` = '78606' OR `zip_code` = '78607' OR `zip_code` = '78608' OR `zip_code` = '78609' OR `zip_code` = '78610' OR `zip_code` = '78611' OR `zip_code` = '78612' OR `zip_code` = '78613' OR `zip_code` = '78614' OR `zip_code` = '78615' OR `zip_code` = '78616' OR `zip_code` = '78617' OR `zip_code` = '78619' OR `zip_code` = '78620' OR `zip_code` = '78621' OR `zip_code` = '78623' OR `zip_code` = '78624' OR `zip_code` = '78626' OR `zip_code` = '78628' OR `zip_code` = '78629' OR `zip_code` = '78632' OR `zip_code` = '78633' OR `zip_code` = '78634' OR `zip_code` = '78636' OR `zip_code` = '78638' OR `zip_code` = '78639' OR `zip_code` = '78640' OR `zip_code` = '78641' OR `zip_code` = '78642' OR `zip_code` = '78643' OR `zip_code` = '78644' OR `zip_code` = '78645' OR `zip_code` = '78648' OR `zip_code` = '78650' OR `zip_code` = '78652' OR `zip_code` = '78653' OR `zip_code` = '78654' OR `zip_code` = '78655' OR `zip_code` = '78656' OR `zip_code` = '78657' OR `zip_code` = '78659' OR `zip_code` = '78660' OR `zip_code` = '78662' OR `zip_code` = '78663' OR `zip_code` = '78664' OR `zip_code` = '78665' OR `zip_code` = '78666' OR `zip_code` = '78669' OR `zip_code` = '78672' OR `zip_code` = '78676' OR `zip_code` = '78681' OR `zip_code` = '78699' OR `zip_code` = '78701' OR `zip_code` = '78702' OR `zip_code` = '78703' OR `zip_code` = '78704' OR `zip_code` = '78705' OR `zip_code` = '78717' OR `zip_code` = '78719' OR `zip_code` = '78721' OR `zip_code` = '78722' OR `zip_code` = '78723' OR `zip_code` = '78724' OR `zip_code` = '78725' OR `zip_code` = '78726' OR `zip_code` = '78727' OR `zip_code` = '78728' OR `zip_code` = '78729' OR `zip_code` = '78730' OR `zip_code` = '78731' OR `zip_code` = '78732' OR `zip_code` = '78733' OR `zip_code` = '78734' OR `zip_code` = '78735' OR `zip_code` = '78736' OR `zip_code` = '78737' OR `zip_code` = '78738' OR `zip_code` = '78739' OR `zip_code` = '78741' OR `zip_code` = '78744' OR `zip_code` = '78745' OR `zip_code` = '78746' OR `zip_code` = '78747' OR `zip_code` = '78748' OR `zip_code` = '78757' OR `zip_code` = '78758' OR `zip_code` = '78759' OR `zip_code` = '78828' OR `zip_code` = '78934' OR `zip_code` = '78940' OR `zip_code` = '78941' OR `zip_code` = '78942' OR `zip_code` = '78945' OR `zip_code` = '78946' OR `zip_code` = '78947' OR `zip_code` = '78948' OR `zip_code` = '78953' OR `zip_code` = '78954' OR `zip_code` = '78956')
ORDER BY
`timestamp_updated` ASC
LIMIT 1000;
第二个查询在 ~7.8 秒内以相同的顺序获得相同的结果.通过解释运行每个查询时,它们几乎相同,它们给我的 rows
数量略有不同.
This second query obtains the same results in the same order in ~7.8 Seconds. When running each query through an explain, they are nearly the same, they give me a slightly different rows
amount.
id select_type table type possible_keys key key_len ref rows filtered Extra
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 SIMPLE TX_Property range Zip Code Search Zip Code Search 15 (NULL) 2402699 99.88 Using where; Using index; Using filesort
2 SIMPLE TX_Property range Zip Code Search Zip Code Search 15 (NULL) 2321908 99.91 Using where; Using index; Using filesort
在分析这两个查询时,唯一的主要时间增加是 排序结果
时间,在第二个查询中它花费了 7.2 秒.
When Profiling the two queries the only major increase in time is the Sorting Result
time, in the second query it took up to 7.2 Seconds.
我想我不明白的是,where 中的不同操作员如何在订单方面产生如此巨大的差异,如果执行时间存在重大差异,这是否有意义?也许我只是不确定分析是如何工作的,实际上只是执行 where 部分的时间,但它只是以一种复杂的方式标记?
I guess what I don't understand is how the different operator in the where could make such a huge difference when it comes to order, it would make sense if there was a major difference in time for executing? Maybe I am just not exactly sure how profiling works, and it actually is just the time of executing the where portion, but it is just labeled in a convoluted way?
我还想指出,当我在没有 ORDER BY timestamp_updated ASC
的情况下运行查询时,顶部查询耗时 ~0.106 秒,第二个查询耗时 ~0.157 秒.
I also wanted to note that when I ran the queries without the ORDER BY timestamp_updated ASC
the top query took ~0.106 Seconds and the second query took ~0.157 Seconds.
推荐答案
删除 ORDER BY
运行得更快,因为它可以在 1000 行后停止.有多少行匹配那个 OR/IN?
Removing the ORDER BY
runs so much faster because it can stop after only 1000 rows. How many rows match that OR/IN?
请注意,EXPLAINs
表示查询是 Using index
.这意味着您有一个覆盖"索引.即 SELECT
中的所有字段都在一个索引中.
Notice that the EXPLAINs
say that the query is Using index
. That means that you have a "covering" index. That is all the fields in the SELECT
are in the one index.
在 InnoDB 中,每个辅助键都隐含包含 PK,因此 INDEX(zip_code, timestamp_updated)
实际上是 INDEX(zip_code, timestamp_updated, primaryKey)
In InnoDB, each secondary key implicitly includes the PK, so INDEX(zip_code, timestamp_updated)
is effectively INDEX(zip_code, timestamp_updated, primaryKey)
索引不是很有效,因为你有两个重要的事情发生:(1) IN 或 OR,(2) ORDER BY.索引只能处理其中一个或另一个.您的索引允许它使用 zip_code
.它
The index is not very efficient since you have two non-trivial things going on: (1) IN or OR, (2) ORDER BY. Only one or the other can be handled by the index. Your index lets it use zip_code
. It
- 在索引中找到与这些邮政编码匹配的行,
- 收集时间戳和 pk,将 3 列放入 tmp 表中
- 排序
- 提供前 1000 个.
相反,如果您说 INDEX(timestamp_updated, zip_code)
您仍然会有一个覆盖"索引,但在这种情况下,索引将(我希望)避免对 SORT 的需要.哦,考虑到这一点,它可能会在 1000 行之后停止.以下是它的工作原理:
If, instead, you said INDEX(timestamp_updated, zip_code)
you would still have a 'covering' index, but in this flavor, the index would (I hope) prevent the need for the SORT. Oh, given that, it might be able to stop after 1000 rows. Here's how it will work:
- 按时间戳顺序扫描索引.
- 检查每一行是否是其中一个拉链.(这里的测试可能在 IN 格式下会更快)
- 如果匹配,则传递行;如果是 1000,则停止.
- Scan through the index in timestamp order.
- Check each row for being one of those zips. (Here the test might be faster in IN format)
- If match, deliver row; if 1000, stop.
但是等一下……现在您将受到 12M 行的支配.如果早早出现 1000 行带有这些 zip 的行(旧时间戳),它可以快速停止.如果你需要检查所有的行来找到1000(或者甚至没有1000),那么它是一个全索引扫描,并且这种索引排列是'糟糕'的.
But wait... Now you are at the mercy of the 12M rows. If 1000 rows with those zips occur early (old timestamps), it can stop fast. If you need to check all the rows to find 1000 (or there aren't even 1000), then it is a full index scan, and this arrangement of the index is 'bad'.
如果你给优化器两个INDEXes
,它会尽职尽责地根据不充分的信息(没有值的分布)做出明智的选择,并且可能会选择更差的那个.
If you give the optimizer both INDEXes
, it will dutifully make an intelligent choice based on inadequate information (no distribution of the values), and might pick the worse one.
您实际上需要一个二维索引.这样的不存在.(好吧,也许 Spatial 可能会被混淆?)但是......
You effectively need a 2-dimensional index. Such don't exist. (Well, maybe Spatial could be kludged?) But...
PARTITION BY RANGE(timestamp)
与 INDEX
starting with zip 可能会更好.但我怀疑优化器是否足够聪明,能够意识到如果它在第一个分区中找到 1000 行,它可能会退出.如果没有 1000 个结果,它仍然会失败.
PARTITION BY RANGE(timestamp)
together with the INDEX
starting with zip might work better. But I doubt if the optimizer is smart enough to realize that if it found 1000 rows in the first partition it could quit. And it still fails badly if there aren't 1000 results.
PARTITION BY RANGE(zip)
与 INDEX
以时间戳开始 可能 无济于事,因为这么多拉链不会做太多修剪.
PARTITION BY RANGE(zip)
together with the INDEX
starting with timestamp probably will not help, since that many zips won't do much pruning.
请为您的每次尝试提供 EXPLAIN FORMAT=JSON SELECT...;
.那里可能有一些微妙的线索来解释广泛的时间变化.
Please provide EXPLAIN FORMAT=JSON SELECT...;
for each of your attempts. There may be some subtle clues there to explain the wide time variations.
您是否每次计时两次?(否则,缓存可能会使结果着色.)
Did you run each timing twice? (Otherwise, caching may have colored the results.)
另一种方法
我不知道这会表现如何,但这里是:
I do not know how well this will perform, but here goes:
SELECT primary_key
FROM (
( SELECT primary_key, timestamp_updated
FROM texas_parcels
WHERE zip_code = '28461'
ORDER BY timestamp_updated
LIMIT 1000
)
UNION ALL (
SELECT primary_key, timestamp_updated
FROM texas_parcels
WHERE zip_code = '48227'
ORDER BY timestamp_updated
LIMIT 1000
)
UNION ALL (
SELECT primary_key, timestamp_updated
FROM texas_parcels
WHERE zip_code = '60411'
ORDER BY timestamp_updated
LIMIT 1000 ) ...
) x
ORDER BY timestamp_updated
LIMIT 1000
x
似乎只有几十万行,而不是 1.3M.但是 UNION
有一些开销,等等.请注意每个子查询和外部的 LIMIT
.如果你也需要OFFSET
,那就更麻烦了.
It seems like x
will have only a few hundred thousand rows, not 1.3M. But UNION
has some overhead, etc. Note the LIMIT
in each subquery and on the outside. If you need OFFSET
, too, it gets trickier.
相关文章