Why doesn't MySQL always use index merge here?

2022-01-15 · indexing, mariadb, mysql, innodb

Consider this table:

CREATE TABLE `Alarms` (
  `AlarmId` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `DeviceId` BINARY(16) NOT NULL,
  `Code` BIGINT(20) UNSIGNED NOT NULL,
  `Ended` TINYINT(1) NOT NULL DEFAULT '0',
  `NaturalEnd` TINYINT(1) NOT NULL DEFAULT '0',
  `Pinned` TINYINT(1) NOT NULL DEFAULT '0',
  `Acknowledged` TINYINT(1) NOT NULL DEFAULT '0',
  `StartedAt` TIMESTAMP NOT NULL DEFAULT '0000-00-00 00:00:00',
  `EndedAt` TIMESTAMP NULL DEFAULT NULL,
  `MarkedForDeletion` TINYINT(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`AlarmId`),
  KEY `Key1` (`Ended`,`Acknowledged`),
  KEY `Key2` (`Pinned`),
  KEY `Key3` (`DeviceId`,`Pinned`),
  KEY `Key4` (`DeviceId`,`StartedAt`,`EndedAt`),
  KEY `Key5` (`DeviceId`,`Ended`,`EndedAt`),
  KEY `Key6` (`MarkedForDeletion`)
) ENGINE=INNODB;

And, for this test, populate it like so:

-- Populate some dummy data; 500 alarms for each
-- of 1000 one-second periods
SET @testDevice = UNHEX('00030000000000000000000000000000');

DROP PROCEDURE IF EXISTS `injectAlarms`;
DELIMITER ;;
CREATE PROCEDURE injectAlarms()
BEGIN
    SET @fromdate  = '2018-02-18 00:00:00';
    SET @numdates  = 1000;
    SET @todate    = DATE_ADD(@fromdate, INTERVAL @numdates SECOND);

    -- Create table of alarm codes to join on
    DROP TABLE IF EXISTS `__codes`;
    CREATE TEMPORARY TABLE `__codes` (
        `Code` BIGINT NOT NULL PRIMARY KEY
    );

    SET @startcode = 0;
    SET @endcode   = 499;

    REPEAT
       INSERT INTO `__codes` VALUES(@startcode);
       SET @startcode = @startcode + 1;
    UNTIL @startcode > @endcode END REPEAT;

    -- Add an alarm for each code, for each second in range
    REPEAT
        INSERT INTO `Alarms`
            (`DeviceId`, `Code`, `Ended`, `NaturalEnd`, `Pinned`, `Acknowledged`, `StartedAt`, `EndedAt`)
            SELECT
                @testDevice,
                `Code`,
                TRUE, FALSE, FALSE, FALSE,
                @fromdate, @fromdate
            FROM `__codes`;

        SET @fromdate = DATE_ADD(@fromdate, INTERVAL 1 SECOND);
    UNTIL @fromdate > @todate END REPEAT;
END;;
DELIMITER ;

CALL injectAlarms();

Now, for some datasets the following query works quite well:

SELECT * FROM `Alarms`
WHERE
   ((`Alarms`.`Ended` = FALSE AND `Alarms`.`Acknowledged` = FALSE) OR `Alarms`.`Pinned` = TRUE) AND
   `MarkedForDeletion` = FALSE AND
   `DeviceId` = @testDevice
;

This is because MariaDB is clever enough to use index merges, e.g.:

id    select_type    table    type         possible_keys                 
1     SIMPLE         Alarms   index_merge  Key1,Key2,Key3,Key4,Key5,Key6 

key             key_len  ref     rows     Extra
Key1,Key2,Key3  2,1,17   (NULL)  2        Using union(Key1,intersect(Key2,Key3)); Using where

However, if I use the dataset as populated by the procedure above and flip the query around a bit (this is another view I need, but in this case it returns many more rows):

SELECT * FROM `Alarms`
WHERE
  ((`Alarms`.`Ended` = TRUE OR `Alarms`.`Acknowledged` = TRUE) AND `Alarms`.`Pinned` = FALSE) AND
   `MarkedForDeletion` = FALSE AND
   `DeviceId` = @testDevice
;

… it doesn't:

id    select_type    table    type   possible_keys
1     SIMPLE         Alarms   ref    Key1,Key2,Key3,Key4,Key5,Key6

key   key_len  ref     rows     Extra
Key2  1        const  144706    Using where

I would rather like the index merges to happen more often. As it is, given the ref=const, this query plan doesn't look too scary … however, the query takes almost a second to run. That in itself isn't the end of the world, but the poorly-scaling nature of my design shows when trying a more exotic query, which takes a very long time:

-- Create a temporary table that we'll join against in a mo
DROP TABLE IF EXISTS `_ranges`;
CREATE TEMPORARY TABLE `_ranges` (
    `Start` TIMESTAMP NOT NULL DEFAULT 0,
    `End`   TIMESTAMP NOT NULL DEFAULT 0,
    PRIMARY KEY(`Start`, `End`)
);

-- Populate it (in reality this is performed by my application layer)
SET @endtime = 1518992216;
SET @starttime = @endtime - 86400;
SET @inter = 900;
DROP PROCEDURE IF EXISTS `populateRanges`;
DELIMITER ;;
CREATE PROCEDURE populateRanges()
BEGIN
REPEAT
    INSERT IGNORE INTO `_ranges` VALUES(FROM_UNIXTIME(@starttime),FROM_UNIXTIME(@starttime + @inter));
    SET @starttime = @starttime + @inter;
UNTIL @starttime > @endtime END REPEAT;
END;;
DELIMITER ;
CALL populateRanges();

-- Actual query
SELECT UNIX_TIMESTAMP(`_ranges`.`Start`) AS `Start_TS`,
       COUNT(`Alarms`.`AlarmId`) AS `n`
FROM `_ranges`
LEFT JOIN `Alarms`
    ON  `Alarms`.`StartedAt` < `_ranges`.`End`
    AND (`Alarms`.`EndedAt` IS NULL OR `Alarms`.`EndedAt` >= `_ranges`.`Start`)
    AND ((`Alarms`.`EndedAt` IS NULL AND `Alarms`.`Acknowledged` = FALSE) OR `Alarms`.`Pinned` = TRUE)
    -- Again, the above condition is sometimes replaced by:
    -- AND ((`Alarms`.`EndedAt` IS NOT NULL OR `Alarms`.`Acknowledged` = TRUE) AND `Alarms`.`Pinned` = FALSE)
    AND `DeviceId` = @testDevice
    AND `MarkedForDeletion` = FALSE
GROUP BY `_ranges`.`Start`;

(This query is supposed to gather a list of counts per time slice, each count indicating how many alarms' [StartedAt,EndedAt] range intersects that time slice. The result populates a line graph.)
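
The ON clause is the usual interval-overlap test: an alarm [StartedAt, EndedAt] intersects a slice [Start, End) exactly when it starts before the slice ends and has not ended before the slice starts, with a NULL EndedAt meaning the alarm is still open. The test in isolation, with made-up values:

-- Overlap check alone (illustrative values): returns 1
SELECT ('2018-02-18 00:00:05' <  '2018-02-18 00:00:10')   -- StartedAt < End
   AND ('2018-02-18 00:00:07' >= '2018-02-18 00:00:00');  -- EndedAt >= Start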

Again, when I designed these tables and there weren't many rows in them, index merges seemed to make everything whiz along. But now not so: with the dataset as given in injectAlarms(), this takes 40 seconds to complete!

I noticed this when adding the MarkedForDeletion column and performing some of my first large-dataset scale tests. This is why my choice of indexes doesn't make much of the presence of MarkedForDeletion; the results described above are the same if I remove AND MarkedForDeletion = FALSE from my queries. I've kept the condition in, however, as I will ultimately need it to be there.

I've tried a few USE INDEX/FORCE INDEX combinations, but it never seems to use index merge as a result.
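
For example, hints of this shape (the index combinations shown here are illustrative, not an exhaustive record of what was tried):

SELECT * FROM `Alarms` FORCE INDEX (`Key1`, `Key2`, `Key3`)
WHERE
   ((`Alarms`.`Ended` = TRUE OR `Alarms`.`Acknowledged` = TRUE) AND `Alarms`.`Pinned` = FALSE) AND
   `MarkedForDeletion` = FALSE AND
   `DeviceId` = @testDevice
;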

What indexes can I define to make this table fast in the cases given? Or how should I restructure my queries to achieve the same goal?

(The query plans above were obtained on MariaDB 5.5.56 / CentOS 7, but the solution must also work on MySQL 5.1.73 / CentOS 6.)

Answer

Wow! That's the most complicated "index merge" I have seen.

Usually (perhaps always), you can make a 'composite' index to replace an index-merge-intersect, and perform better. Change Key2 from just (Pinned) to (Pinned, DeviceId). This may get rid of the 'intersect' and speed things up.
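
A minimal sketch of that change, rebuilding the index in place under the same name:

ALTER TABLE `Alarms`
    DROP INDEX `Key2`,
    ADD INDEX `Key2` (`Pinned`, `DeviceId`);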

In general, the Optimizer uses index merge only in desperation. (I think this is the answer to the title question.) Any slight changes to the query or the values involved, and the Optimizer will perform the query without index merge.

An improvement on the temp table __codes is to build a permanent table with a large range of values, then use a range of values from that table inside your proc. If you are using MariaDB, use the dynamically built "sequence" tables. For example, the 'table' seq_1_to_100 is effectively a one-column table containing the numbers 1..100, with no need to declare or populate it.
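
As a sketch, the first REPEAT loop in injectAlarms() collapses to one statement this way. (The seq_0_to_499 table needs MariaDB 10.1+ with the SEQUENCE engine; on the 5.5/5.1 versions quoted above, a pre-built permanent numbers table plays the same role.)

-- Codes 0..499 in one set-based statement, no loop
INSERT INTO `__codes`
    SELECT `seq` FROM `seq_0_to_499`;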

You can get rid of the other REPEAT loop by computing the time from Code.
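
A sketch of that, again leaning on a sequence table for illustration; the date literal mirrors @fromdate from the original proc, and 0..1000 matches the loop's inclusive end (@numdates + 1 seconds):

-- One row per (second, code) pair; the timestamp is computed
-- from the sequence value, so the outer REPEAT loop disappears
INSERT INTO `Alarms`
    (`DeviceId`, `Code`, `Ended`, `NaturalEnd`, `Pinned`, `Acknowledged`, `StartedAt`, `EndedAt`)
SELECT
    @testDevice,
    `__codes`.`Code`,
    TRUE, FALSE, FALSE, FALSE,
    DATE_ADD('2018-02-18 00:00:00', INTERVAL `s`.`seq` SECOND),
    DATE_ADD('2018-02-18 00:00:00', INTERVAL `s`.`seq` SECOND)
FROM `__codes`
CROSS JOIN `seq_0_to_1000` AS `s`;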

Avoiding LOOPs will be the biggest performance benefit.

Get all that done, then I may have other tips.
