聚集索引应该放在哪一列?

2021-12-21 00:00:00 indexing sql sql-server database-design

最近,我一直在阅读所有类型的索引,主要建议是将聚集索引放在表的主键上,但是如果查询中实际上没有使用主键(通过select 或 join)并且只是出于纯粹的关系目的而放置,因此在这种情况下不会对其进行查询.例如,假设我有一个 car_parts 表,它包含 3 列,car_part_idcar_part_nocar_part_title.car_part_id 是唯一的主键标识列.在这种情况下,car_part_no 也是唯一的,很可能是 car_part_title.car_part_no 是查询最多的,所以将聚集索引放在该列而不是 car_part_id 上是否有意义?问题的基础是哪一列实际上应该有聚集索引,因为你只允许其中之一?

Lately, I have been doing some reading on indexes of all types and the main advice is to put the clustered index on the primary key of the table, but what if the primary key actually is not used in a query (via a select or join) and is just put for purely relational purposes, so in this case it is not queried against. Example, say I have a car_parts table and it contains 3 columns, car_part_id, car_part_no, and car_part_title. car_part_id is the unique primary key identity column. In this case car_part_no is unique as well and is most likely car_part_title. car_part_no is what is most queried against, so doesn't it make sense to put the clustered index on that column instead of car_part_id? The basics of the question is what column should actually have the clustered index since you are only allowed one of them?

推荐答案

当且仅当索引中最左边的键被过滤时,查询优化器可以使用聚集或非聚集的索引.因此,如果您在列 (A, B, C) 上定义索引,则在 B=@bC=@cB 上定义 WHERE 条件=@b AND C=@c 不会充分利用索引(见注释).这也适用于加入条件.任何包含 A 的 WHERE 过滤器都会考虑索引:A=@aA=@a AND B=@bA=@a AND C=@cA=@a AND B=@b AND C=@c.

An index, clustered or non clustred, can be used by the query optimizer if and only if the leftmost key in the index is filtered on. So if you define an index on columns (A, B, C), a WHERE condition on B=@b, on C=@c or on B=@b AND C=@c will not fully leverage the index (see note). This applies also to join conditions. Any WHERE filter that includes A will consider the index: A=@a or A=@a AND B=@b or A=@a AND C=@c or A=@a AND B=@b AND C=@c.

因此,在您的示例中,如果您将 part_no 上的集群索引设为最左边的键,那么查找特定 part_id 的查询将不是 使用索引,并且 part-id 上必须存在单独的非聚集索引.

So in your example if you make the clustred index on part_no as the leftmost key, then a query looking for a specific part_id will not use the index and a separate non-clustered index must exist on part-id.

现在关于许多索引中的哪一个应该是聚集索引的问题.如果您有几个查询模式的重要性和频率大致相同,并且在所需的键方面相互矛盾(例如,通过 either part_no 的频繁查询part_id) 然后你考虑其他因素:

Now about the question which of the many indexes should be the clustered one. If you have several query patterns that are about the same importance and frequency and contradict each other on terms of the keys needed (eg. frequent queries by either part_no or part_id) then you take other factors into consideration:

  • width:聚集索引键被所有其他非聚集索引用作查找键.因此,如果您选择一个宽键(比如两个 uniquidentifier 列),那么您将使所有其他索引变宽,从而消耗更多空间,产生更多 IO 并减慢一切.因此,从读取的角度来看,在同等良好的键之间,选择最窄的作为聚集键,将较宽的键设为非聚集键.
  • 争用:如果您有特定的插入和删除模式,请尝试在物理上将它们分开,以便它们出现在聚集索引的不同部分.例如.如果表作为一个队列,所有插入在一个逻辑端,所有删除在另一个逻辑端,请尝试布局聚簇索引,以便物理顺序与此逻辑顺序匹配(例如入队顺序).
  • 分区:如果表非常大并且您计划部署分区,那么分区键必须是聚集索引.典型的例子是使用滑动窗口分区方案归档的历史数据.即使实体具有像entity_id"这样的逻辑主键,集群索引也是由日期时间列完成的,该列也用于分区功能.
  • 稳定性:经常更改的键不适合用作聚簇键,因为每个键都会更新聚簇键值并强制所有非聚簇索引更新查找他们存储的密钥.由于聚集键的更新也可能会将记录重新定位到不同的页面,因此可能会导致聚集索引出现碎片.
  • width: the clustered index key is used as the lookup key by all other non-clustered indexes. So if you choose a wide key (say two uniquidentifier columns) then you are making all the other indexes wider, thus consuming more space, generating more IO and slowing down everything. So between equaly good keys from a read point of view, choose the narrowest one as clustered and make the wider ones non-clustered.
  • contention: if you have specific patterns of insert and delete try to separate them physically so they occur on different portions of the clustered index. Eg. if the table acts as a queue with all inserts at one logical end and all deletes at the other logical end, try to layout the clustered index so that the physical order matches this logical order (eg. enqueue order).
  • partitioning: if the table is very large and you plan to deploy partioning then the partitioning key must be the clustered index. Typical example is historical data that is archived using a sliding window partitioning scheme. Even thow the entities have a logical primary key like 'entity_id', the clustred index is done by a datetime column that is also used for the partitioning function.
  • stability: a key that changes often is a poor candidate for a clustered key as each update the clustered key value and force all non-clustered indexes to update the lookup key they store. As an update of a clustered key will also likely relocate the record into a different page it can cause fragmentation on the clustered index.

注意:不是完全利用,因为有时引擎会选择非聚集索引来扫描而不是聚集索引,因为它更窄,因此需要的页面更少扫描.在我的示例中,如果您在 (A, B, C) 上有一个索引,在 B=@b 上有一个 WHERE 过滤器,并且查询项目 C,则该索引很可能是使用但不是作为搜索,作为扫描,因为仍然比完整集群扫描(更少的页面)快.

Note: not fully leverage as sometimes the engine will choose an non-clustered index to scan instead of the clustered index simply because is narrower and thus has fewer pages to scan. In my example if you have an index on (A, B, C) and a WHERE filter on B=@b and the query projects C, the index will be likely used but not as a seek, as a scan, because is still faster than a full clustered scan (fewer pages).

相关文章