表扫描和聚集索引扫描有什么区别?

2021-12-21 00:00:00 indexing sql sql-server

既然 Table ScanClustered Index Scan 本质上都是扫描表中的所有记录,那么为什么 Clustered Index Scan 应该更好?

Since both a Table Scan and a Clustered Index Scan essentially scan all records in the table, why is a Clustered Index Scan supposedly better?

举个例子 - 当有很多记录时,以下之间的性能差异是什么?:

As an example - what's the performance difference between the following when there are many records?:

declare @temp table(
    SomeColumn varchar(50)
)

insert into @temp
select 'SomeVal'

select * from @temp

-----------------------------

declare @temp table(
    RowID int not null identity(1,1) primary key,
    SomeColumn varchar(50)
)

insert into @temp
select 'SomeVal'

select * from @temp

推荐答案

在没有聚集索引的表(堆表)中,数据页没有链接在一起——所以遍历页需要一个查找索引分配图.

In a table without a clustered index (a heap table), data pages are not linked together - so traversing pages requires a lookup into the Index Allocation Map.

然而,聚簇表将数据页链接在双向链表中 - 使顺序扫描更快一点.当然,作为交换,您有处理在 INSERTUPDATEDELETE 上保持数据页顺序的开销.但是,堆表需要对 IAM 进行第二次写入.

A clustered table, however, has it's data pages linked in a doubly linked list - making sequential scans a bit faster. Of course, in exchange, you have the overhead of dealing with keeping the data pages in order on INSERT, UPDATE, and DELETE. A heap table, however, requires a second write to the IAM.

如果您的查询具有 RANGE 运算符(例如:SELECT * FROM TABLE WHERE Id BETWEEN 1 AND 100),则为聚簇表(按保证顺序)会更有效 - 因为它可以使用索引页来查找相关数据页.堆必须扫描所有行,因为它不能依赖排序.

If your query has a RANGE operator (e.g.: SELECT * FROM TABLE WHERE Id BETWEEN 1 AND 100), then a clustered table (being in a guaranteed order) would be more efficient - as it could use the index pages to find the relevant data page(s). A heap would have to scan all rows, since it cannot rely on ordering.

当然,聚集索引可以让您执行 CLUSTERED INDEX SEEK,这对于性能来说非常理想……没有索引的堆总是会导致表扫描.

And, of course, a clustered index lets you do a CLUSTERED INDEX SEEK, which is pretty much optimal for performance...a heap with no indexes would always result in a table scan.

所以:

  • 对于您选择所有行的示例查询,唯一的区别是聚集索引维护的双向链表.这应该会使您的聚簇表比具有大量行的堆快一点点.

  • For your example query where you select all rows, the only difference is the doubly linked list a clustered index maintains. This should make your clustered table just a tiny bit faster than a heap with a large number of rows.

对于带有 WHERE 子句的查询,该子句可以(至少部分地)被聚集索引满足,你会因为排序而领先 - 所以你不会必须扫描整个表.

For a query with a WHERE clause that can be (at least partially) satisfied by the clustered index, you'll come out ahead because of the ordering - so you won't have to scan the entire table.

对于聚簇索引不满足的查询,您几乎可以......同样,唯一的区别是用于顺序扫描的双向链表.无论哪种情况,你都不是最理想的.

For a query that is not satisified by the clustered index, you're pretty much even...again, the only difference being that doubly linked list for sequential scanning. In either case, you're suboptimal.

对于INSERTUPDATEDELETE,堆可能会或可能不会获胜.堆不必维护顺序,但确实需要对 IAM 进行第二次写入.我认为相对性能差异可以忽略不计,但也非常依赖于数据.

For INSERT, UPDATE, and DELETE a heap may or may not win. The heap doesn't have to maintain order, but does require a second write to the IAM. I think the relative performance difference would be negligible, but also pretty data dependent.

Microsoft 有一份白皮书,将聚集索引与堆上的等效非聚集索引(与我上面讨论的不完全相同,但很接近).他们的结论基本上是在所有表上放置聚集索引.我会尽力总结他们的结果(同样,请注意,他们在这里确实将非聚集索引与聚集索引进行了比较 - 但我认为它具有相对可比性):

Microsoft has a whitepaper which compares a clustered index to an equivalent non-clustered index on a heap (not exactly the same as I discussed above, but close). Their conclusion is basically to put a clustered index on all tables. I'll do my best to summarize their results (again, note that they're really comparing a non-clustered index to a clustered index here - but I think it's relatively comparable):

  • INSERT 性能:由于堆需要第二次写入,聚集索引的性能提高了大约 3%.
  • UPDATE 性能:由于堆需要第二次查找,聚集索引的性能提高了约 8%.
  • DELETE 性能:由于需要第二次查找以及需要从 IAM 中对堆进行第二次删除,聚集索引的性能提高了约 18%.
  • 单个 SELECT 性能:由于堆需要第二次查找,聚集索引胜出约 16%.
  • range SELECT 性能:由于堆的随机排序,聚集索引胜出约 29%.
  • 并发INSERT:由于聚集索引的页面拆分,堆表在负载下胜出 30%.
  • INSERT performance: clustered index wins by about 3% due to the second write needed for a heap.
  • UPDATE performance: clustered index wins by about 8% due to the second lookup needed for a heap.
  • DELETE performance: clustered index wins by about 18% due to the second lookup needed and the second delete needed from the IAM for a heap.
  • single SELECT performance: clustered index wins by about 16% due to the second lookup needed for a heap.
  • range SELECT performance: clustered index wins by about 29% due to the random ordering for a heap.
  • concurrent INSERT: heap table wins by 30% under load due to page splits for the clustered index.

相关文章