我应该在 SQL Server 中索引一个位字段吗?

2021-12-21 00:00:00 indexing sql-server

我记得有一次读到索引一个具有低基数(少量不同值)的字段实际上并不值得做.我承认我对索引的工作原理了解不够,无法理解为什么会这样.

I remember reading at one point that indexing a field with low cardinality (a low number of distinct values) is not really worth doing. I admit I don't know enough about how indexes work to understand why that is.

那么,如果我有一个包含 1 亿行的表,并且我正在选择位字段为 1 的记录怎么办?假设在任何时间点,只有少数位字段为 1(而不是 0)的记录.是否值得索引该位字段?为什么?

So what if I have a table with 100 million rows in it, and I am selecting records where a bit field is 1? And let's say that at any point in time, there are only a handful of records where the bit field is 1 (as opposed to 0). Is it worth indexing that bit field or not? Why?

当然我可以只测试它并检查执行计划,我会这样做,但我也对其背后的理论很好奇.基数什么时候重要,什么时候不重要?

Of course I can just test it and check the execution plan, and I will do that, but I'm also curious about the theory behind it. When does cardinality matter and when does it not?

推荐答案

考虑一下 SQL 中的索引是什么——索引实际上是一块指向其他内存块(即指向行的指针)的内存块.索引被分成页,以便索引的部分可以根据使用情况从内存中加载和卸载.

Consider what an index is in SQL - and index is really a chunk of memory pointing at other chunks of memory (i.e. pointers to rows). The index is broken into pages so that portions of the index can be loaded and unloaded from memory depending on usage.

当您请求一组行时,SQL 使用索引比表扫描(查看每一行)更快地查找行.

When you ask for a set of rows, SQL uses the index to find the rows more quickly than table scanning (looking at every row).

SQL 有聚集索引和非聚集索引.我对聚集索引的理解是它们将相似的索引值分组到同一页面中.这样,当您要求与索引值匹配的所有行时,SQL 可以从内存的聚集页面返回这些行.这就是为什么尝试对 GUID 列进行集群索引是一个坏主意 - 您不要尝试对随机值进行集群.

SQL has clustered and non-clustered indexes. My understanding of clustered indexes is that they group similar index values into the same page. This way when you ask for all the rows matching an index value, SQL can return those rows from a clustered page of memory. This is why trying to cluster index a GUID column is a bad idea - you don't try to cluster random values.

当你索引一个整数列时,SQL 的索引为每个索引值包含一组行.如果您的范围是 1 到 10,那么您将有 10 个索引指针.根据有多少行,这可以被不同地分页.如果您的查询查找与1"匹配的索引,然后其中 Name 包含Fred"(假设 Name 列未编入索引),则 SQL 会非常快速地获取与1"匹配的行集,然后进行表扫描以查找其余行.

When you index an integer column, SQL's index contains a set of rows for each index value. If you have a range of 1 to 10, then you would have 10 index pointers. Depending on how many rows there are this can be paged differently. If your query looks for the index matching "1" and then where Name contains "Fred" (assuming the Name column is not indexed), SQL gets the set of rows matching "1" very quickly, then table scans to find the rest.

所以 SQL 真正在做的是试图减少它必须迭代的工作集(行数).

So what SQL is really doing is trying to reduce the working set (number of rows) it has to iterate over.

当你索引一个位域(或一些狭窄的范围)时,你只会通过匹配该值的行数来减少工作集.如果您有少量匹配的行,它将大大减少您的工作集.对于具有 50/50 分布的大量行,与保持索引最新相比,它可能只会给您带来很小的性能提升.

When you index a bit field (or some narrow range), you only reduce the working set by the number of rows matching that value. If you have a small number of rows matching it would reduce your working set a lot. For a large number of rows with 50/50 distribution, it might buy you very little performance gain vs. keeping the index up to date.

每个人都说要测试的原因是因为 SQL 包含一个非常聪明和复杂的优化器,如果它决定表扫描速度更快,或者可能使用排序,或者可能组织内存页,但它非常喜欢,它可能会忽略索引.

The reason everyone says to test is because SQL contains a very clever and complex optimizer that may ignore an index if it decides table scanning is faster, or may use a sort, or may organize memory pages however it darn well likes.

相关文章