Does it make sense to use an index that will have a low cardinality?

2021-12-21 00:00:00 indexing sql mysql cardinality

I'm mainly an ActionScript developer and by no means an expert in SQL, but from time to time I have to develop simple server-side stuff. So, I thought I'd ask more experienced people about the question in the title.

My understanding is that you don't gain much by putting an index on a column that holds few distinct values. I have a column that holds a boolean value (actually it's a small int, but I'm using it as a flag), and this column is used in the WHERE clauses of most of my queries. In a theoretical "average" case, half of the records will have the value 1 and the other half 0. So, in this scenario, the database engine could avoid a full table scan, but would have to read a lot of rows anyway (total rows / 2).

So, should I make this column an index?

For the record, I'm using MySQL 5, but I'm more interested in a general rationale for why it does or does not make sense to index a column that I know will have a low cardinality.

Thanks in advance.

Recommended answer

An index can help even on low cardinality fields if:

  1. When one of the possible values is very infrequent compared to the other values and you search for it.

For instance, there are very few color blind women, so this query:

SELECT  *
FROM    color_blind_people
WHERE   gender = 'F'

would most probably benefit from an index on gender.

  2. When the values tend to be grouped in the table order:

SELECT  *
FROM    records_from_2008
WHERE   year = 2010
LIMIT 1

Though there are only 3 distinct years here, records with earlier years were most probably added first, so without the index a great many records would have to be scanned before the first 2010 record is returned.

  3. When you need ORDER BY / LIMIT:

SELECT  *
FROM    people
ORDER BY
        gender, id
LIMIT 1

Without the index, a filesort would be required. Though it's somewhat optimized due to the LIMIT, it would still need a full table scan.

  4. When the index covers all fields used in the query:

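-- The index name is illustrative; MySQL requires one, plus the target table.
CREATE INDEX ix_lcr_value ON mytable (low_cardinality_record, value);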

SELECT  SUM(value)
FROM    mytable
WHERE   low_cardinality_record = 3

  5. When you need DISTINCT:

    SELECT  DISTINCT color
    FROM    tshirts
    

    MySQL will use INDEX FOR GROUP-BY, and if you have few colors, this query will be instant even with millions of records.

    This is an example of a scenario when the index on a low cardinality field is more efficient than that on a high cardinality field.
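To check whether any of the cases above applies to your own data, one option is to look at the value distribution first and then ask MySQL for its plan. The following is a minimal sketch, not part of the original answer: it reuses the `people` and `tshirts` tables from the examples, and the index names `ix_people_gender_id` and `ix_tshirts_color` are made up for illustration.

```sql
-- How skewed is the column? A rare value is a good candidate for an index lookup.
SELECT gender, COUNT(*) AS cnt
FROM   people
GROUP BY gender;

-- Hypothetical index names; use whatever fits your naming scheme.
CREATE INDEX ix_people_gender_id ON people (gender, id);
CREATE INDEX ix_tshirts_color    ON tshirts (color);

-- EXPLAIN shows the chosen plan without running the query.
-- Check the `key` column (which index is used) and the `Extra` column
-- (a "Using filesort" entry means the sort was not served by the index).
EXPLAIN
SELECT  *
FROM    people
ORDER BY
        gender, id
LIMIT 1;

-- For the DISTINCT case, `Extra` is expected to contain
-- "Using index for group-by" when the optimizer picks that strategy.
EXPLAIN
SELECT  DISTINCT color
FROM    tshirts;
```

The exact EXPLAIN output depends on the MySQL version, the storage engine, and the table statistics, so treat the comments as what to look for, not as guaranteed results.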

Note that if DML performance is not much of an issue, then it's safe to create the index.

If the optimizer thinks that the index is inefficient, the index just will not be used.
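To see whether the optimizer actually picks an index, you can compare its plan with one where the index is ruled out. A minimal sketch, assuming the `mytable` example above and a hypothetical index named `ix_lcr_value` on `(low_cardinality_record, value)`:

```sql
-- Lists the indexes on the table, including MySQL's cardinality estimate for each.
SHOW INDEX FROM mytable;

-- Plan the optimizer chooses on its own; `key` is NULL if it skips the index.
EXPLAIN
SELECT  SUM(value)
FROM    mytable
WHERE   low_cardinality_record = 3;

-- Same query with the (hypothetical) index excluded, for comparison.
EXPLAIN
SELECT  SUM(value)
FROM    mytable IGNORE INDEX (ix_lcr_value)
WHERE   low_cardinality_record = 3;
```

Comparing the `rows` estimates of the two plans, and timing the two queries on realistic data, gives a rough idea of whether the index is worth its DML overhead.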
