数据库如何在 B-Tree/B+Tree 内部存储数据
我的问题是数据库如何存储数据以及它如何在内部执行查询.
My question is that How database stores data and how it performs query internally.
假设我们的表中有以下字段:
Suppose we have following fields in our table:
- 身份证
- 姓名
- 年龄
- 重量
- 经理
我们查询 select * from Table1 where age>50 and weight<100
我只是好奇它如何在内部执行查询.
I am just curious that how it perform query internally.
本例中B-Tre/B+Tree的节点包含什么?
What will the Node of B-Tre/B+Tree contains in this example?
推荐答案
您选择的示例是单个 Tree 无法完成工作的少数情况之一(两个独立的范围).
The example you have chosen is one of the few cases where a single Tree can't do the job (two independent ranges).
然而,我正在开发的电子书的第一章解释了 B-Tree 索引的内部工作原理:http://use-the-index-luke.com/anatomy/
However, the first chapter of my work-in-progress e-Book explains the inner workings of B-Tree indexes: http://use-the-index-luke.com/anatomy/
编辑了解为什么两个索引可能对上述示例有用.
EDIT for more details why two indexes might be useful for the above example.
上述查询可以通过三种可能的索引配置来支持:
The above query can be supported by three possible index configurations:
AGE
和WEIGHT
上的串联索引(按此顺序).
万一,查询将读取所有记录WHERE AGE >50
然后按WEIGHT
过滤.
concatenated index on
AGE
and thenWEIGHT
(in this order).
In case, the query would read all recordsWHERE AGE > 50
and then filter byWEIGHT
.
WEIGHT
和 AGE
上的串联索引(另一个顺序).
这是不同的方式:读取所有记录 WHERE WEIGHT <100
然后按 AGE
过滤.
concatenated index on WEIGHT
and then AGE
(the other order).
That goes the different way: read all records WHERE WEIGHT < 100
and then filter by AGE
.
哪个更有效取决于您拥有的数据.如果记录较少 AGE >50
比 WEIGHT
100
第一个效率更高,否则第二个.但是,如果您使用不同的值进行查询,则图片可能会发生变化.
Whatever is more efficient depends on the data you have. If there are less records AGE > 50
than WEIGHT < 100
the first will be more efficient, otherwise the second. However, if you query with different values, the picture might change.
级联索引不能很好地支持查询的原因是每个索引顺序只在一个轴上.每个索引条目都在另一个之前或之后,但从不在它旁边.所有索引条目构建一个链.
The reason that a concatenated index can't support the query well is that each index order is on one axis only. each index entry is before or after another one, but never next to it. All index entries build one chain.
具有两个独立范围查询的查询需要两个轴,不像链,但更像棋盘.AGE
的一个轴,WEIGHT
的另一个轴.如果可以的话,您的查询将只需要扫描棋盘的一个角.
A query that has two independent range queries would require two axes, not like a chain, but more like a chess board. one axis for AGE
the other for WEIGHT
. If that would be possible, your query would need to scan only one corner of the chess board.
但是,b 树只有一个轴,因此您必须先选择要使用的标准.如果您选择AGE
,则表示从AGE 50
开始,将扫描整个链直到结束.只有部分存储在链端的记录也符合 WEIGHT
WEIGHT
WEIGHT
WEIGHT <100
,其他记录必须读取但会被丢弃.
However, a b-tree has only one axis, hence you must chose which criteria to use first. If you chose AGE
it means that starting with AGE 50
, the entire chain will be scanned until the end. Only some of the records stored at the side of the chain will also qualify for WEIGHT < 100
, the other records must be read but will be discarded.
所以,长篇大论来解释为什么一棵树不能支持具有两个独立范围子句的查询.另一方面,一个连接索引可以很好地完成以下工作:
So, a long story to explain why one tree can not support a query with two independent range clauses. On the other hand, one concatenated index can do the following quite well:
WHERE age = 50 AND weight < 100
WHERE weight = 100 AND age > 50
WHERE age > 50 AND age < 70;
但是,如果在两个不同的列上使用两个不等式运算符,就会出现问题.
However, the problem arises if there are two inequality operators are used on two different columns.
那么,该怎么办?
第三种可能的方法是在两列上有两个独立的索引.这允许您拥有任意数量的轴(只需创建更多索引).但是,这存在一些巨大问题.首先,并非所有数据库产品都支持这一点.只要支持它,它就是一个相当广泛的操作.它通常以扫描每个索引的方式工作,为每个结果构建一个位图索引.然后连接这些位图索引以应用 AND
运算符.这需要大量的数据处理——只有当每个条件对它自己的选择性不是很强时才值得付出努力,但两者加起来都非常有选择性.
The third possible approach is to have two independent indexes on the two columns. That allows to have as many axes as you like (just create more indexes). However, there are a few huge problems with that. First of all, not all DB products support that. Whenever it is supported, it is a rather expansive operation. It works typically that way that each index is scanned, a bitmap index is built for each result. Those bitmap indexes are then joined to apply the AND
operator. That's a lot of data munging--it is only worth the effort if each condition is not very selective for it's own, but both together are very selective.
不想听我的建议吗?
如果您的查询在 OLTP 环境中运行:使用一个连接索引.两个独立的索引只是最后的选择.但是,如果您在 OLAP 环境中工作,您可能仍然需要位图索引.
If your query runs in an OLTP environment: use one concatenated index. Two independent indexes are an option of last resort only. However, if you are working in an OLAP environment, you might anyways need bitmap indexes.
ps:索引 AGE
是一个 exercise 在我的书中(有解决方案)——尤其是因为存储 AGE
是一种不好的做法,您应该存储出生日期.
ps.:
Indexing AGE
was an exercise in my book (with solution)--especially because storing AGE
is a bad practice, you should store the date of birth instead.
相关文章