Mysql的Clustered and Secondary Indexes非深入研究

vlambda
2022-05-01

Mysql的Clustered and Secondary Indexes非深入研究

写这篇文章的起因是阅读文章<第21期：索引设计（函数索引）>[参考1]看到这样一段文字

<localhost|mysql>explain select * from t1 where date(log_time) = '2100-02-02'\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: t1

partitions: NULL

type: index

possible_keys: NULL

key: idx_log_time

key_len: 9

ref: NULL

rows: 392413

filtered: 100.00

Extra: Using where; Using index

1 row in set, 1 warning (0.00 sec)

.......虽然走了索引，但是扫描行数为总记录数，相当于全表扫，这时候比全表扫还不理想，全表扫直接走聚簇索引还快点。.......

因为不理解这里的聚簇索引的概念，所以特别深入了解了一下，从而算是比较清楚地了解Mysql的索引概念，从而整理出此笔记。

本文的Mysql版本为5.7

首先，要了解Mysql的clustered index 和 Secondary Indexe的概念，参考官方这篇文章[参考2]。这几段文字的信息量很大，所以我翻译一下

Every InnoDB table has a special index called the clustered index where the data for the rows is stored. Typically, the clustered index is synonymous with the primary key. To get the best performance from queries, inserts, and other database operations, you must understand how InnoDB uses the clustered index to optimize the most common lookup and DML operations for each table.

每一个InnoDB表都有一个特别的索引被称为聚簇索引(clustered index)，这个索引也是行记录(rows)存储数据的位置。一般来讲，聚簇索引(clustered index)基本可以等同于主键( primary key)。为了达到最好的查询性能和其它的数据库操作，你必须理解InnoDB如何使用聚簇索引(clustered index)来优化一般的表的DML操作。

When you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index. Define a primary key for each table that you create. If there is no logical unique and non-null column or set of columns, add a new auto-increment column, whose values are filled in automatically.
如果你的表里定义了一个主键(PRIMARY KEY),InnoDB就使用主键来作为聚簇索引。所以你应该为每个表定义一个主键，如果你找不到逻辑上遵循唯一性约束的列或者列的集合，那么就制造一个自增长的列吧，这样就可以让这个列作为主键。

kursk : 我开始认为innodb使用PRIMARY KEY的字段作为clustered index，后来才认识到我的认知是错误的，如果存在PRIMARY KEY，那么PRIMARY KEY就是clustered index，这也是为什么下文官方文档对secondary indexes的定义是“All indexes other than the clustered index are known as secondary indexes”，我第一次看到这段文字时就觉得很奇怪，现在才体会到因为实在不好定义。如果存在PRIMARY KEY，则除PRIMARY KEY之外的索引就是secondary indexes；如果不存在PRIMARY KEY，则the first UNIQUE index是clustered index，而除了这个UNIQUE index之外的索引都是secondary indexes，所以secondary indexes的类型是变化的，官方文档这种奇怪的定义文字是因为没有其他的更好的定义方法。

另外一个有意思的问题是，如果在create table时不创建PRIMARY KEY，先创建一个UNIQUE index，然后再创建一个PRIMARY KEY，那么clustered index会变化吗？根据Stackoverflow[参考3]的答案，似乎是会变化的

If you do not define a PRIMARY KEY for your table, MySQL locates the first UNIQUE index where all the key columns are NOT NULL and InnoDB uses it as the clustered index.
如果你实在为你的表制造不了一个主键，那么Mysql就会将表中第一个唯一索引(UNIQUE index)的所有不为null的键值作为聚簇索引。
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
如果表中既没有主键，也没有唯一索引，InnoDB就会在表内产生一个隐藏的聚簇索引，名称叫做GEN_CLUST_INDEX，这个索引在一个隐藏列(synthetic column)上，该隐藏列存储了行ID的值，这个隐藏列上的行按照行ID排序，行ID是6字节的单调增长并且按照此顺序插入，因此隐藏列上的物理存储就是有序的(插入顺序)。

How the Clustered Index Speeds Up Queries

Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.

聚簇索引如何提高查询速度

通过聚簇索引能提高查询速度，是因为索引直接指向记录(row data)所在的分页(the page)，如果一个表比较大，则通过聚簇索引查询会节省磁盘IO操作——因为一个同样大小的块(datablock)存储的记录的指向数量，存储索引的块要比存储表记录的块要多。kursk:这里没法文字翻译，只能偷懒根据意思翻译了。

How Secondary Indexes Relate to the Clustered Index

All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.

If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

第二索引(Secondary Indexes)如何与聚簇索引相关联

Mysql中所有不是聚簇索引的索引被称为第二索引(secondary indexes)，在InnoDB中，每个第二索引的每一条记录(each record)都包括该行的主键列(the primary key columns)，及为二级索引指定的列。InnoDB通过在聚簇索引中使用主键的值查找，从而找到对应的行。

这就是为什么主键比较短会有查询优势，因为第二索引存储空间更小——意味着更少的数据块可以存储更多的行记录。

InnoDB中的第二索引其实就是我们日常使用的索引（ PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT），上述红色字体说明通过第二索引查找某条记录的过程是这样的。

根据query语句找到对应的索引(第二索引)
根据条件遍历索引，找到索引中存储的某条行记录record
获取该record记录的主键(the primary key)的值，在聚簇索引中根据主键(the primary key)的值查找找到对应的记录指向
根据记录指向找到对应的记录(record)

所以如果一次查询要获得记录(record)，则通过索引查询该记录时，实际上是需要通过聚簇索引来获得的。

回到文章<第21期：索引设计（函数索引）>中问题，

.......虽然走了索引，但是扫描行数为总记录数，相当于全表扫，这时候比全表扫还不理想，全表扫直接走聚簇索引还快点。.......

<localhost|mysql>explain select * from t1 where date(log_time) = '2100-02-02'\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: t1

partitions: NULL

type: index

possible_keys: NULL

key: idx_log_time

key_len: 9

ref: NULL

rows: 392413

filtered: 100.00

Extra: Using where; Using index

1 row in set, 1 warning (0.00 sec)

这个例子中，type为index，说明扫描整个索引，而且通过索引获得行的主键值后，还要再通过聚簇索引上查找主键，找到对应的记录；所以，还不如直接走聚簇索引——少一次索引全扫描的过程，前者反而更慢。

这里其实我有一个疑问，mysql设计上为什么“而且通过索引获得行的主键值后，还要再通过聚簇索引上查找主键”，而不在索引的叶子节点直接存储该记录的指向，而不存储主键，不就可以直接通过记录的指向找到记录，而不需要通过主键这个中间环节，这样不是更快吗？

结合笔记<Mysql Explain执行计划的官方文档重点解释>，难道eq_ref，ref也要通过聚簇索引上查找主键对应的记录指向才可以吗？

为此我继续google，官方没有更多的解释，但是Medium上的一篇文章倒是证实了我的想法，

<A Brief Introduction to Cluster Index and Secondary Index in InnoDB> [参考4]这篇文章用图形对Clustered indexes和Secondary Indexes进行了对比解释，很清楚。

secondary index 一樣也是用 B-Tree 資料結構，不同的是 leaf node 只存 key 值和 clustered key 值 (通常是 primary key)，用以去 clustered index 找 row data

clustered Index

the second Index

注意：这篇文章中的“Page”与我本笔记中的“数据块(Data Block)”的概念相同。

尤其这篇文章最后的总结也有这么一段文字，红色字体正是我的猜想

The page is the minimum unit for InnoDB to access data, it’s waste and inefficiency to load a page and then access read only one record. Using the primary key to create locality of reference for improving efficiency.

If using the secondary index to read the records, MySQL needs to look up the clustered index, also as known from point 1 above, it is inefficiency to access a single record. Let the secondary index cover all required columns is great for performance.

最后说一说索引组织表( index-organized table)，Mysql的表天生就是索引组织表，clustered index就是该表上的索引结构。

参考

[1]https://mp.weixin.qq.com/s/b068cQvDW9ybz0vH2ttFNA

[2]https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html

[3]https://stackoverflow.com/questions/51272294/what-does-changing-the-primary-key-of-a-table-incur-in-mysql

[]https://medium.com/@genchilu/a-brief-introduction-to-cluster-index-and-secondary-index-in-innodb-9b8874d4da6a#id_token=eyJhbGciOiJSUzI1NiIsImtpZCI6ImZkMjg1ZWQ0ZmViY2IxYWVhZmU3ODA0NjJiYzU2OWQyMzhjNTA2ZDkiLCJ0eXAiOiJKV1QifQ.eyJpc3MiOiJodHRwczovL2FjY291bnRzLmdvb2dsZS5jb20iLCJuYmYiOjE2MTI2ODQ1MDAsImF1ZCI6IjIxNjI5NjAzNTgzNC1rMWs2cWUwNjBzMnRwMmEyamFtNGxqZGNtczAwc3R0Zy5hcHBzLmdvb2dsZXVzZXJjb250ZW50LmNvbSIsInN1YiI6IjExNzc2NzMzNzI1OTk2MzAwNDI5MyIsImVtYWlsIjoia3Vyc2sueWVAZ21haWwuY29tIiwiZW1haWxfdmVyaWZpZWQiOnRydWUsImF6cCI6IjIxNjI5NjAzNTgzNC1rMWs2cWUwNjBzMnRwMmEyamFtNGxqZGNtczAwc3R0Zy5hcHBzLmdvb2dsZXVzZXJjb250ZW50LmNvbSIsIm5hbWUiOiJ5ZSBrdXJzayIsInBpY3R1cmUiOiJodHRwczovL2xoMy5nb29nbGV1c2VyY29udGVudC5jb20vYS0vQU9oMTRHZ0lwNU5hdFJNempiazBraGxwNjVkbkxMa1lnR1luVk1sQzdxSlJPUT1zOTYtYyIsImdpdmVuX25hbWUiOiJ5ZSIsImZhbWlseV9uYW1lIjoia3Vyc2siLCJpYXQiOjE2MTI2ODQ4MDAsImV4cCI6MTYxMjY4ODQwMCwianRpIjoiYmRiOTBiMjNjYWYxYTJhNDQ3NzYzMjQ1OGM1NDgwNTk5NTNhZTJmNyJ9.oNx6QPZug6l11FRa8ktmqFTa8yvJDTfws8W2tWnXuNPZd19hfp9h9waiaZKxlEHKZNxPXimY839PyIp2Qf7GV-aRZ67-JFKVuMq2eEqyPSl_FtLolTsAY4ILSn73tuQ1YmS3RYjqe9fLLbuL5RyQrpYdIAC_8bTeYyuHjQS1_dq3jgQTp3sB3i4YHliH3rbRmzsT0TvOxg6gigJHWiyyyTLdilefZhYo-8TfnOl8fSInN1JKOgRl8w0VsJjDhze_35jttBgbBym1QOO7g3yb4d6vriFrz38A9iaaMwpfK7tYkSz3flNHcaqNMrmqZq3WTzksYcQ2cEc9uf6492k6fw