MySQL权限表损坏导致无法启动
作者 岳雷 · 沃趣科技数据库工程师
出品 沃趣科技
CentOS 7.5 MySQL 5.7.31
mysql> create user test@'%' identified by '1234';
Query OK, 0 rows affected (0.01 sec)
然后关闭mysqld;这里需要介绍一下,我们集群的关闭方式是如下方式:
shell> kubectl delete pod yuelei-57-f3b6200- -n qfusion-admin
shell> kill -9 $(pidof mysqld)
然后重启mysqld,操作如下:
shell> mysqld_safe --defaults-file=/etc/mysql/my.cnf &
此时问题复现了,mysqld启动失败,我们查看了下error日志,信息如下:
...
2020-08-18T10:13:55.506415+08:00 0 [ERROR] /usr/local/mysql/bin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
2020-08-18T10:13:55.506553+08:00 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table './mysql/user' is marked as crashed and should be repaired
2020-08-18T10:13:55.506658+08:00 [ERROR] Fatal error: Failed to initialize ACL/grant/time zones structures or failed to remove temporary table files.
2020-08-18T10:13:55.506789+08:00 [ERROR] Aborting
...
根据报错信息可以看出:MySQL的权限系统表发生了损坏,导致了mysqld启动失败;由于在MySQL 5.7及其之前版本该表是MyISAM引擎,且该引擎不支持事务,所以在mysqld异常崩溃会导致该类型引擎表的损坏;但在mysqld启动时是有参数控制MyISAM引擎的恢复模式,且该参数在我们产品中也配置到了my.cnf中,如下所示:
[mysqld]
myisam_recover_options=BACKUP,FORCE
2.3 参数解析
OFF:No recovery. DEFAULT:Recovery without backup, forcing, or quick checking. BACKUP:If the data file was changed during recovery, save a backup of the tbl_name.MYD file as tbl_name-datetime.BAK. FORCE:Run recovery even if we would lose more than one row from the .MYD file. QUICK:Do not check the rows in the table if there are not any delete blocks.
[mysqld]
skip_grant_tables
(2)重启mysqld,然后修复mysql schema下的所有表。
shell> mysqlcheck -uroot -pxxx mysql --auto-repair
...
mysql.user
warning : 1 client is using or hasn't closed the table properly
status : OK
(3)在[mysqld]标签下注释或删除掉skip_grant_tables,然后重启mysqld。
shell> mysqld_safe --defaults-file=/etc/mysql/my.cnf &
通常情况下在my.cnf中设置myisam_recover_options=BACKUP,FORCE时,启动时会自动修复MyISAM损坏的表;在上述场景中为什么没有自动修复呢? 相同环境下,5.7.26版本mysqld启动是正常的,但是5.7.31版本mysqld启动出现如上述现象;难道是MySQL做了改动吗?
### 确认MySQL版本
shell> cat VERSION
MYSQL_VERSION_MAJOR=5
MYSQL_VERSION_MINOR=7
MYSQL_VERSION_PATCH=31
MYSQL_VERSION_EXTRA=
### 搜索一些关键信息
shell> grep -r "initialize ACL/grant/time" *
mysql-test/r/grant_debug.result:Pattern "Fatal error: Failed to initialize ACL/grant/time zones structures or failed to remove temporary table files." found
mysql-test/t/grant_debug.test:let SEARCH_PATTERN=Fatal error: Failed to initialize ACL/grant/time zones structures or failed to remove temporary table files.;
sql/mysqld.cc: sql_print_error("Fatal error: Failed to initialize ACL/grant/time zones "
搜索得出的结果很多,我们需要对此进行过滤。mysql-test目录是mysql的测试用例代码,我们可以直
接忽略;需要关注的是sql/mysqld.cc的文件,因为mysqld启动时调用main->mysqld_main->...,而mysqld_main函数是在sql/mysqld.cc目录下,此时我们查看具体的代码(sql/mysqld.cc:4958):
if (mysql_rm_tmp_tables() || acl_init(opt_noacl) ||
my_tz_init((THD *), default_tz_name, opt_bootstrap) ||
grant_init(opt_noacl))
{
abort_loop= true;
sql_print_error("Fatal error: Failed to initialize ACL/grant/time zones "
"structures or failed to remove temporary table files.");
delete_pid_file(MYF(MY_WME));
unireg_abort(MYSQLD_ABORT_EXIT);
}
定位到相关代码,大概是sql/mysqld.cc的4958行,且存在if条件判断,此时我们开始调试:
shell> gdb /usr/local/mysql/bin/mysqld
(gdb) b sql/mysqld.cc:4958
Breakpoint 1 at xe6f8d0: file /opt/mysql-server/sql/mysqld.cc, line 4958.
(gdb) r --defaults-file=/etc/mysql/my.cnf
(gdb) p mysql_rm_tmp_tables()
$1 = '\000'
(gdb) p acl_init(opt_noacl)
$2 = 1 '\001'
(gdb) p my_tz_init((THD *), default_tz_name, opt_bootstrap)
$3 = '\000'
(gdb) p grant_init(opt_noacl)
$4 = false
通过以上调试信息,可以判断出acl_init函数返回的值为真;此时我们查看该函数的代码 (sql/auth/sql_auth_cache.cc:1365):
/*
Initialize structures responsible for user/db-level privilege checking and
load privilege information for them from tables in the 'mysql' database.
SYNOPSIS
acl_init()
dont_read_acl_tables TRUE if we want to skip loading data from
privilege tables and disable privilege checking.
NOTES
This function is mostly responsible for preparatory steps, main work
on initialization and grants loading is done in acl_reload().
RETURN VALUES
ok
1 Could not initialize grant's
*/
my_bool acl_init(bool dont_read_acl_tables)
{
THD *thd;
my_bool return_val;
DBUG_ENTER("acl_init");
...
根据该函数的注释发现:该函数是初始化负责用户/数据库级特权检查的结构,并从mysql schema中的表中为其加载特权信息;且return值为1代表的是初始化权限失败。
此后开始逐步调试,观察return相关信息,当调试到lock_table_names函数时,我们发现在Phase 3时return值为true,且根据代码注释发现true代表是Failure;具体代码如下(sql/sql_base.cc:5549):
/**
Acquire "strong" (SRO, SNW, SNRW) metadata locks on tables used by
LOCK TABLES or by a DDL statement.
Acquire lock "S" on table being created in CREATE TABLE statement.
@note Under LOCK TABLES, we can't take new locks, so use
open_tables_check_upgradable_mdl() instead.
@param thd Thread context.
@param tables_start Start of list of tables on which locks
should be acquired.
@param tables_end End of list of tables.
@param lock_wait_timeout Seconds to wait before timeout.
@param flags Bitmap of flags to modify how the tables will be
open, see open_table() description for details.
@retval false Success.
@retval true Failure (e.g. connection was killed)
*/
bool
lock_table_names(THD *thd,
TABLE_LIST *tables_start, TABLE_LIST *tables_end,
ulong lock_wait_timeout, uint flags)
{
MDL_request_list mdl_requests;
TABLE_LIST *table;
MDL_request global_request;
Hash_set<TABLE_LIST, schema_set_get_key> schema_set(PSI_INSTRUMENT_ME);
bool need_global_read_lock_protection= false;
...
// Phase 3: Acquire the locks which have been requested so far.
if (thd->mdl_context.acquire_locks(&mdl_requests, lock_wait_timeout))
return true;
/*
Now when we have protection against concurrent change of read_only
option we can safely re-check its value.
Skip the check for FLUSH TABLES ... WITH READ LOCK and
FLUSH TABLES ... FOR EXPORT as they are not supposed to be affected
by read_only modes.
*/
if (need_global_read_lock_protection &&
!(flags & MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK) &&
!(flags & MYSQL_LOCK_IGNORE_GLOBAL_READ_ONLY) &&
check_readonly(thd, true))
return true;
...
调试信息如下:
(gdb) p flags
$7 = 0
(gdb) p need_global_read_lock_protection
$8 = true
(gdb) p check_readonly(thd, true)
$9 = true
可以看到flags的值为,而MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK为宏定义值0x1000,与flags的值
做按位与操作,结果自然也是,当然MYSQL_LOCK_IGNORE_GLOBAL_READ_ONLY也是如此;need_global_read_lock_protection是bool类型值,代表是否需要全局读锁的保护,这个值是在table-
>mdl_request.type不为MDL_SHARED_READ_ONLY发生改变;check_readonly函数相关信息⻅下面概述。
shell> git blame -L 5637,+10 sql/sql_base.cc
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5637) Now when we have protection against concurrent change of read_only
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5638) option we can safely re-check its value.
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5639) Skip the check for FLUSH TABLES ... WITH READ LOCK and
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5640) FLUSH TABLES ... FOR EXPORT as they are not supposed to be affected
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5641) by read_only modes.
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5642) */
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5643) if (need_global_read_lock_protection &&
0405ebee (Nisha Gopalakrishnan 2019-09-12 19:24:45 +0530 5644) !(flags & MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK) &&
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5645) !(flags & MYSQL_LOCK_IGNORE_GLOBAL_READ_ONLY) &&
05824063 (Nisha Gopalakrishnan 2019-09-11 13:06:40 +0530 5646) check_readonly(thd, true))
上述展示的信息中,左侧的列值为commit id为05824063和0405ebee,有兴趣的同学可以详细看下。
Under certain conditions, enabling the read_only or super_read_only system variable did not block concurrent DDL statements executed by users without the SUPER privilege. (Bug #28438114, Bug #91852)
后我们再查看下check_readonly函数,该函数是基于read_only和super_read_only状态执行标准化检查,是禁止(TRUE)还是允许(FALSE)操作。代码如下(sql/auth/sql_authorization.cc:489):
/**
@brief Performs standardized check whether to prohibit (TRUE)
or allow (FALSE) operations based on read_only and super_read_only
state.
@param thd Thread handler
@param err_if_readonly Boolean indicating whether or not
to add the error to the thread context if read-only is
violated.
@returns Status code
@retval TRUE The operation should be prohibited.
@ retval FALSE The operation should be allowed.
*/
bool check_readonly(THD *thd, bool err_if_readonly)
{
DBUG_ENTER("check_readonly");
/* read_only=OFF, do not prohibit operation: */
if (!opt_readonly)
DBUG_RETURN(FALSE);
...
此时反应就是去检查my.cnf中是否包含read_only相关参数,检查之后发现确实是使用了该参数,
如下:
[mysqld]
read_only=1
此时注释掉该参数,然后再次启动mysqld,发现MyISAM表可以自动修复,且正常启动;error log信息如下:
...
2020-08-18T11:52:26.775553+08:00 [ERROR] /usr/local/mysql/bin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
2020-08-18T11:52:26.776217+08:00 [Warning] Checking table: './mysql/user'
2020-08-18T11:52:26.776273+08:00 [ERROR] 1 client is using or hasn't closed the table properly
2020-08-18T11:52:26.882537+08:00 [Note] Failed to start slave threads for channel ''
2020-08-18T11:52:26.906018+08:00 [Note] Event Scheduler: Loaded events
2020-08-18T11:52:26.906480+08:00 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.7.31-debug-log' socket: '/var/lib/mysql/sock/mysql.sock' port: 3306 Source distribution
由于docker一些限制,我们在mysqld启动会涉及两次;所以解决该问题的方式为:次mysqld的启动时先关闭read_only参数,第二次启动时开启read_only参数。之所以选择默认开启read_only参数,
是为了避免在mysqld启动后,选主逻辑未完成时的保护措施;当然选主完成后,会自动对master执行 set global read_only=0 操作。
五、总结
MySQL小版本的升级也会有变化,一定要做好升级前的多维度测试工作。
MySQL源代码量很多,想要全部了解也很困难;此时我们通过对比不同版本之间的现象差异分别 进行调试,找出差异点后再深入探索到某些函数,效果事半功倍。
六、附录
调试的栈帧信息如下,有兴趣的小伙伴可以研究下:
(gdb) bt
# check_readonly (thd=xcde33f0, err_if_readonly=true) at /opt/mysql- server/sql/auth/sql_authorization.cc:491
#1 x0000000001486262 in lock_table_names (thd=xcde33f0, tables_start=xcdc9100, tables_end=x0, lock_wait_timeout=31536000, flags=) at /opt/mysql-server/sql/sql_base.cc:5646
#2 x0000000001484be5 in Open_table_context::recover_from_failed_open (this=x7fffffffcad0) at /opt/mysql-server/sql/sql_base.cc:4789
#3 x0000000001486758 in open_tables (thd=xcde33f0, start=x7fffffffcb90, counter=x7fffffffcbd4, flags=2048, prelocking_strategy=x7fffffffcc10) at /opt/mysql-server/sql/sql_base.cc:5891
#4 x0000000001487851 in open_and_lock_tables (thd=xcde33f0, tables=x7fffffffccd0, flags=2048, prelocking_strategy=x7fffffffcc10) at /opt/mysql-server/sql/sql_base.cc:6583
#5 x0000000000ea2611 in open_and_lock_tables (thd=xcde33f0, tables=x7fffffffccd0, flags=2048) at /opt/mysql-server/sql/sql_base.h:486
#6 x0000000000e9c2cf in acl_reload (thd=xcde33f0) at /opt/mysql- server/sql/auth/sql_auth_cache.cc:2091
#7 x0000000000e9a2fd in acl_init (dont_read_acl_tables=false) at /opt/mysql- server/sql/auth/sql_auth_cache.cc:1429
#8 x0000000000e6f8f2 in mysqld_main (argc=136, argv=x2c136a8) at /opt/mysql-server/sql/mysqld.cc:4958
#9 x0000000000e66cdd in main (argc=2, argv=x7fffffffe458) at /opt/mysql- server/sql/main.cc:32
以上文章来源于沃趣技术 ,作者岳雷
相关文章