ORA-00221 ORA-00206 ORA-00202 ORA-27063: Abnormal Shutdown of a Cluster Instance

2020-08-14 00:00:00 Cluster File Error Exception Control

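It was already past quitting time, but because I am helping with the distributed database testing, I checked with the vendor's engineer before leaving about today's test content and the follow-up work plan. Since I really do not know much about distributed databases, I asked a few extra questions about data sharding. By then it was already 6 o'clock, and the afternoon routine inspection still had to be done, which is a mandatory task. After the morning and midday inspections, anomalies rarely show up in the afternoon, and user data fluctuations are fairly regular, so the afternoon inspection is usually relatively easy; the problems that need analysis have mostly been analyzed in the morning.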

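There are three inspection files, covering the database logs, the database system metrics, and the OS-level metrics. In the afternoon I usually look at the database logs first. Suddenly an eye-catching message appeared: one of the clusters had raised an abnormal alert.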

ORA-00206: error in writing (block 3, # blocks 1) of control file

ORA-00202: control file: '/oradata2/orcl1/control02.ctl'

ORA-17500: ODM err:ODM ERROR V-31-3-2-321-4 Input/output error

ORA-00221: error on writing to control file

.....

USER(ospid:173351): terminating the instance due to error 221

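First, the instance in this cluster went down. On the face of it, a write to the control file failed with an ODM I/O error, which led process 173351 to terminate the instance; the termination itself was reported as error 221.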
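As a rough illustration of how such an excerpt can be picked apart during a log inspection, the following Python sketch extracts the ORA error codes and the terminating process from the lines quoted above. The function name `summarize_alert_excerpt` and the regular expressions are assumptions made for this example, not part of any Oracle tool, and real alert logs may be formatted differently.

```python
import re

# Sample lines based on the alert log excerpt in this post
# (whitespace and punctuation normalized).
EXCERPT = """\
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oradata2/orcl1/control02.ctl'
ORA-17500: ODM err:ODM ERROR V-31-3-2-321-4 Input/output error
ORA-00221: error on writing to control file
USER(ospid:173351): terminating the instance due to error 221
"""

# Assumed patterns: an ORA code is "ORA-" followed by five digits, and the
# termination line follows the wording seen in the excerpt.
ORA_CODE = re.compile(r"ORA-(\d{5})")
TERMINATION = re.compile(
    r"\(ospid:\s*(\d+)\): terminating the instance due to error (\d+)"
)


def summarize_alert_excerpt(text):
    """Return the ORA codes in order of appearance and the termination details."""
    codes = [int(m.group(1)) for m in ORA_CODE.finditer(text)]
    term = TERMINATION.search(text)
    killer = None
    if term:
        killer = {"ospid": int(term.group(1)), "error": int(term.group(2))}
    return {"codes": codes, "termination": killer}


if __name__ == "__main__":
    print(summarize_alert_excerpt(EXCERPT))
```

A summary like this only makes the stack easier to read at a glance; it does not say anything about the cause, which still has to come from the meaning of each error.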

Let's look at what each error number means; the reason for the abnormal shutdown then becomes much clearer.

[oracle@rac1 ~]$ oerr ora 206
00206, 00000, "error in writing (block %s, # blocks %s) of control file"
// *Cause: A disk I/O failure was detected on writing the control file. <<<<<==== disk I/O failed while writing the control file
// *Action: Check if the disk is online, if it is not, bring it online and try
// a warm start again. If it is online, then you need to
// recover the disk.

[oracle@rac1 ~]$ oerr ora 202
00202, 00000, "control file: '%s'"
// *Cause: This message reports the name file involved in other messages. <<<<<==== the control file that was affected
// *Action: See associated error messages for a description of the problem.

[oracle@rac1 ~]$ oerr ora 221
00221, 00000, "error on write to control file"
// *Cause: An error occurred when writing to one or more of the control files. <<<<<==== error writing the control file
// *Action: See accompanying messages.

[oracle@rac1 ~]$ oerr ora 17500
17500, 00000, "ODM err:%s"
// *Cause: An error returned by ODM library <<<<<==== ODM error
// *Action: Look at error message and take appropriate action or contact
// Oracle Support Services for further assistance

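From the meaning of each message, the logic of the problem becomes fairly clear: a user process failed to write the control file, Oracle detected that this was an I/O error, and underneath that sits the lower-level ODM error.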
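The layering described above can be sketched as a small lookup. The message templates below are copied from the `oerr` output quoted earlier, while the layer labels and the function `explain_stack` are assumptions added for illustration only.

```python
# Message templates as quoted from `oerr` in this post; the "layer" labels
# are this example's own interpretation, not Oracle terminology.
ORA_MESSAGES = {
    221: ("error on write to control file", "instance"),
    206: ("error in writing (block %s, # blocks %s) of control file", "database I/O"),
    202: ("control file: '%s'", "file name"),
    17500: ("ODM err:%s", "ODM library / storage"),
}


def explain_stack(codes):
    """Format one line per known code; unknown codes are flagged, not guessed."""
    lines = []
    for code in codes:
        if code in ORA_MESSAGES:
            text, layer = ORA_MESSAGES[code]
            lines.append(f"ORA-{code:05d} [{layer}]: {text}")
        else:
            lines.append(f"ORA-{code:05d} [unknown]: look it up with oerr")
    return lines


if __name__ == "__main__":
    # Ordered here from the symptom (instance termination) down to the ODM layer.
    for line in explain_stack([221, 206, 202, 17500]):
        print(line)
```

Codes that are not in the table are deliberately reported as unknown rather than described, since their meaning should come from `oerr` itself.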

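Because we use third-party storage, the boundary of this problem is also fairly clear: an I/O error occurred when Oracle wrote the control file, and that error was caused by a fault in the underlying storage.

All we can do here is coordinate with the systems department and have the storage engineer deal with it.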

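Because we use a service-based high-availability configuration, user workloads automatically moved over to the standby node. Still, this problem has to be resolved quickly, otherwise the load on that node will become very heavy.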

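The next day, after checking, the storage engineer said that a disk had been carved out at the time. He did not explain the specific cause, only that the storage was fine again and the instance could be restarted.

For the root cause, we can only wait for the third-party storage vendor's detailed investigation. How long that will take probably depends on too many factors, so I will not discuss it here. The time is now 11:56, so today's post did not run past midnight!!!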
