log file switch (archiving needed) 事件导致系统缓慢-归档空间正常

2022-01-06 00:00:00 等待 写入 进程 阻塞 归档


用户反映整个业务操作缓慢,影响业务运行,需要紧急排查,下面是分析步骤

1 看下数据库告警

ORACLE Instance xxxx303 - Cannot allocate log, archival required
Thread 3 cannot allocate new log, sequence 146566
All online logs need archiving
Examine archive trace files for archiving errors
Current log# 9 seq# 146565 mem# 0: //u011/xxxx30/redo17.log
Current log# 9 seq# 146565 mem# 1: //u012/xxxx30/redo18.log
Thread 3 advanced to log sequence 146566 (LGWR switch)
Current log# 11 seq# 146566 mem# 0: //u011/xxxx30/redo21.log
Current log# 11 seq# 146566 mem# 1: //u012/xxxx30/redo22.log
Archived Log entry 268556 added for T-3.S-146565 ID 0xe4617d4f LAD:1
Thread 3 advanced to log sequence 146567 (LGWR switch)
Current log# 12 seq# 146567 mem# 0: //u011/xxxx30/redo23.log
Current log# 12 seq# 146567 mem# 1: //u012/xxxx30/redo24.log
Archived Log entry 268557 added for T-3.S-146566 ID 0xe4617d4f LAD:1
ORACLE Instance xxxx303 - Cannot allocate log, archival required
Thread 3 cannot allocate new log, sequence 146568
All online logs need archiving
Examine archive trace files for archiving errors
Current log# 12 seq# 146567 mem# 0: //u011/xxxx30/redo23.log
Current log# 12 seq# 146567 mem# 1: //u012/xxxx30/redo24.log
(其实这里可以查看arc跟踪文件,因为感觉问题出在归档上,根据自己习惯做了如下操作)
在线日志需要归档 但是有问题无法完成归档,我们看看数据库等待事件

select event ,count(*) from v$active_session_history
where sample_time >to_date('202201061100','yyyymmddhh24mi') and event is not null group by event having count(*)>10 order
by count(*);
EVENT COUNT(*)
-------------------------------------------------- ----------
log file switch (archiving needed) 529848
我们看到严重的写日志有问题,也就是Oracle认为无法归档成功。
查询用户会话阻塞情况
SQL> select user_id,sql_id,event ,count(*) from v$active_session_history
where sample_time >to_date('202201061100','yyyymmddhh24mi') and event like 'log file%' and event is not null group by user_id,sql_id,event having count(*)>10 order
by count(*);

省略输出
231 2hsuv64sdjt4w log file switch (archiving needed) 5788
147 dkqkn4rbv2q44 log file switch (archiving needed) 6073
147 3qmth1t2zjw18 log file switch (archiving needed) 8816
147 48nya9ct5z7v7 log file switch (archiving needed) 13505
228 3dx8vk477jzup log file switch (archiving needed) 21302
169 1fcxttxxfmxsu log file switch (archiving needed) 141689
939 rows selected.

发现大量用户被写归档事件阻塞,这就是大量用户反映慢的根因。但是为什么写入有问题。我们从两个角度考虑
1 归档满了
2 归档进程写磁盘异常

查询空间排除归档满的问题。
检查归档进程是否异常等待

SQL> select sid,serial#,process ,program,EVENT,SECONDS_IN_WAIT from v$session where program like '%ARC%';

SID SERIAL# PROCESS PROGRAM EVENT SECONDS_IN_WAIT
---------- ---------- ---------- ---------------------------------------- ------------------------------ ---------------
13189 39093 255660 oracle@n-pc-sr850-324 (ARC0) rdbms ipc message 47
13503 53177 255664 oracle@n-pc-sr850-324 (ARC1) Disk file operations I/O 5130
13660 36034 255666 oracle@n-pc-sr850-324 (ARC2) rdbms ipc message 229
13817 30293 255668 oracle@n-pc-sr850-324 (ARC3) rdbms ipc message 107

我们看到此时process 255664等待了5130秒,等待事件为文件I/O等待,所以我们怀疑问题出在操作系统和存储层面,此时归档依然可以写
,但是写入慢,好像被什么阻塞了。

后续从操作系统messages查看有无异常,发现如下报错

Jan 6 09:07:37 n-pc-sr850-324 kernel: INFO: task ora_arc1_xxxx30:255664 blocked for more than 120 seconds.
Jan 6 09:07:37 n-pc-sr850-324 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 6 09:07:37 n-pc-sr850-324 kernel: ora_arc1_xxxx30 D 00000000005b1013 0 255664 1 0x00000084
Jan 6 09:07:37 n-pc-sr850-324 kernel: ffff88726b68b730 0000000000000086 ffff883fb879cf10 ffff88726b68bfd8
Jan 6 09:07:37 n-pc-sr850-324 kernel: ffff88726b68bfd8 ffff88726b68bfd8 ffff883fb879cf10 ffff883fb879cf10
Jan 6 09:07:37 n-pc-sr850-324 kernel: ffff8854afb0c2b0 ffff8854afb0c2a8 ffff887172b0d870 00000000005b1013
Jan 6 09:07:37 n-pc-sr850-324 kernel: Call Trace:
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ea84e0>] vx_svar_sleep_unlock+0x70/0xd0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ea8420>] ? vx_basic_set_mnamfs+0xd0/0xd0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ea8ccf>] vx_event_wait+0x2f/0x80 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0d93af0>] vx_async_waitmsg+0x20/0x50 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0d93c98>] vx_msg_send+0x178/0x190 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc09e2bb1>] ? gab_allocmsg+0x11/0x30 [gab]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e9331b>] vx_extset_msg+0x2eb/0x660 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0d9df7e>] vx_cfs_extset+0x3e/0x50 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e37727>] vx_extset+0x447/0x580 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e37b45>] vx_setext+0x2e5/0x5c0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0a1c80b>] ? vxg_api_lock+0x8b/0xc0 [vxglm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0f662b3>] vx_uioctl+0x6a3/0x9e0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0a33b9a>] ? vxg_lock_ilock_omnibus+0x3aa/0x3c0 [vxglm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e778ae>] ? vx_iglock2+0x8e/0x110 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0a1acd0>] ? vxg_upgrade_fastsub+0x50/0x50 [vxglm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e43ef4>] ? vx_glm_unlock+0x34/0x60 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0f06d33>] vx_vop_ioctl+0x23/0x30 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ee23c3>] vx_odm_resize+0xc3/0x280 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae9ef4>] odm_vx_resize+0x94/0x130 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0aea0c1>] odm_vx_create+0x131/0x190 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0af24b2>] odm_file_create.isra.12+0x173/0x2c1 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae98bf>] ? odm_vx_dirlook+0x3f/0x190 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0adf44a>] ? odm_node_rele+0x4a/0x1d0 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae0358>] ? odm_clust_reserve+0xd8/0x190 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae681d>] odm_create+0x65d/0x850 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0add27f>] odm_ioctl_ctl+0x16f/0x3a0 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0aee101>] odm_ioctl_ctl_unlocked+0x21/0x30 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff812151cd>] do_vfs_ioctl+0x33d/0x540
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff816affd1>] ? __do_page_fault+0x171/0x450
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff81215471>] SyS_ioctl+0xa1/0xc0
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b

上面提示很明显task ora_arc1_xxxx30:255664 blocked for more than 120 seconds.也就是数据库的ARC1归档进程卡在IO上了,而这个IO
明显是操作系统和存储层面的问题,后续交由系统组和存储工程师排查,我们暂时将业务迁移到其他实例。


后续:参考MOS (Doc ID 1476444.1)的说明,还是很有借鉴意义
事件log file switch (archiving needed)发生原因,lgwr进程切换日志,但是要切换到的日志目前其归档无法完成写操作。
问题症状:
1 对于集群而言本地实例有问题,前提是只是本地进程写有问题,如果是共享归档空间,比如满了,也可能影响其他实例。
2只是部分进程收到影响,比如增删改,其实对于OLTP系统,这个影响面已经很大了。
3 该等待事件在DBtime占比很高
4 归档空间不足

如何解决
1 检查告警日志是否归档被失败的归档写入操作阻塞(这个在本例中确认了,写入被存储层面问题阻塞了)
2 加速归档,增加归档进程,或者归档存储条带化处理(条带化这个生产库不现实)
3 检查归档空间问题,满了释放空间或者扩容解决




相关文章