DBtime过高告警的故障分析与解决过程
收集故障时段AWR,6号同时段AWR,分析v$active_session_history
和dba_hist_active_sess_history
1 从AWR分析一个小时总体DT time,该指标没问题,因为这是累积值,不足以提供证据,对比6号AWR发现,7号AWR的DBT ime比6号增加30%,从侧面验证DBTime确实比正常情况要高。但是总会话数量差异不大。具体数据为48cores,96CPUs,6号DBtime为2564分钟(一小时采样),6号DBtime为1962分钟一小时采样)。
2 对比6号,7号ADDMde 报告差异,7号的ADDM中软解析平均活跃会话数,共享池Latch争议能用相比6号增加28%。
3 从7号和6号的等待事件看,主要等待事件都为library cache lock
发现二者的平均等待时间差异不大,但是等待次数和总等待时间差异较大,分别提高23%,和33%,再次验证了虽然library cache lock等待事件都是6号,7号占比DBT ime高的等待事件,但是7号等待次数和等待时间明显增加,这里需要进一步细化7号01:10左右系统等待事件分布情况。
通过dba_ASH视图查询历史等待事件和SQL执行情况,下面查询时间段都为01:00-01:20 20分钟内的统计。
SYS@tjgm102>select
sql_id,event,count(*) from dba_hist_active_sess_history
where event is not null and SAMPLE_TIME
between
to_date('202004070100','yyyymmddhh24mi')
and to_date('202004070120','yyyymmddhh24mi')
group by sql_id,event having
count(*)>10 order by count(*);
SQL_ID EVENT
COUNT(*)
-------------
------------------------------------------------------------ ----------
6vg4ck9qmfar3 db
file sequential read 11
6vg4ck9qmfar3 db
file parallel read
24
4v719yvgvfnsj
cursor: pin S wait on X 62
Failed Logon Delay
78
4v719yvgvfnsj
kksfbc child completion
338
4v719yvgvfnsj
library cache: mutex X
430
4v719yvgvfnsj
library cache lock
9865
7 rows selected.
发现问题时段等待时间长的事件为library
cache lock,涉及SQL为4v719yvgvfnsj,从AWR的ADDM分析知道该SQL为insert操作。
查询等待事件涉及哪些会话
SYS@tjgm102>l
1
select session_id,sql_id,event,count(*)
2
from dba_hist_active_sess_history
3
where SAMPLE_TIME between
to_date('202004070100','yyyymmddhh24mi') and
to_date('202004070120','yyyymmddhh24mi')
4 and
event='library cache lock'
5*
group by session_id,sql_id,event having count(*)>10 order by count(*)
SYS@tjgm102>/
SESSION_ID
SQL_ID EVENT COUNT(*)
----------
------------- ------------------------- ----------
2668 4v719yvgvfnsj library cache
lock 103
2088 4v719yvgvfnsj library cache
lock 106
1963 4v719yvgvfnsj library cache
lock 106
2855 4v719yvgvfnsj library cache
lock 106
138 4v719yvgvfnsj library cache lock 106
334 4v719yvgvfnsj library cache
lock 107
331 4v719yvgvfnsj library cache
lock 107
2571 4v719yvgvfnsj library cache
lock 107
237 4v719yvgvfnsj library cache
lock 107
2189 4v719yvgvfnsj library cache
lock 107
163 4v719yvgvfnsj library cache
lock 107
2921 4v719yvgvfnsj library cache
lock 107
1352 4v719yvgvfnsj library cache
lock 107
1351 4v719yvgvfnsj library cache
lock 108
3018 4v719yvgvfnsj library cache
lock 108
40 4v719yvgvfnsj library cache
lock 108
省略部分SQL
1837 4v719yvgvfnsj library cache
lock 111
200 4v719yvgvfnsj library cache
lock 111
1418 4v719yvgvfnsj library cache
lock 111
364 4v719yvgvfnsj library cache
lock 111
2157 4v719yvgvfnsj library cache
lock 111
2281 4v719yvgvfnsj library cache
lock 111
810 4v719yvgvfnsj library cache
lock 112
424 4v719yvgvfnsj library cache
lock 112
2764 4v719yvgvfnsj library cache
lock 112
2338 4v719yvgvfnsj library cache
lock 112
1578 4v719yvgvfnsj library cache
lock 114
2696 4v719yvgvfnsj library cache
lock 114
90 rows selected.
发现问题时段大量会话等待library
cache lock,等待时间基本为1100秒,该等待十分异常,一般该等待事件不会等待这么久,很可能发生了阻塞事件,下面查询故障时间段内被阻塞的会话在等待什么,以及那些会话阻塞了他们
select
session_id,sql_id,event,blocking_session ,count(*)
from
dba_hist_active_sess_history
where
SAMPLE_TIME between to_date('202004070100','yyyymmddhh24mi')
and to_date('202004070120','yyyymmddhh24mi')
and blocking_session is not null
group by
session_id,sql_id,event,blocking_session order by count(*) ;
....
SESSION_ID
SQL_ID EVENT BLOCKING_SESSION COUNT(*)
----------
------------- ---------------------------------------- ----------------
----------
3050 4v719yvgvfnsj library cache
lock
2855 6
1837 4v719yvgvfnsj library cache
lock 2855 6
2855 4v719yvgvfnsj library cache
lock 1319 6
9 4v719yvgvfnsj library cache
lock
2855 6
5845 rows selected.
这里省略了部分输出,可以发现大量的会话在等待library cache lock,这些等待由BLOCKING_SESSION引起,下面查询BLOCKING_SESSION在等待什么。进而确定问题。
select session_id,sql_id,event ,blocking_session
,count(*)
from
dba_hist_active_sess_history
where
SAMPLE_TIME between
to_date('202004070100','yyyymmddhh24mi') and
to_date('202004070120','yyyymmddhh24mi')
and blocking_session is not null
and session_id in
(select distinct blocking_session from dba_hist_active_sess_history
where
SAMPLE_TIME between
to_date('202004070100','yyyymmddhh24mi') and to_date('202004070120','yyyymmddhh24mi')
and blocking_session is not null)
group by
session_id,sql_id,event,blocking_session order by count(*) ;
......
1162 4v719yvgvfnsj library cache
lock 458
5
1666 4v719yvgvfnsj library cache
lock 458
5
2219 4v719yvgvfnsj library cache
lock 458 5
2539 4v719yvgvfnsj library cache
lock 458
5
364 4v719yvgvfnsj library cache
lock
2855 5
424 4v719yvgvfnsj library cache
lock
2855 5
1227 4v719yvgvfnsj library cache
lock 2855 5
1484 4v719yvgvfnsj library cache
lock 2855
5
1542 4v719yvgvfnsj library cache
lock 2855
5
1933 4v719yvgvfnsj library cache
lock
2855 5
2442 4v719yvgvfnsj library cache
lock
2855 5
2566 4v719yvgvfnsj library cache
lock 2855
5
138 4v719yvgvfnsj library cache
lock
9 5
40 4v719yvgvfnsj library cache
lock
458 5
294 4v719yvgvfnsj library cache
lock 458 5
1797 4v719yvgvfnsj library cache
lock
458 5
1837 4v719yvgvfnsj library cache
lock 458
5
2314 4v719yvgvfnsj library cache
lock 458
5
2412 4v719yvgvfnsj library cache
lock 458
5
2571 4v719yvgvfnsj library cache
lock 458
5
3050 4v719yvgvfnsj library cache
lock 2855
6
1837 4v719yvgvfnsj library cache
lock
2855 6
2855 4v719yvgvfnsj library cache
lock 1319 6
9 4v719yvgvfnsj library cache
lock
2855 6
5711 rows selected.
此时发现BLOCKING_SESSION也被阻塞,等待事件相同
可以看到所有操作都涉及SQL4v719yvgvfnsj
被ASH捕获,大量的相互阻塞,造成严重的等待,整个时间段内该SQL的等待时间集中在library cache lock,等待时间为98650秒(27小时)
问题处理: 该SQL无法捕获,我们在告警日志中,发现了一条insert操作的解析错误的SQL,后与业务沟通,发现用户在故障时间点,执行了一个Job,该Job中涉及的一张表不存在,造成解析
错误,Oracle无法捕获该SQL,但是记录到了alert日志中,这个是12C具有的功能,在11G需要开特定事件才可以记录。在我们进一步追问后,业务承认是连错了数据库。在用户停止该Job后
DBIME过高告警的问题解决。
相关文章