oracle internals3-4

2023-01-31 01:01:31 oracle internals3

The most recent log entries generated by the redo thread of an Oracle instance are addressed by an RBA (redo byte address). An RBA has three parts:

- Log file sequence number: 4 bytes

- Log file block number: 4 bytes

- Byte offset within the block at which the redo record starts: 2 bytes

RBAs are not necessarily unique within their redo thread, because when the database is opened with RESETLOGS the log file sequence number can be reset to 1 in every redo thread.
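As a rough illustration of the three-part layout, the sketch below models an RBA as a 10-byte value. The class name, the field names, and the big-endian byte order are assumptions made for the example; they are not Oracle's actual on-disk format.

```python
import struct
from typing import NamedTuple


class RBA(NamedTuple):
    """Illustrative redo byte address; names and byte order are assumptions."""
    sequence: int  # log file sequence number (4 bytes)
    block: int     # log file block number (4 bytes)
    offset: int    # byte offset within the block (2 bytes)

    def pack(self) -> bytes:
        # 4 + 4 + 2 = 10 bytes; explicit byte order means no padding is added
        return struct.pack(">IIH", self.sequence, self.block, self.offset)

    @classmethod
    def unpack(cls, raw: bytes) -> "RBA":
        return cls(*struct.unpack(">IIH", raw))


# Tuples compare field by field, so RBAs order naturally within one incarnation.
earlier = RBA(sequence=5, block=10, offset=0)
later = RBA(sequence=6, block=1, offset=0)
print(earlier < later, len(earlier.pack()))

# After RESETLOGS the sequence restarts at 1, so the same RBA value can recur.
before_resetlogs = RBA(sequence=1, block=2, offset=16)
after_resetlogs = RBA(sequence=1, block=2, offset=16)
print(before_resetlogs == after_resetlogs)
```

Because the comparison is purely positional, two RBAs are only meaningfully ordered when they belong to the same thread and the same database incarnation.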

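RBAs matter in the following contexts:

For a dirty block in the database buffer cache, the low RBA is the address of the redo generated by the earliest change to the block, and the high RBA is the address of the redo generated by the most recent change.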

Dirty buffers are maintained on the buffer cache checkpoint queues in low RBA order. The checkpoint RBA is the point up to which DBWn has written buffers from the checkpoint queues if incremental checkpointing is enabled -- otherwise it is the RBA of last full thread checkpoint. The checkpoint RBA is copied into the checkpoint progress record of the controlfile by the checkpoint heartbeat once every 3 seconds. Instance recovery, when needed, begins from the checkpoint RBA recorded in the controlfile. The target RBA is the point up to which DBWn should seek to advance the checkpoint RBA to satisfy instance recovery objectives.

The on-disk RBA is the point up to which LGWR has flushed the redo thread to the online log files. DBWn may not write a block for which the high RBA is beyond the on-disk RBA. Otherwise transaction recovery (rollback) would not be possible, because the redo needed to undo a change is always in the same redo record as the redo for the change itself.
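The rule that DBWn may not write a block whose high RBA is beyond the on-disk RBA can be sketched as a simple filter. This is a toy model, not Oracle code: `DirtyBuffer`, `split_writable`, and the sample RBA values are all hypothetical, and RBAs are represented as plain `(sequence, block, offset)` tuples.

```python
from dataclasses import dataclass

RBA = tuple  # (sequence, block, offset); tuples compare field by field


@dataclass
class DirtyBuffer:
    name: str
    low_rba: RBA   # redo address of the first change since the buffer was clean
    high_rba: RBA  # redo address of the most recent change


def split_writable(buffers, on_disk_rba):
    """Return (writable, deferred) given how far LGWR has flushed redo."""
    writable = [b for b in buffers if b.high_rba <= on_disk_rba]
    deferred = [b for b in buffers if b.high_rba > on_disk_rba]
    return writable, deferred


buffers = [
    DirtyBuffer("A", (10, 5, 0), (10, 40, 16)),
    DirtyBuffer("B", (10, 7, 0), (10, 90, 0)),
]
writable, deferred = split_writable(buffers, on_disk_rba=(10, 60, 0))
print([b.name for b in writable], [b.name for b in deferred])
```

In this example buffer B has to wait until LGWR has flushed redo up to at least its high RBA, while buffer A can be written immediately.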

The term sync RBA is sometimes used to refer to the point up to which LGWR is required to sync the thread. However, this is not a full RBA -- only a redo block number is used at this point.

The low and high RBAs for dirty buffers can be seen in X$BH. (There is also a recovery RBA which is used to record the progress of partial block recovery by PMON.) The incremental checkpoint RBA, the target RBA and the on-disk RBA can all be seen in X$TARGETRBA. The incremental checkpoint RBA and the on-disk RBA can also be seen in X$KCCCP. The full thread checkpoint RBA can be seen in X$KCCRT.

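(A note on dirty buffers in the buffer cache: the low RBA is the redo position of the first change made to a block since it was last clean, that is, since its buffer contents last matched what is on disk. The high RBA is the redo position of the most recent change to that block. Low and high refer to the same block changed at different times without being cleaned in between, for example when it is updated again before it could be written to disk. Dirty buffers sit on the checkpoint queue in low RBA order. If incremental checkpointing is enabled, the checkpoint RBA marks how far DBWn has written buffers from that queue to disk; otherwise it is the RBA of the most recent full thread checkpoint. The checkpoint RBA is written to the controlfile once every 3 seconds, and instance recovery, when needed, starts from the checkpoint RBA recorded in the controlfile. The target RBA is the point to which DBWn should try to advance the checkpoint RBA.

The on-disk RBA is the position up to which LGWR has written the log file. DBWn must not write a block to disk if its high RBA is beyond the on-disk RBA. In that case DBWn asks LGWR to write the log first, so the redo for any block written to disk always reaches the log file before the block itself. Otherwise transaction recovery (rollback) would be impossible, because recovery needs the undo information, and the undo information is stored in the same redo record as the redo for the change.)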

What is a checkpoint?
A checkpoint is a database event that flushes modified data from the buffer cache to disk and updates the controlfile and the datafiles.

When does a checkpoint occur?
We now know that a checkpoint flushes dirty data, but when does one happen? The following events trigger a checkpoint.
1. A log group switch occurs
2. The settings of the LOG_CHECKPOINT_TIMEOUT, LOG_CHECKPOINT_INTERVAL, fast_start_io_target or fast_start_mttr_target parameters are reached
3. ALTER SYSTEM SWITCH LOGFILE is run
4. ALTER SYSTEM CHECKPOINT is run
5. alter tablespace XXX begin backup or end backup is run
6. A tablespace or datafile is taken offline with alter tablespace or datafile offline

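Incremental checkpoint
Starting with Oracle 8, the incremental checkpoint mechanism was introduced. In earlier versions every checkpoint was a full thread checkpoint, so all dirty data was written to disk and the resulting burst of I/O hurt system performance badly. To solve this, Oracle introduced the checkpoint queue. Every dirty block is placed on the checkpoint queue, ordered by its low RBA (the redo byte address of the first change to the block). Blocks near the tail of the queue have the smallest low RBA values, and a dirty block that is modified again keeps its position in the queue, which guarantees that blocks modified earlier are written to disk earlier. Every 3 seconds CKPT updates the controlfile and the datafiles to record the progress of the checkpoint.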

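Correction: only the controlfile is updated here; the datafiles are not updated every 3 seconds.
Saying that this records the progress of the checkpoint is not wrong, but it is not detailed enough. Because of how incremental checkpoints and the checkpoint queue work, CKPT each time only tells DBWR the position up to which dirty buffers should be written, that is, an end point in the checkpoint queue. Every 3 seconds CKPT then reports in the controlfile the latest position DBWR has written up to. When the database needs instance recovery, it can start from this latest position rather than from the checkpoint SCN in the datafiles. This shortens recovery time, and startup after an instance crash in particular is faster.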
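The interplay of the checkpoint queue and the 3-second report can be sketched with a toy model. Everything here is hypothetical (`CheckpointQueue`, its method names, the sample RBAs), and it assumes RBAs are handed out in increasing order, so that insertion order equals low RBA order. The queue is modelled oldest first.

```python
class CheckpointQueue:
    """Toy model: dirty buffers kept in low-RBA order (names are hypothetical)."""

    def __init__(self):
        # block id -> low RBA; a dict preserves insertion order
        self.low_rba = {}

    def mark_dirty(self, block_id, rba):
        # Only the first change since the buffer was clean sets the low RBA,
        # so a re-dirtied block keeps its place in the queue.
        self.low_rba.setdefault(block_id, rba)

    def write_oldest(self, count):
        """Model DBWn writing the `count` oldest buffers to disk."""
        for block_id in list(self.low_rba)[:count]:
            del self.low_rba[block_id]

    def checkpoint_rba(self, current_rba):
        # Recovery must start at the oldest change not yet on disk;
        # with nothing dirty it can start at the current end of redo.
        if self.low_rba:
            return next(iter(self.low_rba.values()))
        return current_rba


q = CheckpointQueue()
q.mark_dirty("blk1", (7, 10, 0))
q.mark_dirty("blk2", (7, 20, 0))
q.mark_dirty("blk1", (7, 30, 0))   # re-dirtied: position unchanged
q.mark_dirty("blk3", (7, 40, 0))
q.write_oldest(1)                   # DBWn writes blk1

# What the 3-second heartbeat would record in the controlfile in this model.
controlfile_checkpoint_rba = q.checkpoint_rba(current_rba=(7, 50, 0))
print(controlfile_checkpoint_rba)
```

Instance recovery in this model would begin at the recorded RBA and skip all redo before it, which is why advancing the value frequently shortens recovery.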

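Also note that when a checkpoint occurs, CKPT does not update the datafile headers and the controlfile with the SCN at which the current checkpoint occurred. It writes the SCN of the last checkpoint for which DBWR has finished writing. In other words, the update of the controlfile and datafile headers lags behind the occurrence of the checkpoint. This follows from how recovery works: when the checkpoint occurs the dirty buffers have not yet been written, so the headers cannot be moved to the current SCN right away.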

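Four situations cause LGWR to perform a write

1. When LGWR is idle it sits in the rdbms ipc message wait with a 3-second timeout (DBWn does the same). On timeout, if LGWR finds redo that needs writing, it writes it.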

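2. When a process allocates blocks in the log buffer, if the used part of the log buffer is greater than or equal to the value of the _log_io_size parameter and LGWR is not active, LGWR is posted to write. The default is 1/3 of the log buffer, capped at 1M from 8i onward. Unless it is set explicitly, a parameter such as _log_io_size shows as 0 in X$KSPPSV.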

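3. When a transaction commits, it generates a commit marker. The transaction is not recoverable, however, until the log block holding that marker has been written to disk. So before the process finishing the transaction returns to the user, it must wait for LGWR to write all the corresponding log blocks. During that time the process is in the log file sync wait, with a timeout of 1 second. Setting _wait_for_sync to false avoids the redo sync wait, but then the recoverability of the transaction is not guaranteed after an instance failure.

A commit inside a recursive call does not need to wait for the sync until control is about to return to the user.

An SGA variable, b, communicates the log block number up to which the redo thread needs to be synced. If several transactions commit before LGWR wakes up, b records the highest log block number that needs syncing, and the commit markers are all written to disk in a single write. This is the so-called group commit.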

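4. When DBWn needs to write one or more blocks whose high RBA is beyond LGWR's on-disk RBA, an LGWR write is triggered. From 8i onward, DBWn places those blocks on a deferred write queue, asks LGWR to sync up to the highest RBA, and immediately carries on with other work. Before 8i, DBWn had to wait in the log file sync state.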
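The threshold in case 2 and the group commit behaviour in case 3 can be sketched together in a toy model. `ToyLGWR`, `write_threshold`, and `LOG_IO_CAP` are hypothetical names, sizes are in bytes for simplicity, and the 1M cap is an assumption about how the 8i default behaves rather than a confirmed detail.

```python
LOG_IO_CAP = 1024 * 1024  # assumed 1M upper bound (8i onward)


def write_threshold(log_buffer_bytes):
    # Case 2: one third of the log buffer, capped (the cap is an assumption)
    return min(log_buffer_bytes // 3, LOG_IO_CAP)


class ToyLGWR:
    def __init__(self, log_buffer_bytes):
        self.threshold = write_threshold(log_buffer_bytes)
        self.on_disk_block = 0   # highest log block flushed so far
        self.sync_target = 0     # plays the role of the shared SGA variable
        self.writes = 0

    def request_sync(self, block):
        # Cases 3 and 4: a committing session or DBWn asks for redo up to
        # `block` to be on disk; only the highest request is remembered.
        self.sync_target = max(self.sync_target, block)

    def threshold_reached(self, used_bytes):
        return used_bytes >= self.threshold

    def wake(self):
        # One write covers every request made since the last wake-up.
        if self.sync_target > self.on_disk_block:
            self.on_disk_block = self.sync_target
            self.writes += 1


lgwr = ToyLGWR(log_buffer_bytes=6 * 1024 * 1024)
for commit_block in (101, 104, 102):   # three commits before LGWR wakes
    lgwr.request_sync(commit_block)
lgwr.wake()
print(lgwr.on_disk_block, lgwr.writes, lgwr.threshold)
```

Three commits are satisfied by a single write up to block 104, which is the essence of group commit: the cost of one sync is shared by every session that committed while LGWR was busy or asleep.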
