The gap
The story started with an always-unpleasant alert log message:
2019-08-08T09:46:32.915070+02:00
PR00 (PID:10646): Media Recovery Waiting for T-1.S-2306
PR00 (PID:10646): Fetching gap from T-1.S-2306 to T-1.S-2405
2019-08-08T09:46:32.925843+02:00
Errors in file /u01/app/oracle/diag/rdbms/xxxp2/XXXP2/trace/XXXP2_mz00_10650.trc:
ORA-01110: data file 1: '+ORADATA/XXXP2/DATAFILE/system.263.1009981381'
<... for all my data files ...>
2019-08-08T09:48:27.919844+02:00
PR00 (PID:10646): FAL: Failed to request gap sequence
PR00 (PID:10646): GAP - thread 1 sequence 2306-2405
PR00 (PID:10646): DBID 123456789 branch 994167271
PR00 (PID:10646): FAL: All defined FAL servers have been attempted
PR00 (PID:10646): -------------------------------------------------------------------------
PR00 (PID:10646): Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
PR00 (PID:10646): parameter is defined to a value that's sufficiently large
PR00 (PID:10646): enough to maintain adequate log switch information to resolve
PR00 (PID:10646): archived redo log gaps.
PR00 (PID:10646): -------------------------------------------------------------------------
The reason
Looking into v$archived_log: The log sequences 2306-2405 ARE applied on standby. And thus deleted by policy from the Fast Recovery Area.
Okay, this explains the effect, but not the root cause. So we had to open a SR with Oracle Support.
The real reason
Seems we have BUG 29056767 – DATAFILES CHECKPOINT NOT UPDATED AT STANDBY DATABASE WHEN MRP IS ENABLED
As a workaround to clarify if this really is the issue, MOS suggested to set an underscore parameter: _time_based_rcv_ckpt_target”=0
The fix for it is included in RU 19.4 and 20.1. FOr 18.x you have to request a backport as of today. (Update: RU 18.8 includes this fix.)
Yes, it worked as far as I can see as of today, but what does it do? It re-sets the so-called “time-based recovery checkpoint target” from 180 seconds to zero. What does THIS mean?
The change in 18c
Oracle changed the checkpointing behaviour of the Media Recovery Process (MRP) from “logfile boundary” to “time-based, every 180 seconds”. (And obviously messed up to catch all cases in conjunction with Data Guard…)
For my understanding, that’s a big change in the way an Oracle Database instance works – the check point at log switch is one of the fundamentals taught to DBAs all over the world. So is the similarity of work being done within Primary and Standby in a Data Guard environment. In this case, Oracle changed both and made the Standby side (again) a bit more specialized for its purpose. I think it’s time (again) to re-think our understanding of the Big Red DB Engine, at least for a part of it.
I can only guess the reasons for a change like this: Maybe they wanted to reduce the impact at log switch time, and reduce potential data loss to maximum of 180 seconds?
For sure we can expect much change in Oracle in the future – they went out of the monolith corner, and had a tendency into highly specialized, but smaller background processes for some time now. Next logical step is to adapt old structures and procedures to being more dynamic and specialized. I am looking forward to it!
Take care
Martin Klier
(-‸ლ)
Hallo Martin,
besten Dank für den Artikel. Mit dem entsprechenden Hinweis funktioniert nun auch unser “report need backup” auf unserer Standby-Datenbank wieder korrekt.
Liebe Grüsse
Martin
The fix is included in RU 18.8 (18.8.0.0.DBRU:191015)(https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=1653470379379428&id=2369471.1&_afrWindowMode=0&_adf.ctrl-state=5sduxsjya_148)