Recently, on a customer’s two-node Oracle Clusterware, one ASM instance crashed at every startup.
More facts:
- Starting it manually did not work either.
- The CSSD was running.
- For obvious reasons, CRSD did not start.
- The other ASM instance in the cluster briefly registered a CLUSTER RECONFIGURATION.
The ASM alert log looked like this:
Sun Nov 13 13:44:08 2011
MMNL started with pid=21, OS id=7783
lmon registered with NM - instance number 2 (internal mem no 1)
Sun Nov 13 13:46:05 2011
System state dump requested by (instance=2, osid=7684 (PMON)),
summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_7706.trc
Sun Nov 13 13:46:05 2011
PMON (ospid: 7684): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111113134605], requested by (instance=2, osid=7684 (PMON)),
summary=[abnormal instance termination].
Instance terminated by PMON, pid = 7684
A strange problem. Checking device permissions, running read/write tests, rebooting the cluster in a downtime window: nothing helped.
To make a long story short: the NTP daemon was running, but it was not synchronising its time. Because CTSS found a running NTP daemon, it stayed in observer mode and did not correct the clocks itself, so the server times drifted apart. Fixing NTP fixed the cluster.
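A running NTP daemon is not the same as a synchronised one, so a quick check would have saved some time here. The following is only a minimal sketch, not what was used at the customer site: it assumes the classic ntpd with the ntpq tool on the PATH, and the function names has_sync_peer and check_local_ntp are made up for this illustration. It relies on the fact that ntpq -p marks the peer the daemon is actually synchronised to with a tally code in the first column ('*' for the system peer, 'o' for a PPS peer).

```python
import subprocess

# Tally codes ntpq prints in the first column for the peer the daemon
# is actually synchronised to: '*' = system peer, 'o' = PPS peer.
SYNC_TALLY_CODES = ("*", "o")


def has_sync_peer(ntpq_output: str) -> bool:
    """Return True if any line of `ntpq -p` output is marked as the sync source.

    Hypothetical helper for illustration. The header line starts with
    spaces and the separator with '=', so neither can match.
    """
    for line in ntpq_output.splitlines():
        if line.startswith(SYNC_TALLY_CODES):
            return True
    return False


def check_local_ntp() -> bool:
    """Run `ntpq -pn` locally and report whether a sync peer is selected.

    Assumes ntpq is installed; -n prints numeric addresses so that no
    DNS lookups slow the check down.
    """
    result = subprocess.run(
        ["ntpq", "-pn"], capture_output=True, text=True, timeout=10
    )
    return result.returncode == 0 and has_sync_peer(result.stdout)
```

Calling check_local_ntp() on each node returns False when the daemon is up but has no selected time source, which is the situation described above. On the Clusterware side, crsctl check ctss should report whether CTSS is in observer or active mode.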
Nota bene
Martin