Thursday, September 27, 2012

10.2.0.3 - Mounting a diskgroup fails with error ORA-600 [kfcema02]

Just yesterday I became familiar with a bug (6163771) that can rear its ugly head whenever you perform a SHUTDOWN ABORT. It was reported fixed in 11.1.0.7 and has the following description:

During instance recovery, mounting a diskgroup can fail with ORA-600[KFCEMA02].
There is a mismatch between the FCN recorded in the block and the FCN recorded
in the ACD.  block FCN < ACD fcn.

The top functions in the call stack are:

    kfgInitCache -> kfcMount -> kfrcrv -> kfrPass2 -> kfcema

The trace file contains the FCN for the current block being recovered,
e.g.:

 kfbh_kfcbh.fcn_kfbh = 0.5538283

 BH: (0x3807959c0)  bnum=13 type=FILEDIR state=rcv chgSt=not modifying
      flags=0x00000000  pinmode=excl  lockmode=null  bf=0x38040c000
      kfbh_kfcbh.fcn_kfbh = 0.5538283  lowAba=0.0  highAba=0.0
      last kfcbInitSlot return code=null cpkt lnk is null

The ACD FCN is the second argument of the ORA-600 [KFCEMA02] error.
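To make the comparison concrete, here is a minimal shell sketch that pulls the block FCN out of a trace line and compares it with an ACD FCN. It assumes an FCN is printed as wrap.base (the two numbers on either side of the dot). The helper names fcn_from_trace and fcn_lt are made up for this post, and the ACD value in the example is invented.

```shell
# Hypothetical helpers; assume an FCN is printed as "wrap.base".

# Print the first block FCN found in the given trace file(s), or on stdin.
fcn_from_trace() {
    sed -n 's/.*kfbh_kfcbh\.fcn_kfbh = \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' "$@" | head -n 1
}

# Succeed (exit status 0) when FCN $1 is strictly lower than FCN $2.
fcn_lt() {
    fcn_w1=${1%%.*}; fcn_b1=${1#*.}
    fcn_w2=${2%%.*}; fcn_b2=${2#*.}
    if [ "$fcn_w1" -ne "$fcn_w2" ]; then
        [ "$fcn_w1" -lt "$fcn_w2" ]
    else
        [ "$fcn_b1" -lt "$fcn_b2" ]
    fi
}

# Example with the trace value from above and a made-up ACD FCN.
block_fcn=$(echo "kfbh_kfcbh.fcn_kfbh = 0.5538283  lowAba=0.0  highAba=0.0" | fcn_from_trace)
acd_fcn="0.5538290"   # invented for illustration; take the real one from the ORA-600 arguments
if fcn_lt "$block_fcn" "$acd_fcn"; then
    echo "block FCN $block_fcn is behind ACD FCN $acd_fcn"
fi
```

On a real system you would pass the trace file name to fcn_from_trace and read the ACD FCN from the second ORA-600 argument.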

This patch does not fix a diskgroup with the error already introduced.  
It will prevent future occurrences.

Hdr: 6163771 10.2.0.3 RDBMS 10.2.0.3 ASM PRODID-5 PORTID-23
Abstract: CANNOT MOUNT DISKGROUP DUE TO ORA-600 [KFCEMA02]

PROBLEM:
--------
The customer had a maintenance window (for something else) this morning on 
this development RAC. We could not shut down cleanly. Then, after the 
maintenance window, the FRA diskgroup would not mount.

WORKAROUND:
-----------
N/A

REPRODUCIBILITY:
----------------
At will


STACK TRACE:
------------
ksedmp kgerinv kgeasnmierr kfcema kfrPass2 kfrcrv
kfcMount kfgInitCache kfgFinalizeMount 3088 kfgscFinalize kfgForEachKfgsc
kfgsoFinalize kfgFinalize kfxdrvMount kfxdrvEntry opiexe opiosq0
kpooprx kpoal8 opiodr ttcpip opitsk opiino
opiodr opidrv sou2o opimai_real...

SUPPORTING INFORMATION:
-----------------------
Alert log and trace file uploaded

PROGRAMMING DETAILS:
-----------------------
Development has found a bug in the way checkpoints are maintained and this 
bug is the probable cause of the kfcema02 assert these customers are seeing.  
 We have a high degree of confidence that the bug we found is the cause of 
the customer issues because of what we saw in the AMDU dumps.

The problem is that buffers on the ping queue are not sorted in any 
particular order.  The fix is for kfrbCkpt to scan the entire ping queue to 
find the oldest buffer when computing the new checkpoint.   kfcbDriver is 
also updated to scan the entire ping queue when computing the targetAba for 
kfcbCkpt, but that code change is not critical because the only effect of 
having the targetAba be higher than it should be was that DBWR would write 
more dirty buffers than it really needed to.
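A toy illustration of why an unsorted queue matters, written as a shell sketch. This is not Oracle's code. Each argument stands for the ABA of a dirty buffer on the ping queue, simplified to a single integer, and the "before" behaviour is only an assumption (modelled here as trusting the head of the queue to be the oldest buffer). The function names are made up.

```shell
# Toy model only; the real ABA is not a single integer.

# Checkpoint candidate if the queue head is assumed to be the oldest buffer.
ckpt_from_head() {
    echo "$1"
}

# Checkpoint candidate after scanning the entire queue for the oldest buffer.
ckpt_from_full_scan() {
    oldest=$1
    for aba in "$@"; do
        if [ "$aba" -lt "$oldest" ]; then
            oldest=$aba
        fi
    done
    echo "$oldest"
}

# Unsorted queue: the head (120) is not the oldest entry (95).
echo "head only: $(ckpt_from_head 120 95 140)"
echo "full scan: $(ckpt_from_full_scan 120 95 140)"
```

In this model, a checkpoint taken from the head would move past a buffer that has not been written yet, which is the kind of inconsistency recovery would later trip over.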

After reading this blog post from a while ago on ORACLE-L, I was not encouraged, to say the least.

Reaching out to Oracle Support solved the problem by way of an 11g tool (it also runs on 10g) called facp, used together with AMDU. AMDU was released with 11g and is used to locate the ASM metadata across the disks. Like many other tools released with 11g, it can also be used in 10g environments. Note 553639.1 is the placeholder for the different platforms, and it also includes the configuration instructions. For this fix AMDU only needs to be configured (not run), since facp calls AMDU itself.

Steps taken to resolve:

Transfer amdu and facp to a working directory and add that directory to LD_LIBRARY_PATH, PATH and any other relevant variables.
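For example, the environment setup might look like the sketch below. The directory /tmp/facp_work is only a placeholder, so substitute wherever you copied the tools. Which other variables are relevant depends on your platform and on the instructions in the note.

```shell
# Placeholder working directory; change it to wherever amdu and facp were copied.
WORKDIR=${WORKDIR:-/tmp/facp_work}

# Put the working directory first on both search paths.
export PATH="$WORKDIR:$PATH"
export LD_LIBRARY_PATH="$WORKDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Warn (without failing) if a tool cannot be found on PATH.
for tool in amdu facp; do
    command -v "$tool" >/dev/null 2>&1 || echo "warning: $tool not found on PATH" >&2
done
```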

Download the facp script from the SR attachment.

Next, scan the ACD and generate the pertinent files:

$./facp '/dev/oracleasm/disks*' 'DG6' ALL

This generates files named facp* in the same directory.

Then try adjusting all checkpoints by 10 blocks:

$./facp_adjust -10

After adjusting the checkpoints, run facp_check to verify that they are valid:

$./facp_check

If you adjusted too much, facp_check will not print "Valid Checkpoint". Try adjusting less,
and repeat until you get "Valid Checkpoint" for both threads.

Once facp_check reports "Valid Checkpoint" for all threads, you can proceed
with the real patching, which means updating the ACD records.
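Since the patch step writes to ASM metadata, it can be worth gating it on the check output rather than eyeballing it. The sketch below is a hypothetical helper, not part of facp. It assumes facp_check prints one line containing "Valid Checkpoint" per thread. I do not know the exact output format, so treat the pattern as a starting point.

```shell
# Hypothetical gate: succeed only if stdin contains at least $1 lines with
# "Valid Checkpoint". Assumes facp_check prints one such line per thread.
all_threads_valid() {
    expected=$1
    found=$(grep -c "Valid Checkpoint" || true)
    [ "$found" -ge "$expected" ]
}

# Intended use on a two-thread system (not run here):
#   ./facp_check | all_threads_valid 2 && echo "OK to run facp_patch"
```

If the helper fails, go back to the adjust step instead of patching.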

Write ASM metadata with the new data:

$./facp_patch

Then try to mount this diskgroup manually:

SQL> alter diskgroup DG6 mount; --------->> ASM sqlplus

SQL> select name,state from v$asm_diskgroup; --------->> ASM sqlplus


Everything showed MOUNTED and we were able to bring up our production DB.

If you experience this issue, log an SR with Oracle Support to obtain these tools if they are not already on your system.



