Recover from lost OCR disk group in Oracle Database 12c

By Ewan

September 7, 2013

Summary

This HowTo shows how to recreate, from scratch, a lost ASM disk group that holds the OCR. It applies to Oracle Database 12c RAC. We will rebuild the disk group, restore the OCR, replace the voting disks, and recreate the ASM parameter and password files.

Destruction

First we destroy the OCR disk group by wiping the headers of two of its three disks. Because this is a normal redundancy disk group, losing two of the three disks is sufficient to make the disk group's data unrecoverable.

[oracle@h2-o12-cl1 ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/OCR02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.335665 s, 312 MB/s
[oracle@h2-o12-cl1 ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/OCR03 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.379662 s, 276 MB/s

CRS will eventually realise what has happened.

2013-09-07 13:11:07.820:
[cssd(2494)]CRS-1604:CSSD voting file is offline: ORCL:OCR02; details at (:CSSNM00069:) in /u01/app/12.1.0/grid/log/h1-o12-cl1/cssd/ocssd.log.
2013-09-07 13:11:07.820:
[cssd(2494)]CRS-1626:A Configuration change request completed successfully
2013-09-07 13:11:07.885:
[cssd(2494)]CRS-1601:CSSD Reconfiguration complete. Active nodes are h1-o12-cl1 h2-o12-cl1 .
2013-09-07 13:11:13.303:
[crsd(2666)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/12.1.0/grid/log/h1-o12-cl1/crsd/crsd.log.
2013-09-07 13:11:13.313:
[crsd(2666)]CRS-1006:The OCR location  is inaccessible. Details in /u01/app/12.1.0/grid/log/h1-o12-cl1/crsd/crsd.log.
...
2013-09-07 13:11:26.008:
[crsd(10140)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /u01/app/12.1.0/grid/log/h1-o12-cl1/crsd/crsd.log.

The ora.crsd and ora.storage resources are now shown as OFFLINE.

[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl status resource -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
...
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       h2-o12-cl1             STABLE
...
ora.storage
      1        ONLINE  OFFLINE      h2-o12-cl1             STARTING
--------------------------------------------------------------------------------

The cluster nodes may or may not survive this loss of disks. If they do survive, you have the opportunity to plan some cluster-wide downtime for the rebuild. If they do not, you will need to start the OCR disk group rebuild immediately.

Clean up

Shut down any database(s) manually. Shut down CRS on any cluster node(s) where it is still running. Be patient, as this may take longer than usual to complete.

[root@h1-o12-cl1 ~]# time /u01/app/12.1.0/grid/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'h1-o12-cl1'
...
CRS-5017: The resource action "ora.asm stop" encountered the following error:
ORA-15097: cannot SHUTDOWN ASM instance with connected client (process 2675)
. For details refer to "(:CLSN00108:)" in "/u01/app/12.1.0/grid/log/h1-o12-cl1/agent/ohasd/oraagent_oracle/oraagent_oracle.log".
CRS-2675: Stop of 'ora.asm' on 'h1-o12-cl1' failed
CRS-2679: Attempting to clean 'ora.asm' on 'h1-o12-cl1'
CRS-2681: Clean of 'ora.asm' on 'h1-o12-cl1' succeeded
...
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'h1-o12-cl1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
real    10m12.160s
user    0m0.065s
sys     0m0.097s

Terminate any remaining processes, such as TNS listeners and ONS, on all cluster nodes.

[oracle@h1-o12-cl1 ~]$ ps -ef | grep grid
oracle    2808     1  0 12:12 ?        00:00:00 /u01/app/12.1.0/grid/opmn/bin/ons -d
oracle    2809  2808  0 12:12 ?        00:00:00 /u01/app/12.1.0/grid/opmn/bin/ons -d
oracle    2836     1  0 12:12 ?        00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr LISTENER_SCAN3 -no_crs_notify -inherit
oracle    2848     1  0 12:12 ?        00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit
oracle    2859     1  0 12:12 ?        00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr LISTENER_SCAN2 -no_crs_notify -inherit
[oracle@h1-o12-cl1 ~]$ kill 2808 2809 2836 2848 2859
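
Rather than copying the process IDs by hand, the list could be derived from the ps output. This is only a minimal sketch: the function name grid_pids is hypothetical, and it assumes the Grid home used throughout this HowTo and the standard `ps -ef` column layout shown above (PID in column 2, command starting in column 8).

```shell
# Hypothetical helper: print the PIDs of processes whose command
# starts with the Grid home path. Reads `ps -ef` output on stdin.
GRID_HOME=/u01/app/12.1.0/grid

grid_pids() {
    # Column 2 is the PID; column 8 is the first word of the command.
    awk -v home="$GRID_HOME" 'index($8, home "/") == 1 { print $2 }'
}

# Usage (review the list before killing anything):
#   ps -ef | grid_pids
#   kill $(ps -ef | grid_pids)
```

Matching on the command column rather than grepping the whole line keeps the grep process itself, and any unrelated command that merely mentions the path as an argument, out of the list.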

Deconfigure any IPv4 and IPv6 VIPs that remain configured. Perform these steps for all cluster nodes.

[root@h1-o12-cl1 ~]# ip addr del 10.1.2.92/24 dev eth0
...
[root@h1-o12-cl1 ~]# ip addr del 2001:123:4:5::3:5e/64 dev eth0
...

Rebuild

We are now ready to begin. Re-label the ASM disks.

[root@h1-o12-cl1 ~]# oracleasm listdisks
C1DATA101
C1DATA102
C1DATA103
C1FRA101
C1FRA102
C1FRA103
OCR01
OCR02
OCR03

[root@h1-o12-cl1 ~]# oracleasm scandisks
Reloading disk partitions:  sda: sda1
 sdb: sdb1
 sdd: sdd1
 sdi: sdi1
 sdh: sdh1
 sdf: sdf1
 sdg: sdg1
 sdc: sdc1
 sde: sde1
done
Cleaning any stale ASM disks...
Cleaning disk "OCR02"
Cleaning disk "OCR03"
Scanning system for ASM disks...

[root@h1-o12-cl1 ~]# oracleasm createdisk OCR02 /dev/sdd1
Writing disk header: done
Instantiating disk: done
[root@h1-o12-cl1 ~]# oracleasm createdisk OCR03 /dev/sdg1
Writing disk header: done
Instantiating disk: done

Re-scan the disk headers on the remaining cluster node(s).

[root@h2-o12-cl1 ~]# oracleasm scandisks
Reloading disk partitions:  sda: sda1
 sdc: sdc1
 sdf: sdf1
 sde: sde1
 sdh: sdh1
 sdb: sdb1
 sdd: sdd1
 sdg: sdg1
 sdi: sdi1
done
Cleaning any stale ASM disks...
Scanning system for ASM disks...

Start the clusterware stack, including ASM, on one node in exclusive mode without starting CRSD.

[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
...
CRS-2672: Attempting to start 'ora.asm' on 'h1-o12-cl1'
CRS-2676: Start of 'ora.asm' on 'h1-o12-cl1' succeeded

Check that CRSD is down, and stop it if it is not.

[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl status resource ora.crsd -init
NAME=ora.crsd
TYPE=ora.crs.type
TARGET=OFFLINE
STATE=OFFLINE
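
The check-and-stop step could be scripted roughly as follows. This is a sketch only: crsd_state is a hypothetical helper name, and it assumes the NAME=/STATE= output format shown above. The stop command in the usage comment is the standard crsctl call for a resource managed by OHASD.

```shell
# Hypothetical helper: read `crsctl status resource ora.crsd -init`
# output on stdin and print the value of the STATE= line.
crsd_state() {
    awk -F= '$1 == "STATE" { print $2 }'
}

# Usage, as root on the node started in exclusive mode:
#   CRSCTL=/u01/app/12.1.0/grid/bin/crsctl
#   state=$($CRSCTL status resource ora.crsd -init | crsd_state)
#   case "$state" in
#       OFFLINE*) echo "CRSD is down" ;;
#       *)        $CRSCTL stop resource ora.crsd -init ;;
#   esac
```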

If we attempt to drop the corrupted disk group now, we will receive the following error, because ASM is still aware of the voting file on the remaining disk.

ORA-15039: diskgroup not dropped
ORA-15276: ASM diskgroup C1OCR has cluster voting files

So we first delete the voting files, then drop the corrupted disk group and recreate it.

[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   dcc6f60295074f73bf58d22bee1c0da1 (ORCL:OCR01) [C1OCR]
 2. OFFLINE  39f8dd658ead4f1fbf3c67c6d1765970 () []
Located 2 voting disk(s).
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl delete css votedisk +c1ocr
CRS-4611: Successful deletion of voting disk +c1ocr.
[oracle@h1-o12-cl1 ~]$ sqlplus / as sysasm
SQL> drop diskgroup c1ocr force including contents;
Diskgroup dropped.
SQL> create diskgroup c1ocr normal redundancy disk 'ORCL:OCR01', 'ORCL:OCR02', 'ORCL:OCR03' attribute 'compatible.asm' = '12.1';
Diskgroup created.

The OCR must be restored before the voting files can be recreated. Note that voting files cannot be added to an ASM disk group with crsctl add css votedisk; the attempt returns the following error.

[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl add css votedisk +c1ocr
CRS-4671: This command is not supported for ASM diskgroups.
CRS-4000: Command Add failed, or completed with errors.

Locate the most recent OCR backup and restore it.

[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -restore /u01/app/12.1.0/grid/cdata/o12-cl1/backup00.ocr
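
Locating the most recent backup could be sketched as below. The helper name newest_ocr_backup is hypothetical, and the sketch assumes that `ocrconfig -showbackup` prints one backup per line as "node date time path", the same layout as the ocrconfig -manualbackup listing at the end of this HowTo. Any other lines (blank lines, PROT- messages) are ignored.

```shell
# Hypothetical helper: read `ocrconfig -showbackup` output on stdin and
# print the path of the backup with the latest timestamp.
newest_ocr_backup() {
    # Keep only "node YYYY/MM/DD HH:MM:SS path" lines, then sort on the
    # zero-padded date and time so the newest entry comes first.
    awk '$2 ~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ && NF >= 4 {
             print $2, $3, $4
         }' | sort -r | head -n 1 | awk '{ print $3 }'
}

# Usage, as root:
#   /u01/app/12.1.0/grid/bin/ocrconfig -showbackup | newest_ocr_backup
```

The first column of each line names the node that took the backup, so it is worth confirming that the chosen file is actually readable from the node where the restore is run.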

It is now possible to replace the voting disks.

[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl replace votedisk +c1ocr
Successful addition of voting disk b29fb1e424634f56bf073e20b1ee1849.
Successful addition of voting disk 1d95a0212c144f8bbf998587f5b54159.
Successful addition of voting disk 3bf4b50448f84fc5bf9eedd00fcf12cf.
Successfully replaced voting disk group with +c1ocr.
CRS-4266: Voting file(s) successfully replaced
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   b29fb1e424634f56bf073e20b1ee1849 (ORCL:OCR01) [C1OCR]
 2. ONLINE   1d95a0212c144f8bbf998587f5b54159 (ORCL:OCR02) [C1OCR]
 3. ONLINE   3bf4b50448f84fc5bf9eedd00fcf12cf (ORCL:OCR03) [C1OCR]
Located 3 voting disk(s).
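
A quick scripted check that all three voting files came back ONLINE might look like this. It is a sketch under the assumption that the query output keeps the numbered-row layout shown above; online_votedisks is a hypothetical helper name.

```shell
# Hypothetical helper: read `crsctl query css votedisk` output on stdin
# and print how many voting files are reported ONLINE.
online_votedisks() {
    # Data rows start with "1.", "2.", ... followed by the state.
    awk '$1 ~ /^[0-9]+\.$/ && $2 == "ONLINE" { n++ } END { print n + 0 }'
}

# Usage:
#   n=$(/u01/app/12.1.0/grid/bin/crsctl query css votedisk | online_votedisks)
#   [ "$n" -eq 3 ] || echo "expected 3 ONLINE voting files, found $n"
```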

Perform a quick check of the OCR. Root privileges are required for the logical corruption check.

[root@h1-o12-cl1 o12-cl1]# /u01/app/12.1.0/grid/bin/ocrcheck
...
         Device/File Name         :    +C1OCR
                                    Device/File integrity check succeeded
...
         Cluster registry integrity check succeeded
         Logical corruption check succeeded

Recreate the ASM spfile in the disk group. The spfile location that spget reports points to a file that no longer exists.

[oracle@h1-o12-cl1 ~]$ asmcmd -p
ASMCMD [+] > spget
+C1OCR/o12-cl1/ASMPARAMETERFILE/registry.253.820928619
ASMCMD [+] > ls -l +C1OCR/o12-cl1/ASMPARAMETERFILE/registry.253.820928619
ASMCMD-8002: entry 'ASMPARAMETERFILE' does not exist in directory '+C1OCR/o12-cl1/'

Search the ASM alert log for non-default parameter values and copy them into a temporary pfile. Note that we have a non-default asm_diskstring.

[oracle@h1-o12-cl1 ~]$ view /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
...
System parameters with non-default values:
  large_pool_size          = 12M
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "ORCL:*"
  asm_diskgroups           = "C1DATA1"
  asm_diskgroups           = "C1FRA1"
  asm_power_limit          = 1
...
[oracle@h1-o12-cl1 ~]$ vi /tmp/asmpfiletemp.ora
...
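
Instead of copying the values by hand, the temporary pfile could be generated from the alert log. The following is a minimal sketch, not a tested procedure: alert_to_pfile is a hypothetical helper, it assumes the "System parameters with non-default values:" layout shown above, it uses the last such block in the log, and it joins repeated parameters such as asm_diskgroups into one comma-separated list. Review the result before using it.

```shell
# Hypothetical helper: read an ASM alert log on stdin and print pfile
# lines built from the last "non-default values" block in the log.
alert_to_pfile() {
    awk '
        /^System parameters with non-default values:/ {
            inblock = 1; n = 0; split("", val); next   # start a fresh block
        }
        inblock && /^[ \t]+[a-z_0-9]+[ \t]*=/ {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            name = substr(line, 1, eq - 1);  sub(/[ \t]+$/, "", name)
            value = substr(line, eq + 1)
            sub(/^[ \t]+/, "", value);       sub(/[ \t\r]+$/, "", value)
            if (name in val) val[name] = val[name] "," value   # repeated name
            else { order[++n] = name; val[name] = value }
            next
        }
        { inblock = 0 }                                 # any other line ends the block
        END { for (i = 1; i <= n; i++) print order[i] "=" val[order[i]] }
    '
}

# Usage:
#   alert_to_pfile < /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log \
#       > /tmp/asmpfiletemp.ora
```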

Convert the pfile into an spfile.

[oracle@h1-o12-cl1 ~]$ sqlplus / as sysasm
SQL> create spfile='+c1ocr' from pfile='/tmp/asmpfiletemp.ora';
File created.

Update the GPnP profile with the new ASM file name of the replacement spfile.

[oracle@h1-o12-cl1 ~]$ asmcmd -p
ASMCMD [+] > ls -l +c1ocr/o12-cl1/ASMPARAMETERFILE
Type              Redund  Striped  Time             Sys  Name
ASMPARAMETERFILE  MIRROR  COARSE   SEP 07 15:00:00  Y    REGISTRY.253.825521871
ASMCMD [+] > spset +c1ocr/o12-cl1/ASMPARAMETERFILE/REGISTRY.253.825521871
ASMCMD [+] > spget
+c1ocr/o12-cl1/ASMPARAMETERFILE/REGISTRY.253.825521871

If we have a non-default asm_diskstring, check that it has been set in the profile and spfile. Use dsset to fix this if necessary.

ASMCMD [+] > dsget
parameter:ORCL:*
profile:ORCL:*
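
The comparison of the two values could be automated along these lines. This is a sketch that assumes dsget prints exactly the "parameter:" and "profile:" lines shown above; diskstrings_match is a hypothetical helper name.

```shell
# Hypothetical helper: read `asmcmd dsget` output on stdin and succeed
# (exit status 0) only if both values are present and identical.
diskstrings_match() {
    awk '
        /^parameter:/ { param = substr($0, 11); seen_param = 1 }
        /^profile:/   { prof  = substr($0, 9);  seen_prof  = 1 }
        END { exit ((seen_param && seen_prof && param == prof) ? 0 : 1) }
    '
}

# Usage:
#   asmcmd dsget | diskstrings_match || echo "disk string differs; fix with dsset"
```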

CRS is required to be running in non-exclusive mode in order to recreate the ASM password file in the disk group. This means some 'application' cluster resources will be started and available now.

[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl stop crs -f
...
CRS-4133: Oracle High Availability Services has been stopped.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl start crs -wait
...
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

Recreate the ASM password file in the disk group.

[oracle@h1-o12-cl1 ~]$ asmcmd -p
ASMCMD [+] > pwget --asm
+C1OCR/orapwASM
ASMCMD [+] > pwdelete --asm
OPW-00022: The password file does not exist.
ASMCMD-9462: could not delete password file
ASMCMD [+] > pwcreate --asm +C1OCR/orapwASM topsecret
ASMCMD [+] > pwget --asm
+C1OCR/orapwasm
ASMCMD [+] > ls -l +C1OCR/orapwasm
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   SEP 07 16:00:00  N    orapwasm => +C1OCR/ASM/PASSWORD/pwdasm.256.825523875
ASMCMD [+] > ls -l +C1OCR/ASM/PASSWORD/pwdasm.256.825523875
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   SEP 07 16:00:00  Y    pwdasm.256.825523875

Start CRS on the remaining cluster node(s).

[root@h2-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
...
CRS-6016: Resource auto-start has completed for server h2-o12-cl1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

Run a cluster verification of all cluster nodes.

[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/cluvfy comp ocr -n all -verbose
Verifying OCR integrity
Checking OCR integrity...
Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations
Checking OCR config file "/etc/oracle/ocr.loc"...
OCR config file "/etc/oracle/ocr.loc" check successful
Disk group for ocr location "+C1OCR" is available on all the nodes
Checking OCR dump functionality
OCR dump check passed
NOTE:
This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.
OCR integrity check passed
Verification of OCR integrity was successful.

Perform a manual backup of the OLR on each cluster node.

[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -local -manualbackup
h1-o12-cl1     2013/09/07 16:38:49     /u01/app/12.1.0/grid/cdata/h1-o12-cl1/backup_20130907_163849.olr
h1-o12-cl1     2013/07/16 11:49:13     /u01/app/12.1.0/grid/cdata/h1-o12-cl1/backup_20130716_114913.olr
[root@h2-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -local -manualbackup
h2-o12-cl1     2013/09/07 16:39:18     /u01/app/12.1.0/grid/cdata/h2-o12-cl1/backup_20130907_163918.olr
h2-o12-cl1     2013/07/16 11:56:16     /u01/app/12.1.0/grid/cdata/h2-o12-cl1/backup_20130716_115616.olr

Re-add any CRS resources that were added after the restored OCR backup was taken. Then back up the OCR.

[root@h2-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -manualbackup
h1-o12-cl1     2013/09/07 16:41:05     /u01/app/12.1.0/grid/cdata/o12-cl1/backup_20130907_164105.ocr
h1-o12-cl1     2013/09/07 04:13:05     /u01/app/12.1.0/grid/cdata/o12-cl1/backup00.ocr
h1-o12-cl1     2013/07/16 11:56:16     /u01/app/12.1.0/grid/cdata/o12-cl1/backup_20130716_115616.ocr
h1-o12-cl1     2013/07/16 11:49:14     /u01/app/12.1.0/grid/cdata/o12-cl1/backup_20130716_114914.ocr
