Summary
This HowTo shows how to recreate a lost OCR ASM disk group from scratch. It applies to Oracle Database 12c RAC. We will rebuild the disk group, restore the OCR, replace the voting disks, and recreate the ASM parameter and password files.
Destruction
First we will destroy the OCR disk group by wiping the headers of two of its three disks. Because this is a normal redundancy disk group, losing two disks is sufficient to make the disk group's data unrecoverable.
[oracle@h2-o12-cl1 ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/OCR02 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.335665 s, 312 MB/s
[oracle@h2-o12-cl1 ~]$ dd if=/dev/zero of=/dev/oracleasm/disks/OCR03 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 0.379662 s, 276 MB/s
CRS will eventually realise what has happened.
2013-09-07 13:11:07.820: [cssd(2494)]CRS-1604:CSSD voting file is offline: ORCL:OCR02; details at (:CSSNM00069:) in /u01/app/12.1.0/grid/log/h1-o12-cl1/cssd/ocssd.log.
2013-09-07 13:11:07.820: [cssd(2494)]CRS-1626:A Configuration change request completed successfully
2013-09-07 13:11:07.885: [cssd(2494)]CRS-1601:CSSD Reconfiguration complete. Active nodes are h1-o12-cl1 h2-o12-cl1 .
2013-09-07 13:11:13.303: [crsd(2666)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/12.1.0/grid/log/h1-o12-cl1/crsd/crsd.log.
2013-09-07 13:11:13.313: [crsd(2666)]CRS-1006:The OCR location is inaccessible. Details in /u01/app/12.1.0/grid/log/h1-o12-cl1/crsd/crsd.log.
...
2013-09-07 13:11:26.008: [crsd(10140)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/12.1.0/grid/log/h1-o12-cl1/crsd/crsd.log.
The ora.crsd and ora.storage resources are now shown as OFFLINE.
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl status resource -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
...
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       h2-o12-cl1               STABLE
...
ora.storage
      1        ONLINE  OFFLINE      h2-o12-cl1               STARTING
--------------------------------------------------------------------------------
The cluster nodes may or may not survive this loss of disks. If they do survive, you have the opportunity to plan some cluster-wide downtime for the rebuild. If they do not, you will need to start the OCR disk group rebuild immediately.
Clean up
Shut down any database(s) manually, then shut down CRS on any cluster node(s) where it is still running. Be patient, as this may take longer than usual to complete.
[root@h1-o12-cl1 ~]# time /u01/app/12.1.0/grid/bin/crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'h1-o12-cl1'
...
CRS-5017: The resource action "ora.asm stop" encountered the following error:
ORA-15097: cannot SHUTDOWN ASM instance with connected client (process 2675)
. For details refer to "(:CLSN00108:)" in "/u01/app/12.1.0/grid/log/h1-o12-cl1/agent/ohasd/oraagent_oracle/oraagent_oracle.log".
CRS-2675: Stop of 'ora.asm' on 'h1-o12-cl1' failed
CRS-2679: Attempting to clean 'ora.asm' on 'h1-o12-cl1'
CRS-2681: Clean of 'ora.asm' on 'h1-o12-cl1' succeeded
...
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'h1-o12-cl1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
real 10m12.160s
user 0m0.065s
sys 0m0.097s
Terminate any remaining processes, such as TNS listeners and ONS, on all cluster nodes.
[oracle@h1-o12-cl1 ~]$ ps -ef | grep grid
oracle 2808 1 0 12:12 ? 00:00:00 /u01/app/12.1.0/grid/opmn/bin/ons -d
oracle 2809 2808 0 12:12 ? 00:00:00 /u01/app/12.1.0/grid/opmn/bin/ons -d
oracle 2836 1 0 12:12 ? 00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr LISTENER_SCAN3 -no_crs_notify -inherit
oracle 2848 1 0 12:12 ? 00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit
oracle 2859 1 0 12:12 ? 00:00:00 /u01/app/12.1.0/grid/bin/tnslsnr LISTENER_SCAN2 -no_crs_notify -inherit
[oracle@h1-o12-cl1 ~]$ kill 2808 2809 2836 2848 2859
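Picking the PIDs out of the ps listing by eye is error-prone when several nodes have to be cleaned. The following is a minimal sketch of a filter that does it, assuming the standard ps -ef column layout (command in the eighth column). The name grid_pids is hypothetical and not part of any Oracle tooling; the Grid home path is passed in as an argument.

```shell
#!/usr/bin/env bash
# grid_pids (hypothetical helper): read `ps -ef` output on stdin and print
# the PID of every process whose executable lives under the Grid home in $1.
grid_pids() {
  awk -v home="$1" '
    # ps -ef columns: UID PID PPID C STIME TTY TIME CMD (CMD starts at field 8).
    # Matching on the start of field 8 skips the "grep grid" process itself.
    index($8, home "/") == 1 { print $2 }
  '
}

# Usage (review the list first, then kill):
#   ps -ef | grid_pids /u01/app/12.1.0/grid
#   ps -ef | grid_pids /u01/app/12.1.0/grid | xargs -r kill   # -r is GNU xargs
```

Because the match is anchored to the start of the command column, unrelated processes that merely mention the word grid in their arguments are not selected.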
Deconfigure any IPv4 and IPv6 VIPs that remain configured. Perform these steps for all cluster nodes.
[root@h1-o12-cl1 ~]# ip addr del 10.1.2.92/24 dev eth0
...
[root@h1-o12-cl1 ~]# ip addr del 2001:123:4:5::3:5e/64 dev eth0
...
Rebuild
We are now ready to begin. Re-label the ASM disks.
[root@h1-o12-cl1 ~]# oracleasm listdisks
C1DATA101
C1DATA102
C1DATA103
C1FRA101
C1FRA102
C1FRA103
OCR01
OCR02
OCR03
[root@h1-o12-cl1 ~]# oracleasm scandisks
Reloading disk partitions: sda: sda1 sdb: sdb1 sdd: sdd1 sdi: sdi1 sdh: sdh1 sdf: sdf1 sdg: sdg1 sdc: sdc1 sde: sde1 done
Cleaning any stale ASM disks...
Cleaning disk "OCR02"
Cleaning disk "OCR03"
Scanning system for ASM disks...
[root@h1-o12-cl1 ~]# oracleasm createdisk OCR02 /dev/sdd1
Writing disk header: done
Instantiating disk: done
[root@h1-o12-cl1 ~]# oracleasm createdisk OCR03 /dev/sdg1
Writing disk header: done
Instantiating disk: done
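The scandisks output tells us which labels were cleaned as stale and therefore need createdisk. As a rough sketch, the filter below pulls those names out of saved scandisks output. It assumes the 'Cleaning disk "NAME"' message format seen in this HowTo; stale_asm_disks is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# stale_asm_disks (hypothetical helper): read `oracleasm scandisks` output on
# stdin and print the label of each disk that was reported as cleaned.
stale_asm_disks() {
  # Only lines of the form: Cleaning disk "LABEL"
  # The summary line "Cleaning any stale ASM disks..." does not match.
  sed -n 's/^Cleaning disk "\([^"]*\)".*/\1/p'
}

# Usage:
#   oracleasm scandisks | tee /tmp/scandisks.out
#   stale_asm_disks < /tmp/scandisks.out
```

The helper only reports labels. Which block device each label belongs on (for example /dev/sdd1 for OCR02 here) still has to be confirmed by hand before running createdisk.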
Re-scan the disk headers on the remaining cluster node(s).
[root@h2-o12-cl1 ~]# oracleasm scandisks
Reloading disk partitions: sda: sda1 sdc: sdc1 sdf: sdf1 sde: sde1 sdh: sdh1 sdb: sdb1 sdd: sdd1 sdg: sdg1 sdi: sdi1 done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Start ASM on one node in exclusive mode.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
...
CRS-2672: Attempting to start 'ora.asm' on 'h1-o12-cl1'
CRS-2676: Start of 'ora.asm' on 'h1-o12-cl1' succeeded
Check CRSD is down and stop it if it is not.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl status resource ora.crsd -init
NAME=ora.crsd
TYPE=ora.crs.type
TARGET=OFFLINE
STATE=OFFLINE
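The check-then-stop step can be scripted against the NAME=/STATE= output format. This is a minimal sketch, assuming that format; resource_is_offline is a hypothetical helper name. It treats a missing resource as "not confirmed offline" so that a parsing problem never silently skips the stop.

```shell
#!/usr/bin/env bash
# resource_is_offline (hypothetical helper): read `crsctl status resource`
# output (NAME=/TYPE=/TARGET=/STATE= lines) on stdin and succeed only if the
# resource named in $1 reports a STATE beginning with OFFLINE.
resource_is_offline() {
  local state
  state=$(awk -F= -v name="$1" '
    $1 == "NAME"                     { current = $2 }
    $1 == "STATE" && current == name { print $2; exit }
  ')
  # STATE can carry a suffix such as "ONLINE on <node>"; compare the first word.
  [ "${state%% *}" = "OFFLINE" ]
}

# Usage (as root):
#   crsctl status resource ora.crsd -init | resource_is_offline ora.crsd ||
#     crsctl stop resource ora.crsd -init
```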
If we attempt to drop the corrupted disk group now, we will receive the following error, as ASM is still aware of the voting file present on the remaining disk.
ORA-15039: diskgroup not dropped
ORA-15276: ASM diskgroup C1OCR has cluster voting files
So we will first delete the voting files, then drop the corrupted disk group and recreate it.
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   dcc6f60295074f73bf58d22bee1c0da1 (ORCL:OCR01) [C1OCR]
 2. OFFLINE  39f8dd658ead4f1fbf3c67c6d1765970 () []
Located 2 voting disk(s).
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl delete css votedisk +c1ocr
CRS-4611: Successful deletion of voting disk +c1ocr.
[oracle@h1-o12-cl1 ~]$ sqlplus / as sysasm
SQL> drop diskgroup c1ocr force including contents;
Diskgroup dropped.
SQL> create diskgroup c1ocr normal redundancy disk 'ORCL:OCR01', 'ORCL:OCR02', 'ORCL:OCR03' attribute 'compatible.asm' = '12.1';
Diskgroup created.
The OCR must be restored before the voting files can be recreated. Note that voting files cannot be placed in an ASM disk group with the add command; attempting it returns the following error. We will use the replace command instead, once the OCR is back.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl add css votedisk +c1ocr
CRS-4671: This command is not supported for ASM diskgroups.
CRS-4000: Command Add failed, or completed with errors.
Locate the most recent OCR backup and restore it.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -restore /u01/app/12.1.0/grid/cdata/o12-cl1/backup00.ocr
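If there are many backups, the newest one can be picked from a listing mechanically. The sketch below assumes a listing in the "node date time path" layout that ocrconfig prints for its backups (for example from ocrconfig -showbackup, where that command is usable in your situation); latest_ocr_backup is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# latest_ocr_backup (hypothetical helper): read a backup listing on stdin with
# lines of the form "<node> YYYY/MM/DD HH:MM:SS <path>" and print the path of
# the entry with the newest timestamp. Lines in any other shape are ignored.
latest_ocr_backup() {
  awk '
    $2 ~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ &&
    $3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { print $2, $3, $4 }
  ' | LC_ALL=C sort | tail -n 1 | awk '{ print $3 }'
}

# Usage (as root); check the chosen file before restoring it:
#   ocrconfig -showbackup | latest_ocr_backup
```

Fixed-width timestamps in year/month/day order sort correctly as plain text, which is why no date parsing is needed. Confirm that the chosen backup predates the loss of the disk group and note what has changed in the cluster since it was taken.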
It is now possible to replace the voting disks.
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl replace votedisk +c1ocr
Successful addition of voting disk b29fb1e424634f56bf073e20b1ee1849.
Successful addition of voting disk 1d95a0212c144f8bbf998587f5b54159.
Successful addition of voting disk 3bf4b50448f84fc5bf9eedd00fcf12cf.
Successfully replaced voting disk group with +c1ocr.
CRS-4266: Voting file(s) successfully replaced
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   b29fb1e424634f56bf073e20b1ee1849 (ORCL:OCR01) [C1OCR]
 2. ONLINE   1d95a0212c144f8bbf998587f5b54159 (ORCL:OCR02) [C1OCR]
 3. ONLINE   3bf4b50448f84fc5bf9eedd00fcf12cf (ORCL:OCR03) [C1OCR]
Located 3 voting disk(s).
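For a normal redundancy disk group we expect three ONLINE voting files after the replace. A quick way to assert that in a script is to count the ONLINE rows of the query output. This is only a sketch, assuming the row layout of crsctl query css votedisk seen in this HowTo; online_votedisk_count is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# online_votedisk_count (hypothetical helper): read the output of
# `crsctl query css votedisk` on stdin and print how many voting files are
# listed with STATE ONLINE.
online_votedisk_count() {
  awk '
    # Data rows start with an index such as "1." followed by the state.
    $1 ~ /^[0-9]+\.$/ && $2 == "ONLINE" { n++ }
    END { print n + 0 }
  '
}

# Usage:
#   count=$(crsctl query css votedisk | online_votedisk_count)
#   [ "$count" -eq 3 ] || echo "expected 3 ONLINE voting files, found $count" >&2
```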
Perform a quick check of the OCR. Root privileges are required for the logical corruption check.
[root@h1-o12-cl1 o12-cl1]# /u01/app/12.1.0/grid/bin/ocrcheck
...
Device/File Name : +C1OCR
Device/File integrity check succeeded
...
Cluster registry integrity check succeeded
Logical corruption check succeeded
Recreate the ASM spfile in the disk group. The spfile location that is still recorded points to a file that no longer exists.
[oracle@h1-o12-cl1 ~]$ asmcmd -p
ASMCMD [+] > spget
+C1OCR/o12-cl1/ASMPARAMETERFILE/registry.253.820928619
ASMCMD [+] > ls -l +C1OCR/o12-cl1/ASMPARAMETERFILE/registry.253.820928619
ASMCMD-8002: entry 'ASMPARAMETERFILE' does not exist in directory '+C1OCR/o12-cl1/'
Search the ASM alert log for non-default parameter values and copy them into a temporary pfile. Note that we have a non-default asm_diskstring.
[oracle@h1-o12-cl1 ~]$ view /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
...
System parameters with non-default values:
  large_pool_size          = 12M
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "ORCL:*"
  asm_diskgroups           = "C1DATA1"
  asm_diskgroups           = "C1FRA1"
  asm_power_limit          = 1
...
[oracle@h1-o12-cl1 ~]$ vi /tmp/asmpfiletemp.ora
...
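Copying the parameters by hand works, but the block can also be extracted with a small filter. The sketch below takes the last "System parameters with non-default values:" block in the alert log, on the assumption that the parameter lines are indented "name = value" lines and that the block ends at the first line of any other shape. extract_nondefault_params is a hypothetical helper name; review the result before using it as a pfile.

```shell
#!/usr/bin/env bash
# extract_nondefault_params (hypothetical helper): read an ASM alert log on
# stdin and print the parameter lines of the most recent
# "System parameters with non-default values:" block, without indentation.
extract_nondefault_params() {
  awk '
    # A new block header discards any earlier block, so the last one wins.
    /System parameters with non-default values:/ { buf = ""; inblock = 1; next }
    # Assumed shape of a parameter line: leading blanks, a name, then "=".
    inblock && /^[ \t]+[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ {
      sub(/^[ \t]+/, "")
      buf = buf $0 "\n"
      next
    }
    # Any other line ends the current block.
    inblock { inblock = 0 }
    END { printf "%s", buf }
  '
}

# Usage:
#   extract_nondefault_params \
#     < /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log \
#     > /tmp/asmpfiletemp.ora
```

Taking the last block rather than the first matters because the alert log accumulates one such block per instance startup, and the most recent startup reflects the settings that were in force when the disk group was lost.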
Convert the pfile into an spfile.
[oracle@h1-o12-cl1 ~]$ sqlplus / as sysasm
SQL> create spfile='+c1ocr' from pfile='/tmp/asmpfiletemp.ora';
File created.
Update the registry with the new ASM file name of the replacement spfile.
[oracle@h1-o12-cl1 ~]$ asmcmd -p
ASMCMD [+] > ls -l +c1ocr/o12-cl1/ASMPARAMETERFILE
Type              Redund  Striped  Time             Sys  Name
ASMPARAMETERFILE  MIRROR  COARSE   SEP 07 15:00:00  Y    REGISTRY.253.825521871
ASMCMD [+] > spset +c1ocr/o12-cl1/ASMPARAMETERFILE/REGISTRY.253.825521871
ASMCMD [+] > spget
+c1ocr/o12-cl1/ASMPARAMETERFILE/REGISTRY.253.825521871
If we have a non-default asm_diskstring, check that it has been set both in the spfile (shown as "parameter") and in the GPnP profile (shown as "profile"). Use dsset to fix this if necessary.
ASMCMD [+] > dsget
parameter:ORCL:*
profile:ORCL:*
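The comparison of the two dsget values can be automated. This is a small sketch that succeeds only when both lines are present and carry the same value, assuming the "parameter:" and "profile:" prefixes seen in this HowTo; diskstring_in_sync is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# diskstring_in_sync (hypothetical helper): read `asmcmd dsget` output on
# stdin and succeed only if the parameter and profile discovery strings are
# both present and identical.
diskstring_in_sync() {
  awk '
    /^parameter:/ { p = substr($0, length("parameter:") + 1); have_p = 1 }
    /^profile:/   { q = substr($0, length("profile:") + 1);   have_q = 1 }
    # Exit status 0 when in sync, 1 otherwise.
    END { exit !(have_p && have_q && p == q) }
  '
}

# Usage:
#   asmcmd dsget | diskstring_in_sync ||
#     echo "asm_diskstring differs between spfile and profile; fix with dsset" >&2
```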
CRS must be running in non-exclusive mode in order to recreate the ASM password file in the disk group. This means that some 'application' cluster resources will be started and become available at this point.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl stop crs -f
...
CRS-4133: Oracle High Availability Services has been stopped.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl start crs -wait
...
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
Recreate the ASM password file in the disk group.
[oracle@h1-o12-cl1 ~]$ asmcmd -p
ASMCMD [+] > pwget --asm
+C1OCR/orapwASM
ASMCMD [+] > pwdelete --asm
OPW-00022: The password file does not exist.
ASMCMD-9462: could not delete password file
ASMCMD [+] > pwcreate --asm +C1OCR/orapwASM topsecret
ASMCMD [+] > pwget --asm
+C1OCR/orapwasm
ASMCMD [+] > ls -l +C1OCR/orapwasm
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   SEP 07 16:00:00  N    orapwasm => +C1OCR/ASM/PASSWORD/pwdasm.256.825523875
ASMCMD [+] > ls -l +C1OCR/ASM/PASSWORD/pwdasm.256.825523875
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   SEP 07 16:00:00  Y    pwdasm.256.825523875
Start CRS on the remaining cluster node(s).
[root@h2-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
...
CRS-6016: Resource auto-start has completed for server h2-o12-cl1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
Run a cluster verification of all cluster nodes.
[oracle@h1-o12-cl1 ~]$ /u01/app/12.1.0/grid/bin/cluvfy comp ocr -n all -verbose
Verifying OCR integrity
Checking OCR integrity...
Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations
Checking OCR config file "/etc/oracle/ocr.loc"...
OCR config file "/etc/oracle/ocr.loc" check successful
Disk group for ocr location "+C1OCR" is available on all the nodes
Checking OCR dump functionality
OCR dump check passed
NOTE: This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.
OCR integrity check passed
Verification of OCR integrity was successful.
Perform a manual backup of the OLR on each cluster node.
[root@h1-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -local -manualbackup
h1-o12-cl1 2013/09/07 16:38:49 /u01/app/12.1.0/grid/cdata/h1-o12-cl1/backup_20130907_163849.olr
h1-o12-cl1 2013/07/16 11:49:13 /u01/app/12.1.0/grid/cdata/h1-o12-cl1/backup_20130716_114913.olr
[root@h2-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -local -manualbackup
h2-o12-cl1 2013/09/07 16:39:18 /u01/app/12.1.0/grid/cdata/h2-o12-cl1/backup_20130907_163918.olr
h2-o12-cl1 2013/07/16 11:56:16 /u01/app/12.1.0/grid/cdata/h2-o12-cl1/backup_20130716_115616.olr
Re-add any CRS resources that were added after the OCR backup we restored was taken. Then back up the OCR.
[root@h2-o12-cl1 ~]# /u01/app/12.1.0/grid/bin/ocrconfig -manualbackup
h1-o12-cl1 2013/09/07 16:41:05 /u01/app/12.1.0/grid/cdata/o12-cl1/backup_20130907_164105.ocr
h1-o12-cl1 2013/09/07 04:13:05 /u01/app/12.1.0/grid/cdata/o12-cl1/backup00.ocr
h1-o12-cl1 2013/07/16 11:56:16 /u01/app/12.1.0/grid/cdata/o12-cl1/backup_20130716_115616.ocr
h1-o12-cl1 2013/07/16 11:49:14 /u01/app/12.1.0/grid/cdata/o12-cl1/backup_20130716_114914.ocr
Thank you for this comprehensive explanation!
It is great.
Thomas