The views expressed here are solely my own. Please make sure you test any code/queries that appear on my blog before applying them to a production environment.

Sunday, June 19, 2011

Problem starting the second node of RAC after OS upgrade from IBM AIX v5.3 to v6.1

We decided to test the IBM AIX OS upgrade in a test Oracle RAC environment. We wanted to perform the upgrade in a rolling fashion so as not to cause any downtime in the production environment, which is a two-node Oracle 10gR2 RAC database.

After the Unix team upgraded the OS on the second RAC node from IBM AIX v5.3 to v6.1, we restarted that node and saw that the Oracle CRS processes did not start successfully on it.

We checked the CRS services
[root@srvdb01]:/home/root > crsstat
HA Resource                       Target  State
-----------                       ------  -----
ora.ORCL.ORCL1.inst               ONLINE  ONLINE on srvdb01
ora.ORCL.db                       ONLINE  ONLINE on srvdb01
ora.srvdb01.ASM1.asm              ONLINE  ONLINE on srvdb01
ora.srvdb01.LISTENER_SRVDB01.lsnr ONLINE  ONLINE on srvdb01
ora.srvdb01.gsd                   ONLINE  ONLINE on srvdb01
ora.srvdb01.ons                   ONLINE  ONLINE on srvdb01
ora.srvdb01.vip                   ONLINE  ONLINE on srvdb01
ora.ORCL.ORCL2.inst               OFFLINE OFFLINE
ora.srvdb02.ASM2.asm              OFFLINE OFFLINE
ora.srvdb02.LISTENER_SRVDB02.lsnr OFFLINE OFFLINE
ora.srvdb02.gsd                   OFFLINE OFFLINE
ora.srvdb02.ons                   OFFLINE OFFLINE
ora.srvdb02.vip                   ONLINE  ONLINE on srvdb02

As you can see from the output above, "ora.srvdb02.vip" had not automatically moved to node1, so we first relocated that VIP resource to node1, because client sessions trying to connect to the database through this VIP address were getting connection errors.
[root@srvdb01]:/home/root > $CRS_HOME/bin/crs_relocate ora.srvdb02.vip

Status after relocate,
[root@srvdb01]:/home/root > crsstat
HA Resource                       Target State
-----------                       ------  -----
ora.ORCL.ORCL1.inst               ONLINE  ONLINE on srvdb01
ora.ORCL.db                       ONLINE  ONLINE on srvdb01
ora.srvdb01.ASM1.asm              ONLINE  ONLINE on srvdb01
ora.srvdb01.LISTENER_SRVDB01.lsnr ONLINE  ONLINE on srvdb01
ora.srvdb01.gsd                   ONLINE  ONLINE on srvdb01
ora.srvdb01.ons                   ONLINE  ONLINE on srvdb01
ora.srvdb01.vip                   ONLINE  ONLINE on srvdb01
ora.ORCL.ORCL2.inst               OFFLINE OFFLINE
ora.srvdb02.ASM2.asm              OFFLINE OFFLINE
ora.srvdb02.LISTENER_SRVDB02.lsnr OFFLINE OFFLINE
ora.srvdb02.gsd                   OFFLINE OFFLINE
ora.srvdb02.ons                   OFFLINE OFFLINE
ora.srvdb02.vip                   ONLINE  ONLINE on srvdb01

By the way, if you do not have the "crsstat" command on your server, you can use the "crs_stat -t" command instead, or you can put the following script in a folder and add its path to the "PATH" environment variable in the "oracle" user's profile. The "crs_stat -t" command truncates the names in the "HA Resource" column, whereas the "crsstat" script below displays them in full.
[oracle@srvdb01]:/oracle/uural > more crsstat
#!/usr/bin/ksh
#
# Sample 10g CRS resource status query script
#
# Description:
#    - Returns formatted version of crs_stat -t, in tabular
#      format, with the complete rsc names and filtering keywords
#   - The argument, $RSC_KEY, is optional and if passed to the script, will
#     limit the output to HA resources whose names match $RSC_KEY.
# Requirements:
#   - $ORA_CRS_HOME should be set in your environment

RSC_KEY=$1
QSTAT=-u
#AWK=/usr/xpg4/bin/awk    # if not available use /usr/bin/awk
AWK=/usr/bin/awk

# Table header:
echo ""
$AWK \
'BEGIN {printf "%-45s %-10s %-18s\n", "HA Resource", "Target", "State";
printf "%-45s %-10s %-18s\n", "-----------", "------", "-----";}'

# Table body:
$ORA_CRS_HOME/bin/crs_stat $QSTAT | $AWK \
'BEGIN { FS="="; state = 0; }
$1~/NAME/ && $2~/'$RSC_KEY'/ {appname = $2; state=1};
state == 0 {next;}
$1~/TARGET/ && state == 1 {apptarget = $2; state=2;}
$1~/STATE/ && state == 2 {appstate = $2; state=3;}
state == 3 {printf "%-45s %-10s %-18s\n", appname, apptarget, appstate; state=0;}'

[oracle@srvdb01]:/oracle/uural >
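If you go the script route, you will also need to make it executable and set up the environment it expects. A minimal sketch is below, assuming the script is saved in /oracle/uural (the oracle user's home directory here) and that the CRS home is /oracle/crshome1 as in the log path further down; adjust the paths for your own environment.

# run as the oracle user; the paths below are assumptions for this environment
chmod u+x /oracle/uural/crsstat

# the script expects ORA_CRS_HOME to be set, and its directory must be in PATH
cat >> /oracle/uural/.profile <<'EOF'
export ORA_CRS_HOME=/oracle/crshome1
export PATH=$PATH:/oracle/uural
EOF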

We then looked into why the CRS services were down on the second node after the OS upgrade by checking the CRS-related log files on node2, where we saw the following error.
[root@srvdb02]:/oracle/crshome1/log/srvdb02/client > vi ocrcheck_327920.log
Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle.  All rights reserved.
2011-06-14 16:13:48.455: [OCRCHECK][1]ocrcheck starts...
2011-06-14 16:13:48.457: [  OCROSD][1]utopen:7:failed to open OCR file/disk /dev/ocr_disk1 /dev/ocr_disk2, errno=19, os err string=No such device
2011-06-14 16:13:48.457: [  OCRRAW][1]proprinit: Could not open raw device
2011-06-14 16:13:48.457: [ default][1]a_init:7!: Backend init unsuccessful : [26]
2011-06-14 16:13:48.457: [OCRCHECK][1]Failed to access OCR repository: [PROC-26: Error while accessing the physical storage Operating System error [No such device] [19]]
2011-06-14 16:13:48.457: [OCRCHECK][1]Failed to initialize ocrchek2
2011-06-14 16:13:48.457: [OCRCHECK][1]Exiting [status=failed]...

It looked as if node2 could not access the OCR disks at all. We checked the OCR disk configuration on node2.
[root@srvdb02]:/home/root > ls -l /dev/ocr*
crw-r-----    1 root     dba          23, 27 Jun 08 15:14 /dev/ocr_disk1
crw-r-----    1 root     dba          23,  5 Jun 08 15:16 /dev/ocr_disk2

When we first installed Oracle RAC, we created the OCR disks as below
mknod /dev/ocr_disk1 c 23 27
mknod /dev/ocr_disk2 c 23 5

Querying the corresponding hdisks by those device "major, minor" numbers gave no results
[root@srvdb02]:/home/root > ls -l /dev/hdisk* | grep "23, 27"
[root@srvdb02]:/home/root > ls -l /dev/hdisk* | grep "23,  5"

We realized that the major numbers of the hdisks had changed after the OS upgrade.
[root@srvdb02]:/home/root > ls -l /dev/hdisk* | grep "21, 27"
brw-------    1 root     system       21, 27 Jun 14 15:34 /dev/hdisk25
[root@srvdb02]:/home/root > ls -l /dev/hdisk* | grep "21,  5"
brw-------    1 root     system       21,  5 Jun 14 15:34 /dev/hdisk3

But how could we be sure that the minor numbers had not changed after the OS upgrade? To check this, we needed to go one level deeper and compare the LUN IDs of those hdisks with the ones on node1, which was still running IBM AIX v5.3.

We checked the ocr disk configuration on node1
[root@srvdb01]:/home/root > ls -l /dev/ocr*
crw-r-----    1 root     dba          23,  4 Jun 19 15:52 /dev/ocr_disk1
crw-r-----    1 root     dba          23,  5 Jun 19 15:52 /dev/ocr_disk2
[root@srvdb01]:/home/root > ls -l /dev/hdisk* | grep "23,  4"
brw-------    1 root     system       23,  4 May 16 2008  /dev/hdisk2
[root@srvdb01]:/home/root > ls -l /dev/hdisk* | grep "23,  5"
brw-------    1 root     system       23,  5 May 16 2008  /dev/hdisk3

Find the LUN ids of those hdisks on node1
[root@srvdb01]:/home/root > for i in 2 3
> do
> lsattr -El hdisk$i | grep reserve_policy | awk '{print $1,$2 }'| read rp1 rp2
> lsattr -El hdisk$i | grep pvid | awk '{print $1,$2 }'| read pv1 pv2
> lsattr -El hdisk$i | grep lun_id | awk '{print $1,$2 }'| read li1 li2
> if [ "$li1" != "" ]
> then
> echo hdisk$i' -> '$li1' = '$li2' / '$rp1' = '$rp2' / '$pv1' = '$pv2
> fi
> done
hdisk2 -> lun_id = 0x0001000000000000 / reserve_policy = no_reserve / pvid = none
hdisk3 -> lun_id = 0x0002000000000000 / reserve_policy = no_reserve / pvid = none

Find the LUN ids of hdisks on node2 which were configured as ocr disks
[root@srvdb02]:/home/root > for i in 25 3
> do
> lsattr -El hdisk$i | grep reserve_policy | awk '{print $1,$2 }'| read rp1 rp2
> lsattr -El hdisk$i | grep pvid | awk '{print $1,$2 }'| read pv1 pv2
> lsattr -El hdisk$i | grep lun_id | awk '{print $1,$2 }'| read li1 li2
> if [ "$li1" != "" ]
> then
> echo hdisk$i' -> '$li1' = '$li2' / '$rp1' = '$rp2' / '$pv1' = '$pv2
> fi
> done
hdisk25 -> lun_id = 0x0018000000000000 / reserve_policy = single_path / pvid = none
hdisk3 -> lun_id = 0x0002000000000000 / reserve_policy = single_path / pvid = none

Comparing the LUN IDs of those disks on both nodes, we saw that they did not match, which means that not only the major numbers but also the minor numbers of the hdisks had changed on node2 after the OS upgrade.
Since we could not find much information about this problem on the Internet, we discussed it with the Unix team and concluded that the change in the disk major/minor numbers was most likely caused by IBM AIX v6.1 switching to MPIO for storage management, whereas IBM AIX v5.3 had been using RDAC. The Unix team said they had tried switching from MPIO back to RDAC after the upgrade on node2; this restored the original major numbers, but the minor numbers were still scrambled, which caused the CRS services on RAC node2 to fail to start.
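If you want to check for yourself which multipathing driver is managing a disk after such an upgrade, a quick look like the one below can help (just a sketch; the exact device descriptions you see depend on your storage driver and configuration).

# list the disk together with its driver description
# (MPIO-managed disks typically show an "MPIO ... Disk Drive" description)
lsdev -Cc disk | grep -w hdisk3

# for MPIO-managed disks, list the paths the system knows about
lspath -l hdisk3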

In a situation like this, if the Unix administrators are not able to bring back the original major/minor numbers of the hdisks, what you can do is find the corresponding hdisks on node2 by matching the LUN IDs, remove the OCR device definitions and recreate them with the correct hdisk major/minor numbers, and then try to start CRS on node2.

I will demonstrate how to find the matching hdisks on node2 by using LUN ids on node1.
List the LUN ids on node2
[root@srvdb02]:/home/root > for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
> do
> lsattr -El hdisk$i | grep reserve_policy | awk '{print $1,$2 }'| read rp1 rp2
> lsattr -El hdisk$i | grep pvid | awk '{print $1,$2 }'| read pv1 pv2
> lsattr -El hdisk$i | grep lun_id | awk '{print $1,$2 }'| read li1 li2
> if [ "$li1" != "" ]
> then
> echo hdisk$i' -> '$li1' = '$li2' / '$rp1' = '$rp2' / '$pv1' = '$pv2
> fi
> done
hdisk1 -> lun_id = 0x0000000000000000 / reserve_policy = single_path / pvid = none
hdisk2 -> lun_id = 0x0001000000000000 / reserve_policy = single_path / pvid = none
hdisk3 -> lun_id = 0x0002000000000000 / reserve_policy = single_path / pvid = none
hdisk4 -> lun_id = 0x0003000000000000 / reserve_policy = single_path / pvid = none
hdisk5 -> lun_id = 0x0004000000000000 / reserve_policy = single_path / pvid = none
hdisk6 -> lun_id = 0x0005000000000000 / reserve_policy = single_path / pvid = none
hdisk7 -> lun_id = 0x0006000000000000 / reserve_policy = single_path / pvid = none
hdisk8 -> lun_id = 0x0007000000000000 / reserve_policy = single_path / pvid = none
hdisk9 -> lun_id = 0x0008000000000000 / reserve_policy = single_path / pvid = none
hdisk10 -> lun_id = 0x0009000000000000 / reserve_policy = single_path / pvid = none
hdisk11 -> lun_id = 0x000a000000000000 / reserve_policy = single_path / pvid = none
hdisk12 -> lun_id = 0x000b000000000000 / reserve_policy = single_path / pvid = none
hdisk13 -> lun_id = 0x000c000000000000 / reserve_policy = single_path / pvid = none
hdisk14 -> lun_id = 0x000d000000000000 / reserve_policy = single_path / pvid = none
hdisk15 -> lun_id = 0x000e000000000000 / reserve_policy = single_path / pvid = none
hdisk16 -> lun_id = 0x000f000000000000 / reserve_policy = single_path / pvid = none
hdisk17 -> lun_id = 0x0010000000000000 / reserve_policy = single_path / pvid = none
hdisk18 -> lun_id = 0x0011000000000000 / reserve_policy = single_path / pvid = none
hdisk19 -> lun_id = 0x0012000000000000 / reserve_policy = single_path / pvid = none
hdisk20 -> lun_id = 0x0013000000000000 / reserve_policy = single_path / pvid = none
hdisk21 -> lun_id = 0x0014000000000000 / reserve_policy = single_path / pvid = none
hdisk22 -> lun_id = 0x0015000000000000 / reserve_policy = single_path / pvid = none
hdisk23 -> lun_id = 0x0016000000000000 / reserve_policy = single_path / pvid = none
hdisk24 -> lun_id = 0x0017000000000000 / reserve_policy = single_path / pvid = none
hdisk25 -> lun_id = 0x0018000000000000 / reserve_policy = single_path / pvid = none
hdisk26 -> lun_id = 0x0019000000000000 / reserve_policy = single_path / pvid = none
hdisk27 -> lun_id = 0x001a000000000000 / reserve_policy = single_path / pvid = none
hdisk28 -> lun_id = 0x001b000000000000 / reserve_policy = single_path / pvid = none
hdisk29 -> lun_id = 0x001c000000000000 / reserve_policy = single_path / pvid = none
hdisk30 -> lun_id = 0x001d000000000000 / reserve_policy = single_path / pvid = none
hdisk31 -> lun_id = 0x001e000000000000 / reserve_policy = single_path / pvid = none
hdisk32 -> lun_id = 0x001f000000000000 / reserve_policy = single_path / pvid = none
hdisk33 -> lun_id = 0x0028000000000000 / reserve_policy = single_path / pvid = 000c84c103c1d4480000000000000000
hdisk34 -> lun_id = 0x0020000000000000 / reserve_policy = single_path / pvid = none
hdisk35 -> lun_id = 0x0021000000000000 / reserve_policy = single_path / pvid = none
hdisk36 -> lun_id = 0x0022000000000000 / reserve_policy = single_path / pvid = none
hdisk37 -> lun_id = 0x0023000000000000 / reserve_policy = single_path / pvid = none
hdisk38 -> lun_id = 0x0024000000000000 / reserve_policy = single_path / pvid = none
hdisk39 -> lun_id = 0x0025000000000000 / reserve_policy = single_path / pvid = none
hdisk40 -> lun_id = 0x0026000000000000 / reserve_policy = single_path / pvid = none
hdisk41 -> lun_id = 0x0027000000000000 / reserve_policy = single_path / pvid = none

As you can see from the output above, when we match the LUN IDs of hdisk2 and hdisk3 on node1 against the LUN IDs on node2, they correspond to hdisk2 and hdisk3 on node2 as well. In this case the LUN IDs and hdisk names match on both nodes, but that is not always so; the hdisk names do not have to be the same on both servers. What matters is that the LUN IDs match: on another occasion, hdisk2 on node1 could just as well correspond to hdisk18 on node2.
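Instead of matching the LUN IDs by eye, a small ksh helper like the one below can do the lookup for you. This is only a sketch (find_hdisk_by_lunid is a made-up name, and it relies on the same lsattr output format as the loops above): give it the lun_id value taken from node1 and it prints the matching hdisk on the node you run it on, together with the major, minor numbers you will need later.

#!/usr/bin/ksh
# find_hdisk_by_lunid - print the hdisk(s) whose lun_id attribute matches the argument
# usage: ./find_hdisk_by_lunid 0x0001000000000000
LUNID=$1
lsdev -Cc disk | awk '{print $1}' | while read DISK
do
  LID=$(lsattr -El $DISK -a lun_id 2>/dev/null | awk '{print $2}')
  if [ "$LID" = "$LUNID" ]
  then
    echo "$LUNID -> $DISK"
    ls -l /dev/$DISK    # shows the major, minor numbers to use when recreating the device
  fi
done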

In our case, hdisk2 and hdisk3 on node1 are the same disks as hdisk2 and hdisk3 on node2.
We then need to find the major and minor numbers of those disks on node2.
[root@srvdb02]:/home/root > ls -l /dev/hdisk[2,3]
brw-------    1 root     system       21,  4 Jun 14 15:34 /dev/hdisk2
brw-------    1 root     system       21,  5 Jun 14 15:34 /dev/hdisk3

Delete old ocr disk device definitions
[root@srvdb02]:/home/root > rm /dev/ocr_disk1
[root@srvdb02]:/home/root > rm /dev/ocr_disk2

Recreate ocr disks with the correct disk major,minor numbers
[root@srvdb02]:/home/root > mknod /dev/ocr_disk1 c 21 4
[root@srvdb02]:/home/root > mknod /dev/ocr_disk2 c 21 5
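Keep in mind that mknod does not carry over the ownership and permissions of the old device files; the original OCR devices were owned by root:dba with mode 640 (crw-r-----), as in the listings above, so you will most probably need to set these again on the recreated devices.

# restore the ownership and permissions the OCR devices had before (root:dba, 640)
chown root:dba /dev/ocr_disk1 /dev/ocr_disk2
chmod 640 /dev/ocr_disk1 /dev/ocr_disk2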

After completing this configuration, you can first check the status and then try to restart the CRS services
[root@srvdb02]:/home/root > crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
[root@srvdb02]:/home/root > crsctl start crs
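It can take a little while for the CRS daemons to come up after "crsctl start crs", so repeat the checks until everything on node2 is back ONLINE (these are the same commands used earlier in this post).

# repeat until the CSS/CRS/EVM daemons report healthy
crsctl check crs

# then confirm that the node2 resources are ONLINE again
crsstat srvdb02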

4 comments:

sbahi said...

We have the same problem that you mentioned in your blog.


Please can you say if the problem was solved?
Can we delete the old OCR disk device definitions and recreate them without losing data?

sbahi said...

Thank you for the Solution,
It Worked for us.

You have to change the reserve_policy of the disks on each node, from "single_path" to "no_reserve".

example:

On IBM storage:
chdev -l hdisk3 -a reserve_policy=no_reserve

On EMC:
chdev -l hdisk3 -a reserve_lock=no

Mike said...

Thank you for writing the solution! I tried it and it worked great for me. You have great experience. Kudos to you!

Ural Ural said...

Thank you all guys.
Knowledge spreads by sharing.
Cheers,
Ural