Recently I ran into a problem with Oracle 11g Release 2 RAC. I observed this on a system patched to PSU6 and it looks like a bug to me. But the interesting part isn’t the problem – it’s an impressive and creative workaround that my colleague found over the weekend. I should add that this teammate doesn’t have much background with Oracle RAC, though he does have lots of experience with other technologies. His email this weekend surprised me and also gave me a good laugh – hope you find it equally useful and enjoyable!
The problem originated with a requirement I was given when designing this particular cluster system: I was asked to run Data Guard traffic over the backup network instead of the public network. This sounds simple enough if you haven’t worked with RAC. But if you’ve worked with Oracle clusters you realize that nothing is simple anymore. (A big reason I often encourage people to wait on moving to RAC, especially if the main driver is high availability…)
In an Oracle cluster, networks aren’t just networks. Each component (listeners & ports, IPs, subnets) is a “resource” that must be defined and managed by the cluster management software and must not be tinkered with outside of the cluster management software. I’ve seen many occasions where sysadmins new to RAC were surprised when their server would suddenly reboot itself after they stopped and restarted the network. Welcome to the newly complicated world of clusters!
Data Guard, of course, needs to use SQLNet connections in both directions between the two mirrored databases: from primary to standby to ship changes and from standby to primary to retrieve any gaps of missing changes. SQLNet connections require listeners. On a RAC cluster with IP networking you must always connect to the listener by using a special virtual IP (VIP). And the official Oracle documentation seems to support only a single VIP for each node on a single public network. It doesn’t give any clues about having listeners (and VIPs) on a second network.
However, with a little searching I found notes 1063571.1 and 1349977.1 on Oracle’s support site, which have instructions for adding a second network. FYI, there are also a few blogs (like Linda Smith’s blog) which have published the general process outlined by these notes. This is the process I followed to set up listeners on a second network for our cluster. But this is where things got interesting. After adding the second network I proceeded to add a new node to the cluster… FAIL! And the root cause seems to be that adding a node simply breaks if there’s a listener on a second network! More specifically, root.sh – which actually joins the new node into the cluster – fails.
(root)# sh /u01/grid/oracle/product/11.2.0/grid_1/root.sh
Performing root user operation for Oracle 11g
The following environment variables are set as:
ORACLE_OWNER= grid
ORACLE_HOME= /u01/grid/oracle/product/11.2.0/grid_1
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/grid/oracle/product/11.2.0/grid_1/crs/install/crsconfig_params
User ignored Prerequisites during installation
OLR initialization - successful
Adding Clusterware entries to upstart
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node collabn1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 11g Release 2.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
/u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3 ... failed
/u01/grid/oracle/product/11.2.0/grid_1/perl/bin/perl -I/u01/grid/oracle/product/11.2.0/grid_1/perl/lib -I/u01/grid/oracle/product/11.2.0/grid_1/crs/install /u01/grid/oracle/product/11.2.0/grid_1/crs/install/rootcrs.pl execution failed
The clusterware root.sh creates a log file under the cfgtoollogs/crsconfig directory in the grid home. Looking a little deeper, we can see that this was a fatal error and that the cluster setup process actually died:
(root)# less /u01/grid/oracle/product/11.2.0/grid_1/cfgtoollogs/crsconfig/rootcrs_collabn3.log
2013-11-01 15:00:33: Running as user grid: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3
2013-11-01 15:00:33: s_run_as_user2: Running /bin/su grid -c ' /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3 '
2013-11-01 15:00:37: Removing file /tmp/filed7f9vT
2013-11-01 15:00:37: Successfully removed file: /tmp/filed7f9vT
2013-11-01 15:00:37: /bin/su exited with rc=1
2013-11-01 15:00:37: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3 ... failed
2013-11-01 15:00:37: Running as user grid: /u01/grid/oracle/product/11.2.0/grid_1/bin/cluutil -ckpt -oraclebase /oracle/grid -writeckpt -name ROOTCRS_NODECONFIG -state FAIL
2013-11-01 15:00:37: s_run_as_user2: Running /bin/su grid -c ' /u01/grid/oracle/product/11.2.0/grid_1/bin/cluutil -ckpt -oraclebase /oracle/grid -writeckpt -name ROOTCRS_NODECONFIG -state FAIL '
2013-11-01 15:00:37: Removing file /tmp/fileRpYFXs
2013-11-01 15:00:37: Successfully removed file: /tmp/fileRpYFXs
2013-11-01 15:00:37: /bin/su successfully executed
2013-11-01 15:00:37: Succeeded in writing the checkpoint:'ROOTCRS_NODECONFIG' with status:FAIL
2013-11-01 15:00:37: CkptFile: /u01/grid/oracle/Clusterware/ckptGridHA_collabn3.xml
2013-11-01 15:00:37: Sync the checkpoint file '/u01/grid/oracle/Clusterware/ckptGridHA_collabn3.xml'
2013-11-01 15:00:37: Sync '/u01/grid/oracle/Clusterware/ckptGridHA_collabn3.xml' to the physical disk
2013-11-01 15:00:37: ###### Begin DIE Stack Trace ######
2013-11-01 15:00:37:    Package          File                 Line  Calling
2013-11-01 15:00:37:    ---------------  -------------------- ----  ----------
2013-11-01 15:00:37: 1: main             rootcrs.pl            387  crsconfig_lib::dietrap
2013-11-01 15:00:37: 2: crsconfig_lib    crsconfig_lib.pm     9124  main::__ANON__
2013-11-01 15:00:37: 3: crsconfig_lib    crsconfig_lib.pm     9082  crsconfig_lib::configNode
2013-11-01 15:00:37: 4: main             rootcrs.pl            902  crsconfig_lib::perform_configNode
2013-11-01 15:00:37: ####### End DIE Stack Trace #######
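These logs get long, so here is a rough sketch of how one might pull out just the interesting lines. The function name rootcrs_failures is made up for illustration, and the two patterns are simply the strings seen in the excerpt above ("... failed" and "with status:FAIL"); other failure modes may well log something different.

```shell
#!/bin/sh
# Sketch: filter a rootcrs log down to the lines that mark a failure.
# rootcrs_failures is a hypothetical helper name, not an Oracle tool.
# The patterns are taken only from the log excerpt in this post.
rootcrs_failures() {
    # Reads log text on stdin and prints lines where a command was
    # reported as "... failed" or a checkpoint was written with status:FAIL.
    grep -E '\.\.\. failed|with status:FAIL'
}
```

It would be used along the lines of: rootcrs_failures < /u01/grid/oracle/product/11.2.0/grid_1/cfgtoollogs/crsconfig/rootcrs_collabn3.log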
And finally – if we just manually try to run the listener startup command, it becomes obvious what the problem is.
(root)# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3
PRCC-1015 : LISTENER was already running on collabn3
PRCR-1004 : Resource ora.LISTENER.lsnr is already running
PRCR-1013 : Failed to start resource ora.LISTENER_BKP.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER_BKP.lsnr on node collabn3
CRS-2805: Unable to start 'ora.LISTENER_BKP.lsnr' because it has a 'hard' dependency on resource type 'ora.cluster_vip_net2.type' and no resource of that type can satisfy the dependency
CRS-2525: All instances of the resource 'ora.collabn1-bvip.vip' are already running; relocate is not allowed because the force option was not specified
CRS-2525: All instances of the resource 'ora.collabn2-bvip.vip' are already running; relocate is not allowed because the force option was not specified
CRS-2525: All instances of the resource 'ora.collabn3-bvip.vip' are already running; relocate is not allowed because the force option was not specified
[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config listener
Name: LISTENER
Network: 1, Owner: grid
Home:
End points: TCP:1521
Name: LISTENER_BKP
Network: 2, Owner: grid
Home:
End points: TCP:1522
[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config network
Network exists: 1/192.168.20.0/255.255.254.0/bond0, type static
Network exists: 2/192.168.220.0/255.255.254.0/bond1, type static
[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config vip -n collabn1
VIP exists: /192.168.221.130/192.168.221.130/192.168.220.0/255.255.254.0/bond1, hosting node collabn1
VIP exists: /collabn1-vip/192.168.21.130/192.168.20.0/255.255.254.0/bond0, hosting node collabn1
[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config vip -n collabn3
VIP exists: /collabn3-vip/192.168.21.171/192.168.20.0/255.255.254.0/bond0, hosting node collabn3
You can see from the excerpts above that the Oracle CRS root.sh script has failed to create the VIP for the second network: collabn1 has a VIP on both subnets, while collabn3 only has one on the public subnet. This looks like a simple bug to me, and it should be pretty easy to reproduce too. This was where I left the case last Friday afternoon.
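The same comparison can be scripted. The following is only a sketch: has_vip_on_subnet is a hypothetical helper, and it assumes the "VIP exists: /name/ip/subnet/netmask/interface" line layout shown in the srvctl config vip output above, which may differ in other releases.

```shell
#!/bin/sh
# Sketch: decide whether a node already has a VIP on a given subnet.
# has_vip_on_subnet is a made-up helper name. It assumes the
# "VIP exists: /name/ip/subnet/netmask/iface" layout seen in this post.
has_vip_on_subnet() {
    # $1    = subnet to look for, e.g. 192.168.220.0
    # stdin = output of: srvctl config vip -n <node>
    # -F treats the subnet as a literal string, so the dots are not
    # regex wildcards; the surrounding slashes avoid partial matches.
    grep 'VIP exists' | grep -qF "/$1/"
}
```

For example, piping the output of srvctl config vip -n collabn3 into has_vip_on_subnet 192.168.220.0 should return a non-zero status for the output shown above, since that node only lists a VIP on 192.168.20.0.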
This weekend I received the following email from my coworker:
I spent a little bit of time digging into what this Oracle VIP is and if there was a way to fool it. Since this is not a standard OS definable thing, I checked to see if the srvctl command already existed on the machine. Since it was there, I gave it a try and it wouldn’t work because you had removed the clusterware. So I tried this:
– Opened 2 windows
– In Window 1 ran the command: /u01/grid/oracle/product/11.2.0/grid_1/root.sh
– In window 2 I kept trying the command: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl add vip -n collabn3 -A 192.168.221.170/255.255.254.0 -k 2
– Eventually the clusterware came up and the node was defined in the cluster and services were fine and the command finally succeeded.
I assume this means everything went fine. Check it out and let me know if it really is in the cluster.
And there you have it. I had a look Monday morning and sure enough it looked like everything succeeded. Creative, brilliant solution from one of my coworkers! Of course we will still file a bug report with Oracle and work to get it resolved, but this seemed worth sharing in the meantime.
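The "kept trying the command" step in the email was done by hand, but it boils down to a retry loop. Here is a minimal sketch of that idea. The retry function and the MAX_TRIES and DELAY settings are my own illustrative names and defaults, not part of srvctl or root.sh, and I have not timed how long the window actually stays open.

```shell
#!/bin/sh
# Sketch: re-run a command until it succeeds or we run out of attempts.
# retry, MAX_TRIES and DELAY are hypothetical names chosen for this example.
MAX_TRIES=${MAX_TRIES:-60}   # give up after this many failed attempts
DELAY=${DELAY:-10}           # seconds to wait between attempts

retry() {
    tries=0
    # "$@" is the command and its arguments, run exactly as given.
    until "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$MAX_TRIES" ]; then
            echo "giving up after $tries attempts: $*" >&2
            return 1
        fi
        sleep "$DELAY"
    done
}
```

With root.sh running in one window, the second window would then run something like: retry /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl add vip -n collabn3 -A 192.168.221.170/255.255.254.0 -k 2. The cap on attempts is there so the loop ends if root.sh fails for some unrelated reason.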