Listener Error from addNode.sh with Second Network

Posted by Jeremy ⋅ November 4, 2013

Recently I ran into a problem with 11.2.0.3 RAC. I observed this on a system patched to PSU6 and it looks like a bug to me. But the interesting part isn’t the problem; it’s an impressive and creative workaround that my colleague found over the weekend. I should add that this teammate doesn’t have much background with Oracle RAC, though he does have lots of experience with other technologies. His email this weekend surprised me and also gave me a good laugh. Hope you find it equally useful and enjoyable!

The problem originated with a requirement I was given when designing this particular cluster system: I was asked to run Data Guard traffic over the backup network instead of the public network. This sounds simple enough if you haven’t worked with RAC. But if you’ve worked with Oracle clusters you realize that nothing is simple anymore. (A big reason I often encourage people to wait on moving to RAC, especially if the main driver is high availability…)

In an Oracle cluster, networks aren’t just networks. Each component (listeners & ports, IPs, subnets) is a “resource” that must be defined and managed by the cluster management software and must not be tinkered with outside of the cluster management software. I’ve seen many occasions where sysadmins new to RAC were surprised when their server would suddenly reboot itself after they stopped and restarted the network. Welcome to the newly complicated world of clusters!

Data Guard, of course, needs to use SQLNet connections in both directions between the two mirrored databases: from primary to standby to ship changes and from standby to primary to retrieve any gaps of missing changes. SQLNet connections require listeners. On a RAC cluster with IP networking you must always connect to the listener by using a special virtual IP. And the official Oracle Documentation only seems to support having a single Virtual IP for each node on a single public network. They don’t give any clues about having listeners (and VIPs) on a second network.

However, with a little searching I found notes 1063571.1 and 1349977.1 on Oracle’s support site, which have instructions to add a second network. FYI, there are also a few blogs (like Linda Smith’s blog) which have published the general process outlined by these notes. This is the process I followed to set up listeners on a second network for our cluster.

But this is where things got interesting. After adding the second network I proceeded to add a new node to the cluster… FAIL! And the root cause seems to be that adding a node simply breaks if there’s a listener on a second network! More specifically, root.sh (which actually joins the new node into the cluster) fails.

(root)# sh /u01/grid/oracle/product/11.2.0/grid_1/root.sh

Performing root user operation for Oracle 11g

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/grid/oracle/product/11.2.0/grid_1
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/grid/oracle/product/11.2.0/grid_1/crs/install/crsconfig_params
User ignored Prerequisites during installation
OLR initialization - successful
Adding Clusterware entries to upstart
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node collabn1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 11g Release 2.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
/u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3 ... failed
/u01/grid/oracle/product/11.2.0/grid_1/perl/bin/perl -I/u01/grid/oracle/product/11.2.0/grid_1/perl/lib -I/u01/grid/oracle/product/11.2.0/grid_1/crs/install /u01/grid/oracle/product/11.2.0/grid_1/crs/install/rootcrs.pl execution failed

The clusterware root.sh creates a log file under the cfgtoollogs/crsconfig directory in the grid home. Looking a little deeper, we can see that this was a fatal error and that the cluster setup process actually died:

(root)# less /u01/grid/oracle/product/11.2.0/grid_1/cfgtoollogs/crsconfig/rootcrs_collabn3.log

2013-11-01 15:00:33: Running as user grid: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3
2013-11-01 15:00:33: s_run_as_user2: Running /bin/su grid -c ' /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3 '
2013-11-01 15:00:37: Removing file /tmp/filed7f9vT
2013-11-01 15:00:37: Successfully removed file: /tmp/filed7f9vT
2013-11-01 15:00:37: /bin/su exited with rc=1

2013-11-01 15:00:37: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3 ... failed
2013-11-01 15:00:37: Running as user grid: /u01/grid/oracle/product/11.2.0/grid_1/bin/cluutil -ckpt -oraclebase /oracle/grid -writeckpt -name ROOTCRS_NODECONFIG -state FAIL
2013-11-01 15:00:37: s_run_as_user2: Running /bin/su grid -c ' /u01/grid/oracle/product/11.2.0/grid_1/bin/cluutil -ckpt -oraclebase /oracle/grid -writeckpt -name ROOTCRS_NODECONFIG -state FAIL '
2013-11-01 15:00:37: Removing file /tmp/fileRpYFXs
2013-11-01 15:00:37: Successfully removed file: /tmp/fileRpYFXs
2013-11-01 15:00:37: /bin/su successfully executed

2013-11-01 15:00:37: Succeeded in writing the checkpoint:'ROOTCRS_NODECONFIG' with status:FAIL
2013-11-01 15:00:37: CkptFile: /u01/grid/oracle/Clusterware/ckptGridHA_collabn3.xml
2013-11-01 15:00:37: Sync the checkpoint file '/u01/grid/oracle/Clusterware/ckptGridHA_collabn3.xml'
2013-11-01 15:00:37: Sync '/u01/grid/oracle/Clusterware/ckptGridHA_collabn3.xml' to the physical disk
2013-11-01 15:00:37: ###### Begin DIE Stack Trace ######
2013-11-01 15:00:37:     Package         File                 Line Calling
2013-11-01 15:00:37:     --------------- -------------------- ---- ----------
2013-11-01 15:00:37:  1: main            rootcrs.pl            387 crsconfig_lib::dietrap
2013-11-01 15:00:37:  2: crsconfig_lib   crsconfig_lib.pm     9124 main::__ANON__
2013-11-01 15:00:37:  3: crsconfig_lib   crsconfig_lib.pm     9082 crsconfig_lib::configNode
2013-11-01 15:00:37:  4: main            rootcrs.pl            902 crsconfig_lib::perform_configNode
2013-11-01 15:00:37: ####### End DIE Stack Trace #######
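The rootcrs log is long, and the interesting lines are easy to miss. As a minimal sketch, assuming the log keeps the wording shown in the excerpt above, a small helper can pull out just the lines that mark a failed step. The function name `show_rootcrs_failures` is made up for this example and is not part of any Oracle tooling.

```shell
#!/bin/sh
# show_rootcrs_failures LOGFILE   (hypothetical helper, not an Oracle tool)
# Prints, with line numbers, the log lines that mark a failed step:
#   - a command reported as "... failed"
#   - a helper that exited with a non-zero return code
#   - the start of the Perl DIE stack trace
# The patterns are taken from the log excerpt above and are an assumption
# about what this version of rootcrs.pl writes.
show_rootcrs_failures() {
    grep -n -E '\.\.\. failed|exited with rc=[1-9]|Begin DIE Stack Trace' "$1"
}

# Example, using the log path from this post:
# show_rootcrs_failures /u01/grid/oracle/product/11.2.0/grid_1/cfgtoollogs/crsconfig/rootcrs_collabn3.log
```

The line numbers printed by `grep -n` make it easy to jump to the same spot in `less` and read the surrounding context.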

And finally – if we just manually try to run the listener startup command, it becomes obvious what the problem is.

(root)# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl start listener -n collabn3
PRCC-1015 : LISTENER was already running on collabn3
PRCR-1004 : Resource ora.LISTENER.lsnr is already running
PRCR-1013 : Failed to start resource ora.LISTENER_BKP.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER_BKP.lsnr on node collabn3
CRS-2805: Unable to start 'ora.LISTENER_BKP.lsnr' because it has a 'hard' dependency on resource type 'ora.cluster_vip_net2.type' and no resource of that type can satisfy the dependency
CRS-2525: All instances of the resource 'ora.collabn1-bvip.vip' are already running; relocate is not allowed because the force option was not specified
CRS-2525: All instances of the resource 'ora.collabn2-bvip.vip' are already running; relocate is not allowed because the force option was not specified
CRS-2525: All instances of the resource 'ora.collabn3-bvip.vip' are already running; relocate is not allowed because the force option was not specified

[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config listener
Name: LISTENER
Network: 1, Owner: grid
Home: 
End points: TCP:1521
Name: LISTENER_BKP
Network: 2, Owner: grid
Home: 
End points: TCP:1522

[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config network
Network exists: 1/192.168.20.0/255.255.254.0/bond0, type static
Network exists: 2/192.168.220.0/255.255.254.0/bond1, type static

[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config vip -n collabn1
VIP exists: /192.168.221.130/192.168.221.130/192.168.220.0/255.255.254.0/bond1, hosting node collabn1
VIP exists: /collabn1-vip/192.168.21.130/192.168.20.0/255.255.254.0/bond0, hosting node collabn1

[root@collabn3 ~]# /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl config vip -n collabn3
VIP exists: /collabn3-vip/192.168.21.171/192.168.20.0/255.255.254.0/bond0, hosting node collabn3

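You can see from the excerpts above that the Oracle CRS root.sh script has failed to create the VIP for the second network: collabn1 has a VIP on both bond0 and bond1, but collabn3 only has one on bond0. With no network 2 VIP on collabn3, LISTENER_BKP cannot satisfy its hard dependency and the listener start fails. Looks like a simple bug to me, and it should be pretty easy to reproduce too. This was where I left the case last Friday afternoon.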
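Spotting the missing VIP by eye means comparing `srvctl config vip` output node by node. Here is a minimal sketch of automating that check. It assumes the slash-separated `VIP exists:` layout shown in the output above (name, address, subnet, netmask, interface); other releases may print it differently. The function name `has_vip_on_subnet` is invented for this example.

```shell
#!/bin/sh
# has_vip_on_subnet SUBNET   (hypothetical helper)
# Reads `srvctl config vip -n <node>` output on stdin and exits 0 if any
# "VIP exists:" line is defined on SUBNET, otherwise exits 1.
# Assumed field layout when split on "/":
#   $1 = "VIP exists: "   $2 = VIP name   $3 = VIP address
#   $4 = subnet           $5 = netmask    $6 = interface and hosting node
has_vip_on_subnet() {
    awk -F/ -v subnet="$1" '
        /^VIP exists:/ && $4 == subnet { found = 1 }
        END { exit !found }
    '
}

# Example, using the node name and network 2 subnet from this post:
# srvctl config vip -n collabn3 | has_vip_on_subnet 192.168.220.0 \
#     || echo "collabn3 has no VIP on network 2"
```

Run against the output above, the check would pass for collabn1 and fail for collabn3, which matches what the listener error is complaining about.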

This weekend I received the following email from my coworker:

Jeremy –

I spent a little bit of time digging into what this Oracle VIP is and if there was a way to fool it. Since this is not a standard OS definable thing, I checked to see if the srvctl command already existed on the machine. Since it was there, I gave it a try and it wouldn’t work because you had removed the clusterware. So I tried this:

– Opened 2 windows
– In Window 1 ran the command: /u01/grid/oracle/product/11.2.0/grid_1/root.sh
– In window 2 I kept trying the command: /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl add vip -n collabn3 -A 192.168.221.170/255.255.254.0 -k 2
– Eventually the clusterware came up and the node was defined in the cluster and services were fine and the command finally succeeded.

…

I assume this means everything went fine. Check it out and let me know if it really is in the cluster.

And there you have it. I had a look Monday morning and sure enough it looked like everything succeeded. Creative, brilliant solution from one of my coworkers! Of course we will still file a bug report with Oracle and work to get it resolved, but this seemed worth sharing in the meantime.
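The "kept trying the command" step from the email can be sketched as a small retry loop, so nobody has to sit in the second window pressing up-arrow. This is only an illustration of the manual workaround, not a supported procedure. The function name `retry_until_ok` and the attempt count and delay in the example are assumptions of mine; the `srvctl add vip` command is the one from the email.

```shell
#!/bin/sh
# retry_until_ok MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it exits 0 or MAX_TRIES attempts have been used.
# Stops at the first success, so the command is never run again after it works.
retry_until_ok() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        # Only wait if another attempt is coming.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "gave up after $max_tries attempts" >&2
    return 1
}

# Example for "window 2", started while root.sh runs in "window 1".
# The 120 tries and 5 second delay are arbitrary illustration values:
# retry_until_ok 120 5 /u01/grid/oracle/product/11.2.0/grid_1/bin/srvctl \
#     add vip -n collabn3 -A 192.168.221.170/255.255.254.0 -k 2
```

The loop only mimics the timing trick: the VIP has to be added after the clusterware stack is up on the new node but before root.sh reaches the listener start, and a tight retry is one way to land in that window.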