And the saga continues… On Tuesday I mentioned a PRKP-1001 from rconfig – however, I now think the root cause was something quite different from my original suspicion. Yesterday we discovered a rather major issue on the server: when trying to configure EM dbcontrol I started receiving errors indicating that the ports weren’t free. This didn’t make sense because nothing else was using those ports.
A quick glance at netstat revealed that the ports had been used by outbound connections and were sitting in TIME_WAIT (meaning the connections had already closed; the entries linger in the kernel’s table for a fixed period, which on Linux is 60 seconds hard-coded as TCP_TIMEWAIT_LEN – the often-cited /proc/sys/net/ipv4/tcp_fin_timeout actually governs FIN-WAIT-2, not TIME_WAIT). Some process was opening about 60,000 connections in a matter of seconds, exhausting every available ephemeral port and effectively launching a DoS attack on the box itself. So it was time to do a little investigation. We finally cracked the case this morning, and I think the troubleshooting process we went through is worth writing up since it could help with other, similar problems.
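(For reference, the ephemeral port range and the TIME_WAIT bookkeeping are easy to inspect on Linux; here’s a quick sketch – the exact values vary by distribution, and Oracle’s pre-install steps typically widen the port range to something like 1024-65000, which would line up with the ~60,000 figure above.)

# Ephemeral (client-side) port range the kernel hands out for outbound connections
cat /proc/sys/net/ipv4/ip_local_port_range
# tcp_fin_timeout governs FIN-WAIT-2; TIME_WAIT itself is a fixed 60s on Linux
cat /proc/sys/net/ipv4/tcp_fin_timeout
# Count of sockets currently parked in TIME_WAIT
netstat -tan | grep -c TIME_WAIT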
So where did we start? I won’t be terribly verbose; I’ll mostly just show the commands I used to track it down, along with their output. You can adapt these methods to your own problem. :)
Getting Started: Gathering Information
One of the first things I noticed was an odd problem trying to start the instance on the local node. srvctl was returning a few messages that I don’t normally see.
[oracle@lab01 dbs]$ srvctl start instance -d dbtest -i dbtest1
PRKP-1001 : Error starting instance dbtest1 on node lab01
CRS-0215: Could not start resource 'ora.dbtest.dbtest1.inst'.
[oracle@lab01 dbs]$ srvctl status database -d dbtest
PRKO-2015 : Error in checking condition of instance on node: lab01
Instance dbtest2 is running on node lab02
As I mentioned before, I eventually discovered a VERY large number of TIME_WAIT connections on the system. Using telnet I could prove that this problem was actually preventing programs from opening TCP connections. The telnet error was intermittent; sometimes it worked and sometimes I got the error shown below.
[oracle@lab01 logs]$ netstat -tapn|grep TIME_WAIT|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
63897
[oracle@lab01 ~]$ telnet localhost 6101
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Cannot assign requested address
telnet: Unable to connect to remote host: Cannot assign requested address
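(If you’d rather not keep mashing telnet by hand, a small loop makes the intermittent failures obvious – a rough sketch, assuming a netcat build that supports -z port-scan mode.)

# Hammer the port a few times; while ephemeral ports are exhausted some
# attempts will fail with "Cannot assign requested address"
for i in $(seq 1 20); do
  nc -z -w1 127.0.0.1 6101 && echo "attempt $i: connected" || echo "attempt $i: FAILED"
done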
The next step was to try and find out exactly what was going on with these ports. First I had a look at the information available from netstat.
[oracle@lab01 conf]$ netstat -tapn|grep TIME_WAIT|head
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.0.1:54551 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54550 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54549 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54548 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54547 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54546 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54545 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54544 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54559 127.0.0.1:6101 TIME_WAIT -
tcp 0 0 127.0.0.1:54558 127.0.0.1:6101 TIME_WAIT -
Investigating ONS
The connections were all being made to port 6101 – so what’s listening on that port? Turns out that it’s ONS – Oracle Notification Service! (This service publishes FAN events to subscribing processes such as Fast Connection Failover clients.)
[oracle@lab01 opmn]$ netstat -tapn|grep LISTEN|egrep '6[12]0'
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.0.1:6101 0.0.0.0:* LISTEN 27417/ons
tcp 0 0 0.0.0.0:6200 0.0.0.0:* LISTEN 27417/ons
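As a sanity check you can also ask ONS directly what it thinks it’s doing – a quick sketch using onsctl from the home that owns the daemon (standard 10gR2 layout assumed):

# onsctl lives under $ORACLE_HOME/opmn/bin in 10gR2
$ORACLE_HOME/opmn/bin/onsctl ping    # is the ONS daemon up?
$ORACLE_HOME/opmn/bin/onsctl debug   # dump its current connections and topology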
So I could shut down ONS and the problem would go away, since no connections could be made to that port. But of course that’s not a feasible solution since we need ONS. So the next step was to try to learn a bit more about – and eventually find – the process that was making all the connections. The first thing I noticed was that upon starting ONS the connections on the system maxed out within about 5 seconds; whatever client process was opening the connections, it was really going crazy!
[oracle@lab01 opmn]$ crs_start ora.lab01.ons
Attempting to start `ora.lab01.ons` on member `lab01`
Start of `ora.lab01.ons` on member `lab01` succeeded.
[oracle@lab01 opmn]$ netstat -tapn|grep TIME_WAIT|wc -l
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
63925
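To watch the count climb in real time, the same one-liner can simply be wrapped in a loop – something like this sketch:

# Print a timestamped TIME_WAIT count once a second; with the rogue client
# running, the number races toward the top of the ephemeral port range
while true; do
  echo "$(date '+%H:%M:%S')  $(netstat -tan | grep -c TIME_WAIT)"
  sleep 1
done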
I also thought it’d be informative to quickly review the ONS config. In the process I actually stumbled on a misconfiguration – somehow the wrong values for the ONS remote port were stored in the OCR – but fixing that didn’t touch the root problem. I did go back and correct it later (a sketch of the fix follows the config output below), but for now let’s leave it alone since we’re focusing on something else.
[oracle@lab01 ~]$ ocrdump -stdout -keyname DATABASE.ONS_HOSTS
04/05/2007 08:56:42
ocrdump -stdout -keyname DATABASE.ONS_HOSTS
[DATABASE.ONS_HOSTS]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : dba}
[DATABASE.ONS_HOSTS.lab01!lab!ardentperf!com]
ORATEXT : lab01.lab.ardentperf.com
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : dba}
[DATABASE.ONS_HOSTS.lab01!lab!ardentperf!com.PORT]
ORATEXT : 6201
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : dba}
[DATABASE.ONS_HOSTS.lab02!lab!ardentperf!com]
ORATEXT : lab02.lab.ardentperf.com
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : dba}
[DATABASE.ONS_HOSTS.lab02!lab!ardentperf!com.PORT]
ORATEXT : 6201
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ,
OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : dba}
[oracle@lab01 ~]$ cd $ORACLE_HOME/opmn/conf
[oracle@lab01 conf]$ cat ons.config
localport=6101
remoteport=6200
loglevel=3
useocr=on
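For completeness, the OCR-side ONS entries are maintained with racgons in 10gR2. Here is roughly how the mismatch gets corrected later – a sketch only, with the hostnames taken from the dump above and the ports set to match ons.config; verify the syntax against the documentation for your patchset before running it.

# Run from the clusterware home as the clusterware owner
$ORA_CRS_HOME/bin/racgons remove_config lab01.lab.ardentperf.com:6201 lab02.lab.ardentperf.com:6201
$ORA_CRS_HOME/bin/racgons add_config lab01.lab.ardentperf.com:6200 lab02.lab.ardentperf.com:6200
# Confirm the change took
ocrdump -stdout -keyname DATABASE.ONS_HOSTS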
Deeper into Linux
Next let’s see if we can catch any processes connected to the port. Unfortunately I wasn’t able to spot the culprits despite many, many attempts. Two processes did come back – a listener (tnslsnr) and racgimon – but as you can see, when I issued the queries a few minutes apart these processes still had the same connections (same source ports), so they don’t appear to have reconnected at all.
[root@lab01 log]# netstat -tapn|grep '127.0.0.1:6101'|grep -v TIME_WAIT
tcp 0 0 127.0.0.1:6101 0.0.0.0:* LISTEN 16359/ons
tcp 0 0 127.0.0.1:6101 127.0.0.1:2884 ESTABLISHED 16359/ons
tcp 0 0 127.0.0.1:6101 127.0.0.1:2883 ESTABLISHED 16359/ons
tcp 0 0 127.0.0.1:2883 127.0.0.1:6101 ESTABLISHED 8636/tnslsnr
tcp 0 0 127.0.0.1:2884 127.0.0.1:6101 ESTABLISHED 13514/racgimon
[root@lab01 log]# netstat -tapn|grep '127.0.0.1:6101'|grep -v TIME_WAIT
tcp 0 0 127.0.0.1:6101 0.0.0.0:* LISTEN 16359/ons
tcp 0 0 127.0.0.1:6101 127.0.0.1:2884 ESTABLISHED 16359/ons
tcp 0 0 127.0.0.1:6101 127.0.0.1:2883 ESTABLISHED 16359/ons
tcp 0 0 127.0.0.1:2883 127.0.0.1:6101 ESTABLISHED 8636/tnslsnr
tcp 0 0 127.0.0.1:2884 127.0.0.1:6101 ESTABLISHED 13514/racgimon
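lsof offers a second opinion on socket ownership – a quick sketch (run as root so every process is visible). Note that TIME_WAIT sockets won’t show up here at all, since they no longer belong to any process – which is exactly why netstat printed “-” in the PID column for them.

# -n/-P: skip name and port resolution; -iTCP:6101: any TCP socket touching port 6101
lsof -nP -iTCP:6101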
Next I decided to actually observe what was going over the wire, in the hope of identifying what kind of process this might be. The tcpdump program takes care of this quite nicely on Linux.
[root@lab01 ~]# tcpdump tcp port 6101 -i lo -X -s 0 > jeremy.dmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
10235 packets captured
20470 packets received by filter
0 packets dropped by kernel
----
---- EXCERPT FROM JEREMY.DMP
----
11:07:41.775368 IP localhost.localdomain.5130 > localhost.localdomain.6101:
. ack 2 win 1024
0x0000: 4500 0034 313b 4000 4006 0b87 7f00 0001 E..41;@.@.......
0x0010: 7f00 0001 140a 17d5 8c3d 1fbf 8bf6 fe79 .........=.....y
0x0020: 8010 0400 af5c 0000 0101 080a 02cb 2ebe ...............
0x0030: 02cb 2ebe ....
11:07:41.775396 IP localhost.localdomain.5131 > localhost.localdomain.6101:
S 2358367471:2358367471(0) win 32767 <mss 16396,sackOK,timestamp 46870206 0,nop,wscale 5>
0x0000: 4500 003c fa51 4000 4006 4268 7f00 0001 E..<.Q@.@.Bh....
0x0010: 7f00 0001 140b 17d5 8c91 d4ef 0000 0000 ................
0x0020: a002 7fff d0bd 0000 0204 400c 0402 080a ..........@.....
0x0030: 02cb 2ebe 0000 0000 0103 0305 ............
11:07:41.775414 IP localhost.localdomain.6101 > localhost.localdomain.5131:
S 2349149105:2349149105(0) ack 2358367472 win 32767 <mss 16396,sackOK,timestamp 46870206 46870206,nop,wscale 5>
0x0000: 4500 003c 0000 4000 4006 3cba 7f00 0001 E..<..@.@.<.....
0x0010: 7f00 0001 17d5 140b 8c05 2bb1 8c91 d4f0 ..........+.....
0x0020: a012 7fff e76c 0000 0204 400c 0402 080a .....l....@.....
0x0030: 02cb 2ebe 02cb 2ebe 0103 0305 ............
11:07:41.775426 IP localhost.localdomain.5131 > localhost.localdomain.6101:
. ack 1 win 1024 <nop,nop,timestamp 46870206 46870206>
0x0000: 4500 0034 fa53 4000 4006 426e 7f00 0001 E..4.S@.@.Bn....
0x0010: 7f00 0001 140b 17d5 8c91 d4f0 8c05 2bb2 ..............+.
0x0020: 8010 0400 cc8e 0000 0101 080a 02cb 2ebe ................
0x0030: 02cb 2ebe ....
11:07:41.775462 IP localhost.localdomain.5131 > localhost.localdomain.6101:
P 1:49(48) ack 1 win 1024 <nop,nop,timestamp 46870206 46870206>
0x0000: 4500 0064 fa55 4000 4006 423c 7f00 0001 E..d.U@.@.B<....
0x0010: 7f00 0001 140b 17d5 8c91 d4f0 8c05 2bb2 ..............+.
0x0020: 8018 0400 fe58 0000 0101 080a 02cb 2ebe .....X..........
0x0030: 02cb 2ebe 504f 5354 202f 636f 6e6e 6563 ....POST./connec
0x0040: 7420 4854 5450 2f31 2e31 0d0a 5665 7273 t.HTTP/1.1..Vers
0x0050: 696f 6e3a 2033 0d0a 466f 726d 4661 6374 ion:.3..FormFact
0x0060: 6f72 3a20 or:.
11:07:41.775471 IP localhost.localdomain.6101 > localhost.localdomain.5131:
. ack 49 win 1024 <nop,nop,timestamp 46870206 46870206>
0x0000: 4500 0034 050c 4000 4006 37b6 7f00 0001 E..4..@.@.7.....
0x0010: 7f00 0001 17d5 140b 8c05 2bb2 8c91 d520 ..........+.....
0x0020: 8010 0400 cc5e 0000 0101 080a 02cb 2ebe .....^..........
0x0030: 02cb 2ebe ....
11:07:41.775486 IP localhost.localdomain.5131 > localhost.localdomain.6101:
F 49:49(0) ack 1 win 1024 <nop,nop,timestamp 46870206 46870206>
0x0000: 4500 0034 fa57 4000 4006 426a 7f00 0001 E..4.W@.@.Bj....
0x0010: 7f00 0001 140b 17d5 8c91 d520 8c05 2bb2 ..............+.
0x0020: 8011 0400 cc5d 0000 0101 080a 02cb 2ebe .....]..........
0x0030: 02cb 2ebe ....
11:07:41.775504 IP localhost.localdomain.6101 > localhost.localdomain.5131:
F 1:1(0) ack 50 win 1024 <nop,nop,timestamp 46870206 46870206>
0x0000: 4500 0034 050e 4000 4006 37b4 7f00 0001 E..4..@.@.7.....
0x0010: 7f00 0001 17d5 140b 8c05 2bb2 8c91 d521 ..........+....!
0x0020: 8011 0400 cc5c 0000 0101 080a 02cb 2ebe ...............
0x0030: 02cb 2ebe ....
11:07:41.775514 IP localhost.localdomain.5131 > localhost.localdomain.6101:
. ack 2 win 1024 <nop,nop,timestamp 46870206 46870206>
0x0000: 4500 0034 fa59 4000 4006 4268 7f00 0001 E..4.Y@.@.Bh....
0x0010: 7f00 0001 140b 17d5 8c91 d521 8c05 2bb3 ...........!..+.
0x0020: 8010 0400 cc5c 0000 0101 080a 02cb 2ebe ...............
0x0030: 02cb 2ebe ....
11:07:41.775544 IP localhost.localdomain.5132 > localhost.localdomain.6101:
S 2353466507:2353466507(0) win 32767 <mss 16396,sackOK,timestamp 46870206 0,nop,wscale 5>
0x0000: 4500 003c 2235 4000 4006 1a85 7f00 0001 E..<"5@.@.......
0x0010: 7f00 0001 140c 17d5 8c47 0c8b 0000 0000 .........G......
0x0020: a002 7fff 996b 0000 0204 400c 0402 080a .....k....@.....
0x0030: 02cb 2ebe 0000 0000 0103 0305 ............
Ah-ha… not sure if you saw it in there but this is the exact text that is being sent:
POST /connect HTTP/1.1
Version: 3
FormFactor:
This is in fact exactly what ONS expects from a client (it uses an HTTP-style wire protocol), which means the client is definitely some sort of Oracle process. In fact it smelled a lot like a listener, even though I thought I’d eliminated that. Right then was when we made the other breakthrough… Rob, one of the guys I was working with, suddenly remembered a problem he’d encountered with listeners and ONS. He jumped on Metalink and quickly came up with notes 372959.1 and 284602.1, which we hadn’t turned up before because we didn’t know to look for listener- and ONS-related problems.
This turned out to be the turning point… it was all downhill from here. Next we started looking at all the listeners on the box (there were several).
[root@lab01 ~]# ps axf|grep lsnr
14274 pts/2 S+ 0:00 _ grep lsnr
8636 ? Ssl 0:00 /opt/oracle/app/asm/10.2.0/bin/tnslsnr LISTENER_LAB01 -inherit
10500 ? Ss 0:00 /db5/oracle/db2test1db/10.2.0/bin/tnslsnr db2test1 -inherit
10501 ? S 0:00 _ /db5/oracle/db2test1db/10.2.0/bin/tnslsnr db2test1 -inherit
10503 ? S 0:00 _ /db5/oracle/db2test1db/10.2.0/bin/tnslsnr db2test1 -inherit
10502 ? S 8:14 _ /db5/oracle/db2test1db/10.2.0/bin/tnslsnr db2test1 -inherit
6328 ? Ss 0:00 /db5/oracle/db2test1ora/8.0.6/bin/tnslsnr APPS_db2test1 -inherit
[root@lab01 ~]# cd /opt/oracle/app/asm/10.2.0/opmn/conf/
[root@lab01 conf]# cat ons.config
localport=6102
remoteport=6201
loglevel=3
useocr=on
[root@lab01 conf]# cd /db5/oracle/db2test1db/10.2.0/opmn/conf/
[root@lab01 conf]# cat ons.config
localport=6101
remoteport=6200
loglevel=3
Now that we’d identified which listener’s Oracle home had ons.config pointing at port 6101, we ran a trace on that listener to see what it was doing.
[root@lab01 conf]# strace -p 10502 2>&1|grep 127.0.0.1
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Wow – it was connecting to the port at exactly the frequency we were looking for. A more detailed trace even revealed the very text we’d seen in the packet capture.
[root@lab01 conf]# strace -p 10502
connect(15, {sa_family=AF_INET, sin_port=htons(6101), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
setsockopt(15, SOL_SOCKET, SO_LINGER, {onoff=1, linger=900}, 8) = 0
setsockopt(15, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
open("/db5/oracle/db2test1db/10.2.0/opmn/conf/.formfactor", O_RDONLY) = -1 ENOENT (No such file
or directory)
send(15, "POST /connect HTTP/1.1rnVersion:"..., 48, 0) = 48
send(15, "", 3069884474, 0) = 0
shutdown(15, 2 /* send and receive */) = 0
close(15) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 15
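To put an actual number on the reconnect rate, strace can be narrowed to just the connect() calls with timestamps – a rough sketch (the PID is of course whatever your runaway process happens to be):

# -tt adds microsecond timestamps; -e trace=connect shows only connect() calls.
# The gaps between consecutive timestamps give connects per second.
strace -tt -e trace=connect -p 10502 2>&1 | grep 'htons(6101)' | head -20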
The Solution
We chose the first solution from note 284602.1 – renaming ons.config so that the listener no longer knew about ONS.
[oracle@lab01 bin]$ cd /db5/oracle/db2test1db/10.2.0/opmn/conf/
[oracle@lab01 conf]$ mv ons.config ons.config.renamed_itc_05-apr-2007_too_many_open_connections_problem
[oracle@lab01 ~]$ . db-db2test1.env
[oracle@lab01 ~]$ lsnrctl stop db2test1
[oracle@lab01 ~]$ lsnrctl start db2test1
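A more direct confirmation than watching the TIME_WAIT count drain would be to strace the freshly restarted listener and check that the connect() calls to 6101 are simply gone – a sketch:

# pgrep finds the new tnslsnr PID; a short strace should now show no 6101 traffic
strace -e trace=connect -p $(pgrep -f 'tnslsnr db2test1' | head -1) 2>&1 | grep 6101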
We waited a minute or two for the current TIME_WAIT entries to age out and then checked – and there were no more. Problem solved! And we got to do a little kernel snorkeling in the process. :)
Hey Jeremy. Reads like a good internal support ticket :)
excellent.
I will pass it on to the other blokes when I am back on deck next week.
You wrote “So I could shutdown ONS and the problem would go away – since no connections could be made on this port. But of course that’s not a feasible solution since we need ONS.”
And then you switch off ONS by renaming ons.config?
It is much more likely that you have an Application Server instance running on the same box, which causes the ONS issue because it uses the same ports as your database listener. This is also described in Note 297411.1.
I renamed the ons.config used by the listener, not ONS itself. Renaming this file just made the listener stop trying to connect to ONS.
There was definitely not an Application Server instance on the same box; it was a RAC node that was only running DB software. It was a clean, new OS installation and we had just finished installing the Oracle software ourselves.
Definitely a bug in Oracle 10.2.0.2; Doug Burns recently wrote about the same problem on an AIX single-instance deployment.
Was fighting the exact same issue last night – 52k+ connections from/to :6101
Love the www – everything you need is out there somewhere – you just need to know how to find it.
Great clean site btw.
Perfect match to our problems here. GREAT summary and references.
Thanks for putting up such a beautiful note. This solves the exact problem I was facing. Thanks again.
Thank you for the help. I appreciated the professionalism: you didn’t merely provide the solution, you explained step by step the strategy that led you to trace the origin of the problem and then proceed to its resolution.