Yesterday, I had a fight to the death with a nasty Oracle Net problem. The battle consumed a little more of the day than I intended… but it was worthwhile for the sweet taste of victory.
Everything started with this short, innocent-seeming instant message:
there is something wrong with a 10g server config. connecting to it from windows works, the linux instant client returns ORA-28547. the same linux client can connect to 9g installation though. google is not helpful, does anyone have an idea?
Seems like a simple question! But we ended up chatting on-and-off throughout the whole day as I helped troubleshoot the problem. We finally found the problem at the end of the day – and I promise you that it would have taken Oracle Support weeks (or even months) to find this problem. Yikes!
The client was php on a linux box. This client was connecting to the database server over a network connection which includes a VPN tunnel. The database is being upgraded from 9i to 10g – and the new database is already running on a new server. The client can connect to the 9i database but it can’t connect to the 10g database. The developer also has a windows box with an oracle client; this windows box can connect to the 10g database over the network link just fine. TNSNAMES is not used, and all connect strings are identical except for the IP addresses. The 11g instant client for Linux is being used.
Our investigation process involved a lot of trial-and-error. We repeated this cycle: gather data, postulate a theory, suggest an action. We knew that we were close because often the error messages would change – but they didn’t go away! For posterity (and google), here are the errors we encountered:
- ORA-28547: connection to server failed, probable Oracle Net admin error
- ORA-12170: TNS:Connect timeout occurred
- ORA-03113: end-of-file on communication channel
- WARNING: inbound connection timed out (ORA-3136)
- Fatal NI connect error 12170.
- TNS-12535: TNS:operation timed out / ns secondary err code: 12606
- ntt2err:Read unexpected EOF ERROR on 7
- ORA-12537 or TNS-12537: TNS: connection closed / ns2: 12560
We started with simple changes before looking into deep logs or tracefiles. I didn’t have any access to the actual system, just a chat connection to the developer. And of course I was hoping for a quick fix… though it didn’t end up happening. But FWIW, here’s everything we tried:
- Double-check that TNSNAMES or connection string is correct
- If RAC, then check REMOTE SERVER connect string in LSNRCTL output (this comes from local_listener/remote_listener settings)
- Add server hostname to local client DNS. (Connect string used IP, but server might pass back hostname for re-connect during connection process.)
- Check if DB server is behind a firewall, client seeing IP address different than server – see metalink note 270096.1 (can possibly address on 10g+ with DIRECT_HANDOFF_TTC_ in listener.ora or USE_SHARED_SOCKET in windows registry)
- Attempt to disable shared-socket/direct-handoff. (Trace revealed that the working 9i connection didn’t bequeath its connection but broken 10g connection did.)
- Downgraded instant client from 11g to 10g (to match target database)
- Check if there’s a software firewall active on the database server
To gather data, of course we first just captured error messages from the client. Later, we checked the server sqlnet.log, alert log, bdump/udump, and windows OS event log for messages. The developer I was working with was very savvy, so I actually had him generate some client trace files fairly early on – from both working and non-working connections.
We heavily leveraged the two working connections (linux->9i and windows->10g) by comparing tracefiles. In the end, these tracefiles tipped us off to the real problem.
I expected a quick, easy answer – but it wasn’t. It drove me crazy all day.
From the beginning it was obviously some kind of networking problem. The database was running on a windows server, so my first thought was that it was related to the re-connect that happens when the listener hands off the connection to the database process. But the trace file revealed that 10g was actually bequeathing (or rather, sharing) the connection – shame on me for not keeping up to speed with Oracle/Windows developments!! (I usually work with unix environments.)
Generating a client trace is easy – you just edit (or create) the text file sqlnet.ora and add two lines TRACE_DIRECTORY_CLIENT and TRACE_LEVEL_CLIENT (detailed description). If you’re using the instant client then you can set the environment variable TNS_ADMIN to the directory containing this file.
Here are a few lines from the client trace we generated:
2010-09-07 20:41:00.451918 : nscall:connecting... 2010-09-07 20:41:00.452854 : nsopen:opening transport... 2010-09-07 20:41:00.453086 : nttcni:Tcp conn timeout = 60000 (ms) 2010-09-07 20:41:00.504377 : nttcni:connected on ipaddr 10.10.0.97 2010-09-07 20:41:00.506948 : nscon:doing connect handshake... 2010-09-07 20:41:00.506979 : nscon:sending NSPTCN packet 2010-09-07 20:41:00.507113 : nttwr:socket 7 had bytes written=233 2010-09-07 20:41:00.567169 : nttrd:socket 7 had bytes read=8 2010-09-07 20:41:00.567434 : nscon:got NSPTRS packet 2010-09-07 20:41:00.567692 : nscon:sending NSPTCN packet 2010-09-07 20:41:00.567817 : nttwr:socket 7 had bytes written=233 2010-09-07 20:41:00.617139 : nttrd:socket 7 had bytes read=32 2010-09-07 20:41:00.617362 : nscon:got NSPTAC packet 2010-09-07 20:41:00.617453 : nscon:connect handshake is complete 2010-09-07 20:41:00.620093 : nsdofls:sending NSPTDA packet 2010-09-07 20:41:00.620253 : nttwr:socket 7 had bytes written=156 2010-09-07 20:41:00.667129 : nttrd:socket 7 had bytes read=127 2010-09-07 20:41:00.667359 : nsrdr:got NSPTDA packet 2010-09-07 20:41:00.670112 : nttfpwr:socket 7 had bytes written=47 2010-09-07 20:41:00.717287 : nttfprd:socket 7 had bytes read=181 2010-09-07 20:41:00.719612 : nttfpwr:socket 7 had bytes written=2011 2010-09-07 20:41:00.720401 : nttfpwr:socket 7 had bytes written=666 2010-09-07 20:42:01.577217 : ntt2err:Read unexpected EOF ERROR on 7
The presence of NSPTRS indicates that the connection is being bequeathed/shared on the server. Metalink notes 270096.1 and 1007807.1 are very helpful for understanding this handshake process. Here’s a brief outline of what’s happening:
One other important piece of information we eventually unearthed was this error in sqlnet.log on the server:
Fatal NI connect error 12170. ... Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12606
Both the server and client were seeing the connection spontaneously die! Yet the client connects fine to a 9i database, and a different client connects fine to this database!! Right about now, I was thinking of two possible root causes:
- Rogue firewall somewhere that’s killing the connection for some unknown reason. But the developer is pretty sure that there’s no firewall killing connections.
At this point I asked the developer to run a level 16 client trace (which includes packet dumps) and I started reading through the dump. Then I noticed two important things.
- There is a 60 second pause before the client detects the dead connection. This is equal to the TCP connection timeout. I suddenly realized that the client “EOF” error was not from the server terminating the connection – the client was timing out while waiting for the server. And then I realized that the server was doing the same thing – waiting for the client!
- I looked a lot closer at a certain section of the tracefile that immediately precedes this timeout. In particular, I looked at the packet contents and sizes – and I compared with the working 9i connection.
Here’s a snip from the diff between the working 9i connection and the broken 10g connection:
nttfpwr:socket 7 had bytes written=47 -nttfprd:socket 7 had bytes read=164 -nscontrol:Vect I/O support: 02010-09-07 20:48:06.447554 : nscontrol:cmd=44, lcl=0x0 -nttfpwr:socket 7 had bytes written=1080 -nttfprd:socket 7 had bytes read=949 -nttfpwr:socket 7 had bytes written=213 -nttfprd:socket 7 had bytes read=22 +nttfprd:socket 7 had bytes read=181 +nscontrol:Vect I/O support: 02010-09-07 20:52:12.267498 : nscontrol:cmd=44, lcl=0x0 +nttfpwr:socket 7 had bytes written=2011 +nttfpwr:socket 7 had bytes written=666 +ntt2err:Read unexpected EOF ERROR on 7
When I looked at the packet contents for the broken connection, it was immediately evident that those TWO packets (2011 bytes and 666 bytes) were ONE message. The message was too large for one packet so it had been fragmented.
However the first big message sent to the 9i database is 1080 bytes. It gets sent in a single packet.
I asked the developer, what’s the max packet size on the Linux box? He replied that the MTU was 1456.
WILD guess… I wondered, was the big packet getting dropped somewhere? (Windows client trace shows the large packet getting through without any problem, but still…)
Oracle is Better than MySQL
…ok, maybe the standard MySQL client has this functionality too – I actually don’t know. I just wrote that title because it’s provocative!! But when it comes to tweaks, Oracle really does shine. It has more switches and knobs than you’d believe.
It’s very simple to reduce the max packet size in the Oracle client: just add “(SDU=####)” to the connect string. The default is 2048 – which explains the fragment sizes we observed.
Here’s what transpired next in my chat (please excuse the language):
just tried with 1024 and it worked
Seems that we identified the problem! And most importantly, we have a working connection so that this developer can get back to what’s important – coding.
Seriously, there’s an awful lot of understanding that went into determining this solution. Network and Oracle knowledge combined in a way that can only come from experience.
I chatted with the developer again yesterday; the root problem seems to be one of these things:
Linux environment is Debian running in a VM (bridged), and the VPN tunnel terminates in the host OS… so maybe there’s a bug in the virt network drivers that’s brought out by bridging into a pptp tunnel? (i doubt that the bridging or pptp code itself would have a bug.)
developer saw indications that packets bigger than 1400 were getting an ICMP error from a few hops down the path… so maybe there’s a networking misconfig or bug somewhere (pptp/vpn tunnel)? and maybe the windows client has a smaller MTU so it was further fragmenting the big packet in the OS stack and thus avoiding this error?
These are guesses, and I have no evidence to prove either. Also, I’m not expecting to hear anything else – this guy has more important things to do. But you never know, occasionally if something really bothers me then I waste a lot of time figuring it out. (This post is a case in point!) I’ll leave an update if I hear anything else.
Jeremy, this very interesting network troubleshooting! thanks for sharing the references as well :)
hello! i encountered a problem that you will really find very interesting (and maybe you haven’t encountered never before).
we have 2 domain controllers and 50 sites! each one is a global catalog and also a dns server (win2k3) because we want them “play” even if we have not network (which is often)
oracle: 9i (9.0.2)
after 3 months down (meaning no network with us but the users there had no problem)a server comes to the network! happy? no…
we connect with desktop authority but using sqlplus we cannot connect to the database (ora-03113)
but!!! i can connect to the db from my computer with command prompt.. sqlplus/nolog .. connect sys/sys@…. as sysdba
did u undestand? i coonect to tha data remotely but not locally!
(that’s why the users had no problem since they connected to server from their client pc –win xp pro with developper 6i– and they worked perfectly!!)
reload dns zone (it was empty)!
next problem? another server came up ofter one month and unfortunately cannot connect nor locally nor remotely!
the problem is only to connect to the db (ora -03113) after connecting startup and mount encounters no problems…
I had same problem but following your advice I was able to fix it.
Thank you so much Jeremy.