Case Study: Statspack/AWR Latch Waits (Part 1)

Posted by Jeremy ⋅ June 27, 2007 ⋅ 5 Comments

I thought it might be interesting to write about a situation I ran into last week and a bit of the methodology I used to tackle it.

The whole thing started as a health check for a PeopleSoft database. To give a little background, the whole environment lives on a 32-processor IBM p690 partitioned into 10 LPARs. There are local mirrored disks for operating systems but an EMC Symmetrix does the heavy lifting for PeopleSoft and Oracle. The sym has a uniform configuration that mainly consists of 9G hypers and 100G metas.

I started by logging into the development system. It has 16G of memory, three CPUs and two dedicated fiber paths to the sym. It uses 9 metas but mangles them through AIX’s LVM. Two are in the first volume group and provide 15 mount points for various software binaries. The other seven are in a second volume group and provide 87 mount points for datafiles. JFS2 is used on all of these logical volumes. There are 16 non-production databases on this server however only 9 or 10 were running when I logged in.

I opened up SQLPlus and I remember running some really basic query (probably against something like v$database or dual) – and waiting for about 5 seconds for it to return. So then I tried an explain plan on the previously attempted simple SQL – and waited another 5-10 seconds for this to return.

That doesn’t seem like normal performance to me. :)

Starting a Performance Investigation

Despite a bit of bad press that statspack has received lately, I still find it (and AWR reports – the optional and expensive 10g counterpart) to be a good starting point for these investigations. And anyhow, many shortcomings – like the fact that prior to 10gR2 it doesn’t include stats for in-progress long-running SQL – generally have a bigger impact on DW environments.

However before collecting any data there is always a critical first step: understanding your system workload. For consultants this is especially important (albeit challenging); we start with relatively little knowledge of the system and hope to offer quality assistance in a short timeframe.

Like most transactional systems, this one is used mostly during the day and has more of a batch workload at night. It is a PeopleSoft Development database – virtually no concurrent users; just a handful of developers working on new features and testing 10g. Not a high amount of activity. The SGA is pretty small: 400M buffer cache, 100M sql area, 25M library cache.

After looking a little more closely we decided that the user workload was best defined as 6am to 6pm; so as a starting point we looked at the AWR report for that time period. The first thing that jumped out was at the very top of the report: very poor ratios for Execute to Parse % and Parse CPU to Parse Elapsd %.

Now before I go any further, it’s important to point out that much has been written about the use of ratios with database tuning… and I certainly don’t want to perpetuate any myths that every database with low ratios is unhealthy. However I think that ratios can be helpful – particularly when I’m trying to understand the workload on a database. It’s really a starting point that leads to the question: “why?”

In this case it appears that while parsing SQL, Oracle is spending a lot of time waiting for something. Now PeopleSoft does have a bit of a reputation for writing literal SQL… but 20% for the parse CPU/elapsed? That seems a bit excessive. So we looked at the top timed events in the report… and observed some very high wait times for both latch: library cache and latch: shared pool.

Latch Waits

The first thing that jumped out at me on that report was the average wait times on the latches – 200ms! On the NetApp based system in Phoenix these latches were typically acquired in well under 10ms. Also it’s very important to keep in mind the law of averages, which Jonathan Lewis colorfully articulated: “if you put your head in an ice bucket and your feet in a fire, on average you should be perfectly comfortable.” We don’t know yet whether every attempt to get this latch takes 200ms or if one attempt took 10 minutes.

Ok… so any guesses about what is happening? In fact if you’re curious you can even have a look at the complete AWR report (I’ve only changed the DB and host names). Personally I’m not sure you can identify it from the report – I had to dig deeper; and tomorrow I’ll post about that. But I’m curious if anyone who’s a bit better at this than I am can spot something here that I didn’t.

Interesting stuff! And more to come tomorrow…

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider

View all posts by Jeremy »

Discussion

5 thoughts on “Case Study: Statspack/AWR Latch Waits (Part 1)”

I’ll have to tune in for the next installment on this. I don’t like the 12hour range – too much opportunity for flattened averages, especially on a system that doesn’t seem to be doing much.

I think there’s could be a lot of table truncating going on that is causing invalidation of SQL statements and pl/sql procedures. I think the problem may be made worse by the number of histograms in existence.

The various statistics about constraints don’t really seem to be consistent. You don’t seem to query them much, don’t have many, and yet work very hard when you need to see them. Maybe the root cause is in the history, with the con$ table or c_obj# cluster being smashed by lots of create/drop activity on objects and constraints.

LikeLike

Posted by Jonathan Lewis | June 27, 2007, 1:21 pm
“insert into argument$” over 10,000 executions.
“delete from source$” over 1,000 executions.
Looks like they are creating and dropping a lot of PL/SQL.
If it’s development, I wonder if they’ve got something that is dropping/restoring database objects (maybe flashback database ?) to a ‘production/testing-like’ state.

LikeLike

Posted by Gary | June 27, 2007, 5:05 pm
I would not expect that the latch waits have anything to do with “NetApp”. Latch waits could be also be high if the server was starved of CPU and/or a very high rate of context-switching is occurring. As Jonathan says “I’ll have to tune in for the next installment on this”.

LikeLike

Posted by Hemant | June 29, 2007, 9:26 am