Last Friday I wrote about a client query that could be rewritten in a much more optimized manner. However rewriting it was a little tricky, involving a hierarchical query and a column that needed to calculate a value based on it’s parent, grand-parent, great-grand-parent, etc. Raj posted a very smart solution that used sys_connect_by_path to access this information – which is exactly how I tried to do it at first. However after rethinking the query I realized that there was a better way to do it with analytics.
So here’s the approach that I took and how I arrived at the solution.
The Problem
The root problem was that I wanted a simple method to access any column value from an arbitrary level in the hierarchy above my current level. For example if you look at the TURNER record in the result set, his UPLINE_SAL column needs to be the sum of the SAL column from three different records:
ENAME EMPNO VP_NAME SAL UPLINE_SAL
--------------- ---------- ---------- ---------- ----------
KING 7839 5000 5000
JONES 7566 JONES 2975 7975
SCOTT 7788 JONES 3000 10975
ADAMS 7876 JONES 1100 12075
FORD 7902 JONES 3000 10975
SMITH 7369 JONES 800 11775
BLAKE 7698 BLAKE 2850 7850
ALLEN 7499 BLAKE 1600 9450
WARD 7521 BLAKE 1250 9100
MARTIN 7654 BLAKE 1250 9100
TURNER 7844 BLAKE 1500 9350
JAMES 7900 BLAKE 950 8800
CLARK 7782 CLARK 2450 7450
MILLER 7934 CLARK 1300 8750
The Solution: Step 1
First what I’m going to do is create a few new fields. The purpose of these fields is going to be to identify an ancestor record. I will create one field for each level of the hierarchy where I might need to grab an ancestor. And to start with, I’m going to populate that field for the rows that actually belong to the specified level of the hierarchy. Might be easier to just show it, rather than trying to explain. :)
SQL> set linesize 120
SQL> set pagesize 40
SQL> col N format 99
SQL> col L format 99
SQL> col L1 format 99
SQL> col L2 format 99
SQL> col L3 format 99
SQL> col L4 format 99
select
rownum N,
level L,
empno,
mgr,
ename,
decode(level,1,rownum,null) L1,
decode(level,2,rownum,null) L2,
decode(level,3,rownum,null) L3,
decode(level,4,rownum,null) L4
from emp
connect by mgr = prior empno start with mgr is null;
N L EMPNO MGR ENAME L1 L2 L3 L4
--- --- ---------- ---------- ---------- --- --- --- ---
1 1 7839 KING 1
2 2 7566 7839 JONES 2
3 3 7788 7566 SCOTT 3
4 4 7876 7788 ADAMS 4
5 3 7902 7566 FORD 5
6 4 7369 7902 SMITH 6
7 2 7698 7839 BLAKE 7
8 3 7499 7698 ALLEN 8
9 3 7521 7698 WARD 9
10 3 7654 7698 MARTIN 10
11 3 7844 7698 TURNER 11
12 3 7900 7698 JAMES 12
13 2 7782 7839 CLARK 13
14 3 7934 7782 MILLER 14
14 rows selected.
Now you might ask, “what good does that do?” Good question… check this out: if I can “fill down” those values then I can use analytics to partition on them and grab the first_value of any field that I want! Don’t worry, I’ll take it one step at a time and it’ll make sense in a minute.
The Solution: Step 2
Next lets “fill down” the values. I’m also going to use subquery factoring to keep the code a little cleaner and more understandable:
with q1 as (
select
rownum N,
level L,
ename,
decode(level,1,rownum,null) L1,
decode(level,2,rownum,null) L2,
decode(level,3,rownum,null) L3,
decode(level,4,rownum,null) L4
from emp
connect by mgr = prior empno start with mgr is null
)
select N, L, ename,
max(L1) over (order by N) L1,
max(L2) over (order by N) L2,
max(L3) over (order by N) L3,
max(L4) over (order by N) L4
from q1
order by N;
N L ENAME L1 L2 L3 L4
--- --- ---------- --- --- --- ---
1 1 KING 1
2 2 JONES 1 2
3 3 SCOTT 1 2 3
4 4 ADAMS 1 2 3 4
5 3 FORD 1 2 5 4
6 4 SMITH 1 2 5 6
7 2 BLAKE 1 7 5 6
8 3 ALLEN 1 7 8 6
9 3 WARD 1 7 9 6
10 3 MARTIN 1 7 10 6
11 3 TURNER 1 7 11 6
12 3 JAMES 1 7 12 6
13 2 CLARK 1 13 12 6
14 3 MILLER 1 13 14 6
14 rows selected.
Looks good except for one problem: we actually only want to “fill down” values as long as we’re at the same level or lower in the hierarchy. For example, BLAKE is a “VP” (second level in the hierarchy) – he should not show FORD (N=5) at level 3 and SMITH (N=6) at level 4. He doesn’t have level 3 and 4 ancestors since he himself is at level 2. So what we’ll do is add a CASE statement so that records show themselves for all levels greater than or equal to their own.
with q1 as (
select
rownum N,
level L,
ename,
decode(level,1,rownum,null) L1,
decode(level,2,rownum,null) L2,
decode(level,3,rownum,null) L3,
decode(level,4,rownum,null) L4
from emp
connect by mgr = prior empno start with mgr is null
)
select N, L, ename,
case when (L>1) then (max(L1) over (order by N)) else (N) end L1,
case when (L>2) then (max(L2) over (order by N)) else (N) end L2,
case when (L>3) then (max(L3) over (order by N)) else (N) end L3,
case when (L>4) then (max(L4) over (order by N)) else (N) end L4
from q1
order by N;
N L ENAME L1 L2 L3 L4
--- --- ---------- --- --- --- ---
1 1 KING 1 1 1 1
2 2 JONES 1 2 2 2
3 3 SCOTT 1 2 3 3
4 4 ADAMS 1 2 3 4
5 3 FORD 1 2 5 5
6 4 SMITH 1 2 5 6
7 2 BLAKE 1 7 7 7
8 3 ALLEN 1 7 8 8
9 3 WARD 1 7 9 9
10 3 MARTIN 1 7 10 10
11 3 TURNER 1 7 11 11
12 3 JAMES 1 7 12 12
13 2 CLARK 1 13 13 13
14 3 MILLER 1 13 14 14
14 rows selected.
As you can see, BLAKE now shows himself at levels 3 and 4 (since technically he doesn’t have level 3 or 4 ancestors).
The Solution: Step 3
At this point getting the answer is easy. We can now use analytics (again) to simply partition on the field for the level we want and grab the first value. This allows us to easily grab any ancestor from the hierarchy without any additional access of the datasource! For example, to get the “VP” (ename from the second level) we simply use this expression:
first_value(ename) over (partition by L2 order by N)
There is one thing to be careful of: remember a level 1 record asking for it’s level 2 ancestor will return its own value – even though technically it doesn’t have a level two ancestor. Might be what you want… or maybe it could mess you up. For example if you’re mathematically adding the values of each level in the hierarchy then you don’t want to add the CEO’s salary four times!! In that case you’d need to use something like a CASE statement to check the level and only add the salary in if it’s appropriate.
This formula would give us the UPLINE_SAL value that we’re looking for:
case when (L>1) then (first_value(sal) over (partition by L1 order by N))
else (0) end +
case when (L>2) then (first_value(sal) over (partition by L2 order by N))
else (0) end +
case when (L>3) then (first_value(sal) over (partition by L3 order by N))
else (0) end +
sal
The Solution: Final Answer
Putting all of that together and extending it to work with eight levels, this query will give the output I was looking for last Friday. It combines hierarchical query, analytics, and subquery factoring:
SQL> col ename format a15
with q1 as (
select
rownum N,
level L,
empno,
mgr,
ename,
sal,
decode(level,1,rownum,null) L1,
decode(level,2,rownum,null) L2,
decode(level,3,rownum,null) L3,
decode(level,4,rownum,null) L4,
decode(level,5,rownum,null) L5,
decode(level,6,rownum,null) L6,
decode(level,7,rownum,null) L7,
decode(level,8,rownum,null) L8
from emp where level1) then (max(L1) over (order by N)) else (N) end L1,
case when (L>2) then (max(L2) over (order by N)) else (N) end L2,
case when (L>3) then (max(L3) over (order by N)) else (N) end L3,
case when (L>4) then (max(L4) over (order by N)) else (N) end L4,
case when (L>5) then (max(L5) over (order by N)) else (N) end L5,
case when (L>6) then (max(L6) over (order by N)) else (N) end L6,
case when (L>7) then (max(L7) over (order by N)) else (N) end L7,
case when (L>8) then (max(L8) over (order by N)) else (N) end L8
from q1
)
select lpad(' ',L-1)||ename ename, empno,
case when (L>=2)
then (first_value(ename) over (partition by L2 order by N))
else (null) end
vp_name,
sal,
case when (L>1)
then (first_value(sal) over (partition by L1 order by N))
else (0) end +
case when (L>2)
then (first_value(sal) over (partition by L2 order by N))
else (0) end +
case when (L>3)
then (first_value(sal) over (partition by L3 order by N))
else (0) end +
case when (L>4)
then (first_value(sal) over (partition by L4 order by N))
else (0) end +
case when (L>5)
then (first_value(sal) over (partition by L5 order by N))
else (0) end +
case when (L>6)
then (first_value(sal) over (partition by L6 order by N))
else (0) end +
case when (L>7)
then (first_value(sal) over (partition by L7 order by N))
else (0) end +
sal
upline_sal
from q2
order by N;
ENAME EMPNO VP_NAME SAL UPLINE_SAL
--------------- ---------- ---------- ---------- ----------
KING 7839 5000 5000
JONES 7566 JONES 2975 7975
SCOTT 7788 JONES 3000 10975
ADAMS 7876 JONES 1100 12075
FORD 7902 JONES 3000 10975
SMITH 7369 JONES 800 11775
BLAKE 7698 BLAKE 2850 7850
ALLEN 7499 BLAKE 1600 9450
WARD 7521 BLAKE 1250 9100
MARTIN 7654 BLAKE 1250 9100
TURNER 7844 BLAKE 1500 9350
JAMES 7900 BLAKE 950 8800
CLARK 7782 CLARK 2450 7450
MILLER 7934 CLARK 1300 8750
14 rows selected.
Pretty simple and yet very powerful and flexible! But… how does it perform?
Performance Analysis
Let’s take a look at the plan. Granted, this is on a tiny dataset – but it’s a starting point.
SQL> select * from plan;
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
Plan hash value: 833053088
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 1 | SORT ORDER BY | | 14 | 2100 | 13 (77)| 00:00:01 |
| 2 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 3 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 4 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 5 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 6 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 7 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 8 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 9 | WINDOW SORT | | 14 | 2100 | 13 (77)| 00:00:01 |
| 10 | VIEW | | 14 | 2100 | 4 (25)| 00:00:01 |
| 11 | WINDOW SORT | | 14 | 2100 | 4 (25)| 00:00:01 |
| 12 | VIEW | | 14 | 2100 | 3 (0)| 00:00:01 |
| 13 | COUNT | | | | | |
|* 14 | FILTER | | | | | |
|* 15 | CONNECT BY WITH FILTERING| | | | | |
|* 16 | FILTER | | | | | |
| 17 | TABLE ACCESS FULL | EMP | 14 | 252 | 3 (0)| 00:00:01 |
|* 18 | HASH JOIN | | | | | |
| 19 | CONNECT BY PUMP | | | | | |
| 20 | TABLE ACCESS FULL | EMP | 14 | 252 | 3 (0)| 00:00:01 |
| 21 | TABLE ACCESS FULL | EMP | 14 | 252 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
14 - filter(LEVEL<=8)
15 - filter("MGR" IS NULL)
16 - filter("MGR" IS NULL)
18 - access("MGR"=NULL)
36 rows selected.
There are two points worth mentioning:
-
Minimal table access. Line 12 in the plan represents the first subquery, q1. This is where we do the hierarchical join. After that, all of the ancestor values are pulled without any additional table access; the values are just “remembered” since the rows are coming from the subquery in order (ancestors always come before children). Also note that the reason for full table scans here is because this is the SCOTT.EMP table, it’s tiny, and we’re reading all of the data – it’s the most efficient access path in this case.
-
The sorts are actually redundant. We know something the optimizer doesn’t: the data will never actually need to be resorted. Oracle doesn’t know it, but the fields L1-L8 (which are being sorted here) are actually already in order because they’re coming from the rowsource ordered by N. If you look at the sample data above you’ll see that L1-L8 are always in ascending order.
So not only do we have a simple and powerful solution, but we also have a fast one. Who knows… there might be an even better way to do this but I thought that this one seemed pretty good and worth documenting!
Very cool — nice explanation too.
LikeLike
In “The Solution: Step 2”
you have done
max(L1) over (order by N) L1,
max(L2) over (order by N) L2,
max(L3) over (order by N) L3,
max(L4) over (order by N) L4
I am little confused with “order by N” how it works.If I am doing “order by L” I am geting totaly different result.Since wrt to 1 column there is only 1 value for max analytical function if partition clause is not mentioned then how this is working pls explain .Thanks
LikeLike