Ardent Performance Computing

Jeremy Schneider

Search

>

Accessing Arbitrary Ancestors in Hierarchical Queries

Posted by Jeremy ⋅ May 23, 2007 ⋅ 2 Comments

Last Friday I wrote about a client query that could be rewritten in a much more optimized manner. However rewriting it was a little tricky, involving a hierarchical query and a column that needed to calculate a value based on it’s parent, grand-parent, great-grand-parent, etc. Raj posted a very smart solution that used sys_connect_by_path to access this information – which is exactly how I tried to do it at first. However after rethinking the query I realized that there was a better way to do it with analytics.

So here’s the approach that I took and how I arrived at the solution.

The Problem

The root problem was that I wanted a simple method to access any column value from an arbitrary level in the hierarchy above my current level. For example if you look at the TURNER record in the result set, his UPLINE_SAL column needs to be the sum of the SAL column from three different records:

ENAME                EMPNO VP_NAME           SAL UPLINE_SAL
--------------- ---------- ---------- ---------- ----------
KING                  7839                  5000       5000
 JONES                7566 JONES            2975       7975
  SCOTT               7788 JONES            3000      10975
   ADAMS              7876 JONES            1100      12075
  FORD                7902 JONES            3000      10975
   SMITH              7369 JONES             800      11775
 BLAKE                7698 BLAKE            2850       7850
  ALLEN               7499 BLAKE            1600       9450
  WARD                7521 BLAKE            1250       9100
  MARTIN              7654 BLAKE            1250       9100
  TURNER              7844 BLAKE            1500       9350
  JAMES               7900 BLAKE             950       8800
 CLARK                7782 CLARK            2450       7450
  MILLER              7934 CLARK            1300       8750

The Solution: Step 1

First what I’m going to do is create a few new fields. The purpose of these fields is going to be to identify an ancestor record. I will create one field for each level of the hierarchy where I might need to grab an ancestor. And to start with, I’m going to populate that field for the rows that actually belong to the specified level of the hierarchy. Might be easier to just show it, rather than trying to explain. :)

SQL> set linesize 120
SQL> set pagesize 40
SQL> col N format 99
SQL> col L format 99
SQL> col L1 format 99
SQL> col L2 format 99
SQL> col L3 format 99
SQL> col L4 format 99

select
  rownum N, 
  level L, 
  empno, 
  mgr, 
  ename, 
  decode(level,1,rownum,null) L1, 
  decode(level,2,rownum,null) L2, 
  decode(level,3,rownum,null) L3, 
  decode(level,4,rownum,null) L4
from emp
connect by mgr = prior empno start with mgr is null;

  N   L      EMPNO        MGR ENAME       L1  L2  L3  L4
--- --- ---------- ---------- ---------- --- --- --- ---
  1   1       7839            KING         1
  2   2       7566       7839 JONES            2
  3   3       7788       7566 SCOTT                3
  4   4       7876       7788 ADAMS                    4
  5   3       7902       7566 FORD                 5
  6   4       7369       7902 SMITH                    6
  7   2       7698       7839 BLAKE            7
  8   3       7499       7698 ALLEN                8
  9   3       7521       7698 WARD                 9
 10   3       7654       7698 MARTIN              10
 11   3       7844       7698 TURNER              11
 12   3       7900       7698 JAMES               12
 13   2       7782       7839 CLARK           13
 14   3       7934       7782 MILLER              14

14 rows selected.

Now you might ask, “what good does that do?” Good question… check this out: if I can “fill down” those values then I can use analytics to partition on them and grab the first_value of any field that I want! Don’t worry, I’ll take it one step at a time and it’ll make sense in a minute.

The Solution: Step 2

Next lets “fill down” the values. I’m also going to use subquery factoring to keep the code a little cleaner and more understandable:

with q1 as (
  select 
    rownum N, 
    level L, 
    ename, 
    decode(level,1,rownum,null) L1, 
    decode(level,2,rownum,null) L2, 
    decode(level,3,rownum,null) L3, 
    decode(level,4,rownum,null) L4
  from emp
  connect by mgr = prior empno start with mgr is null
)
select N, L, ename,
  max(L1) over (order by N) L1,
  max(L2) over (order by N) L2,
  max(L3) over (order by N) L3,
  max(L4) over (order by N) L4
from q1
order by N;

  N   L ENAME       L1  L2  L3  L4
--- --- ---------- --- --- --- ---
  1   1 KING         1
  2   2 JONES        1   2
  3   3 SCOTT        1   2   3
  4   4 ADAMS        1   2   3   4
  5   3 FORD         1   2   5   4
  6   4 SMITH        1   2   5   6
  7   2 BLAKE        1   7   5   6
  8   3 ALLEN        1   7   8   6
  9   3 WARD         1   7   9   6
 10   3 MARTIN       1   7  10   6
 11   3 TURNER       1   7  11   6
 12   3 JAMES        1   7  12   6
 13   2 CLARK        1  13  12   6
 14   3 MILLER       1  13  14   6

14 rows selected.

Looks good except for one problem: we actually only want to “fill down” values as long as we’re at the same level or lower in the hierarchy. For example, BLAKE is a “VP” (second level in the hierarchy) – he should not show FORD (N=5) at level 3 and SMITH (N=6) at level 4. He doesn’t have level 3 and 4 ancestors since he himself is at level 2. So what we’ll do is add a CASE statement so that records show themselves for all levels greater than or equal to their own.

with q1 as (
  select 
    rownum N, 
    level L, 
    ename, 
    decode(level,1,rownum,null) L1, 
    decode(level,2,rownum,null) L2, 
    decode(level,3,rownum,null) L3, 
    decode(level,4,rownum,null) L4
  from emp
  connect by mgr = prior empno start with mgr is null
)
select N, L, ename,
  case when (L>1) then (max(L1) over (order by N)) else (N) end L1,
  case when (L>2) then (max(L2) over (order by N)) else (N) end L2,
  case when (L>3) then (max(L3) over (order by N)) else (N) end L3,
  case when (L>4) then (max(L4) over (order by N)) else (N) end L4
from q1
order by N;

  N   L ENAME       L1  L2  L3  L4
--- --- ---------- --- --- --- ---
  1   1 KING         1   1   1   1
  2   2 JONES        1   2   2   2
  3   3 SCOTT        1   2   3   3
  4   4 ADAMS        1   2   3   4
  5   3 FORD         1   2   5   5
  6   4 SMITH        1   2   5   6
  7   2 BLAKE        1   7   7   7
  8   3 ALLEN        1   7   8   8
  9   3 WARD         1   7   9   9
 10   3 MARTIN       1   7  10  10
 11   3 TURNER       1   7  11  11
 12   3 JAMES        1   7  12  12
 13   2 CLARK        1  13  13  13
 14   3 MILLER       1  13  14  14

14 rows selected.

As you can see, BLAKE now shows himself at levels 3 and 4 (since technically he doesn’t have level 3 or 4 ancestors).

The Solution: Step 3

At this point getting the answer is easy. We can now use analytics (again) to simply partition on the field for the level we want and grab the first value. This allows us to easily grab any ancestor from the hierarchy without any additional access of the datasource! For example, to get the “VP” (ename from the second level) we simply use this expression:

first_value(ename) over (partition by L2 order by N)

There is one thing to be careful of: remember a level 1 record asking for it’s level 2 ancestor will return its own value – even though technically it doesn’t have a level two ancestor. Might be what you want… or maybe it could mess you up. For example if you’re mathematically adding the values of each level in the hierarchy then you don’t want to add the CEO’s salary four times!! In that case you’d need to use something like a CASE statement to check the level and only add the salary in if it’s appropriate.

This formula would give us the UPLINE_SAL value that we’re looking for:

case when (L>1) then (first_value(sal) over (partition by L1 order by N))
  else (0) end +
case when (L>2) then (first_value(sal) over (partition by L2 order by N))
  else (0) end +
case when (L>3) then (first_value(sal) over (partition by L3 order by N))
  else (0) end +
sal

The Solution: Final Answer

Putting all of that together and extending it to work with eight levels, this query will give the output I was looking for last Friday. It combines hierarchical query, analytics, and subquery factoring:

SQL> col ename format a15

with q1 as (
  select 
    rownum N, 
    level L, 
    empno, 
    mgr, 
    ename, 
    sal,
    decode(level,1,rownum,null) L1, 
    decode(level,2,rownum,null) L2, 
    decode(level,3,rownum,null) L3, 
    decode(level,4,rownum,null) L4,
    decode(level,5,rownum,null) L5, 
    decode(level,6,rownum,null) L6, 
    decode(level,7,rownum,null) L7, 
    decode(level,8,rownum,null) L8
  from emp where level1) then (max(L1) over (order by N)) else (N) end L1,
    case when (L>2) then (max(L2) over (order by N)) else (N) end L2,
    case when (L>3) then (max(L3) over (order by N)) else (N) end L3,
    case when (L>4) then (max(L4) over (order by N)) else (N) end L4,
    case when (L>5) then (max(L5) over (order by N)) else (N) end L5,
    case when (L>6) then (max(L6) over (order by N)) else (N) end L6,
    case when (L>7) then (max(L7) over (order by N)) else (N) end L7,
    case when (L>8) then (max(L8) over (order by N)) else (N) end L8
  from q1
) 
select lpad(' ',L-1)||ename ename, empno, 
  case when (L>=2) 
    then (first_value(ename) over (partition by L2 order by N))
    else (null) end
  vp_name,
  sal,
    case when (L>1) 
      then (first_value(sal) over (partition by L1 order by N)) 
      else (0) end +
    case when (L>2) 
      then (first_value(sal) over (partition by L2 order by N)) 
      else (0) end +
    case when (L>3) 
      then (first_value(sal) over (partition by L3 order by N)) 
      else (0) end +
    case when (L>4) 
      then (first_value(sal) over (partition by L4 order by N)) 
      else (0) end +
    case when (L>5) 
      then (first_value(sal) over (partition by L5 order by N)) 
      else (0) end +
    case when (L>6) 
      then (first_value(sal) over (partition by L6 order by N)) 
      else (0) end +
    case when (L>7) 
      then (first_value(sal) over (partition by L7 order by N)) 
      else (0) end +
    sal
  upline_sal
from q2
order by N;

ENAME                EMPNO VP_NAME           SAL UPLINE_SAL
--------------- ---------- ---------- ---------- ----------
KING                  7839                  5000       5000
 JONES                7566 JONES            2975       7975
  SCOTT               7788 JONES            3000      10975
   ADAMS              7876 JONES            1100      12075
  FORD                7902 JONES            3000      10975
   SMITH              7369 JONES             800      11775
 BLAKE                7698 BLAKE            2850       7850
  ALLEN               7499 BLAKE            1600       9450
  WARD                7521 BLAKE            1250       9100
  MARTIN              7654 BLAKE            1250       9100
  TURNER              7844 BLAKE            1500       9350
  JAMES               7900 BLAKE             950       8800
 CLARK                7782 CLARK            2450       7450
  MILLER              7934 CLARK            1300       8750

14 rows selected.

Pretty simple and yet very powerful and flexible! But… how does it perform?

Performance Analysis

Let’s take a look at the plan. Granted, this is on a tiny dataset – but it’s a starting point.

SQL> select * from plan;

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
Plan hash value: 833053088

------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   1 |  SORT ORDER BY                          |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   2 |   WINDOW SORT                           |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   3 |    WINDOW SORT                          |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   4 |     WINDOW SORT                         |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   5 |      WINDOW SORT                        |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   6 |       WINDOW SORT                       |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   7 |        WINDOW SORT                      |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   8 |         WINDOW SORT                     |      |    14 |  2100 |    13  (77)| 00:00:01 |
|   9 |          WINDOW SORT                    |      |    14 |  2100 |    13  (77)| 00:00:01 |
|  10 |           VIEW                          |      |    14 |  2100 |     4  (25)| 00:00:01 |
|  11 |            WINDOW SORT                  |      |    14 |  2100 |     4  (25)| 00:00:01 |
|  12 |             VIEW                        |      |    14 |  2100 |     3   (0)| 00:00:01 |
|  13 |              COUNT                      |      |       |       |            |          |
|* 14 |               FILTER                    |      |       |       |            |          |
|* 15 |                CONNECT BY WITH FILTERING|      |       |       |            |          |
|* 16 |                 FILTER                  |      |       |       |            |          |
|  17 |                  TABLE ACCESS FULL      | EMP  |    14 |   252 |     3   (0)| 00:00:01 |
|* 18 |                 HASH JOIN               |      |       |       |            |          |
|  19 |                  CONNECT BY PUMP        |      |       |       |            |          |
|  20 |                  TABLE ACCESS FULL      | EMP  |    14 |   252 |     3   (0)| 00:00:01 |
|  21 |                 TABLE ACCESS FULL       | EMP  |    14 |   252 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  14 - filter(LEVEL<=8)
  15 - filter("MGR" IS NULL)
  16 - filter("MGR" IS NULL)
  18 - access("MGR"=NULL)

36 rows selected.

There are two points worth mentioning:

Minimal table access. Line 12 in the plan represents the first subquery, q1. This is where we do the hierarchical join. After that, all of the ancestor values are pulled without any additional table access; the values are just “remembered” since the rows are coming from the subquery in order (ancestors always come before children). Also note that the reason for full table scans here is because this is the SCOTT.EMP table, it’s tiny, and we’re reading all of the data – it’s the most efficient access path in this case.
The sorts are actually redundant. We know something the optimizer doesn’t: the data will never actually need to be resorted. Oracle doesn’t know it, but the fields L1-L8 (which are being sorted here) are actually already in order because they’re coming from the rowsource ordered by N. If you look at the sample data above you’ll see that L1-L8 are always in ascending order.

So not only do we have a simple and powerful solution, but we also have a fast one. Who knows… there might be an even better way to do this but I thought that this one seemed pretty good and worth documenting!

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider

View all posts by Jeremy »

Discussion

2 thoughts on “Accessing Arbitrary Ancestors in Hierarchical Queries”

Very cool — nice explanation too.

LikeLike

Posted by Dominic Delmolino | May 23, 2007, 1:25 pm
In “The Solution: Step 2”
you have done
max(L1) over (order by N) L1,
max(L2) over (order by N) L2,
max(L3) over (order by N) L3,
max(L4) over (order by N) L4
I am little confused with “order by N” how it works.If I am doing “order by L” I am geting totaly different result.Since wrt to 1 column there is only 1 value for max analytical function if partition clause is not mentioned then how this is working pls explain .Thanks

LikeLike

Posted by Ravi | July 31, 2007, 12:33 am

Ardent Performance Computing

Ardent Performance Computing

Search

Accessing Arbitrary Ancestors in Hierarchical Queries

The Problem

The Solution: Step 1

The Solution: Step 2

The Solution: Step 3

The Solution: Final Answer

Performance Analysis

About Jeremy

Discussion

2 thoughts on “Accessing Arbitrary Ancestors in Hierarchical Queries”

Disclaimer

Email Updates

Recent Posts

Recent Comments

Ardent Performance Computing

Search

Accessing Arbitrary Ancestors in Hierarchical Queries

The Problem

The Solution: Step 1

The Solution: Step 2

The Solution: Step 3

The Solution: Final Answer

Performance Analysis

Share this:

Related

About Jeremy

Discussion

2 thoughts on “Accessing Arbitrary Ancestors in Hierarchical Queries”

Disclaimer

Email Updates

Recent Posts

Recent Comments