Planet, PostgreSQL, Technical

PostgreSQL Performance Puzzle

Update 7/18: y’all guessed the answer to my puzzle very fast! At the bottom of this blog post, I’ve added the expressions that I used (with credit to those who guessed them), along with other expressions guessed by readers that exhibited similar behavior. I’ll do another follow-up blog post touching on several layers of this interesting test, hopefully in the next few days.


This week I stumbled across a really big surprise in Open Source PostgreSQL. I thought it would make a fun puzzle to challenge the skills of all of you on the internet who consider yourselves PostgreSQL experts.

It’s very simple. I spun up a fresh copy of CentOS 7 on an EC2 instance and loaded the latest Open Source PostgreSQL production version (14.4) from the official PG yum repositories. Then I created a simple table with a text column and two not-null bigint columns. I populated the table with 100 million rows of dummy data using generate_series() and then I indexed both bigint columns in the same way. I did a vacuum analyze to gather stats and make sure the table was ready for a simple test. And finally, I did a quick bounce of PG and cleared the OS page cache so we could start the test cold.

Now I ran a very simple query:

select count(mydata) from test where mynumber1 < 500000;

Now here’s what shocked me. Starting with a cold cache every time, that query right there consistently completes in less than 500 milliseconds. But when I run the same query on the other number column, it consistently takes over 100 seconds! Yes, seconds – not milliseconds. With exactly the same execution plan, processing the same number of rows! (500 thousand rows, or 0.5% of the table.)
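
(Side note: for anyone reproducing this, an easy way to confirm that both queries really do get the same plan, and to see how much I/O each one performs, is EXPLAIN with the ANALYZE and BUFFERS options. The two statements below aren’t part of my transcript; they’re just a suggestion for your own digging.)

-- not part of the original transcript: compare plans and buffer counts for the two queries
explain (analyze, buffers) select count(mydata) from test where mynumber1 < 500000;
explain (analyze, buffers) select count(mydata) from test where mynumber2 < 500000;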

Now I’m not surprised to see specific data in a table impacting performance – that’s 101 for anyone who’s been around RDBMS performance. But the magnitude of this difference was way higher than I expected. I can repeat the two queries over and over with the same difference in execution times.

Again: it’s a simple table with a text column and two not-null bigint columns. The two bigint columns are defined identically – same data type, same index type, same execution plan, same number of rows processed. The only difference is the data in the table. There are no null values involved. And the difference in execution time is astronomical – as much as 600x slower for some of my iterations!!

Can you guess what the data in the table is?
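
(If you want a hint before you scroll down for the answer: the planner statistics that ANALYZE collects already contain a clue about how differently the two columns are laid out. A query along these lines, which isn’t in my transcript, would show it; interpreting the output is left as part of the puzzle.)

-- not part of the original transcript: planner statistics for the two indexed columns
select attname, n_distinct, correlation
  from pg_stats
 where schemaname = 'public' and tablename = 'test';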

Included below is a complete and detailed transcript of the reproduction – I’ve just excluded the expressions that I used to populate the test table. I’m curious if anyone will be able to guess what those expressions were!!

If anyone sends me a guess in the blog comments or on twitter, I’ll test out their expression using the transcript below. Then I’ll credit anyone who sends a guess that exhibits the same shocking difference in performance between the two columns. Fame and glory, my friends. In a week or so (maybe sooner if someone guesses it really fast), I’ll publish the two expressions that I redacted from my session transcript below.

Major, big-time bonus fame and glory if anyone can explain the reason. 😀

PS. Mohamed, if you’re reading this, you’re not allowed to guess! (He found it first but we worked together to figure out what was going on.)


instance_type: r5.large
volume_type: gp2
volume_size: 100

# CentOS Linux 7 x86_64 HVM EBS ENA 2002_01
us-west-2: ami-0bc06212a56393ee1

[root@ip-172-31-36-129 ~]# rpm -q postgresql14-server
postgresql14-server-14.4-1PGDG.rhel7.x86_64
[root@ip-172-31-36-129 ~]#

pg-14.4 rw root@db1=# create table test ( mydata text, mynumber1 bigint not null, mynumber2 bigint not null);
CREATE TABLE
Time: 6.617 ms

pg-14.4 rw root@db1=# insert into test
[more] - >   select 'data-XXXXXXXXXXXX'
[more] - >     , <GUESS THIS FIRST EXPRESSION>
[more] - >     , <GUESS THIS SECOND EXPRESSION>
[more] - >   from generate_series(1,100000000) as n
[more] - > ;
INSERT 0 100000000
Time: 224101.677 ms (03:44.102)

pg-14.4 rw root@db1=# create index test_mynumber1 on test(mynumber1);
CREATE INDEX
Time: 229785.314 ms (03:49.785)

pg-14.4 rw root@db1=# create index test_mynumber2 on test(mynumber2);
CREATE INDEX
Time: 108900.146 ms (01:48.900)

pg-14.4 rw root@db1=# vacuum verbose analyze test;
INFO:  00000: vacuuming "public.test"
LOCATION:  lazy_scan_heap, vacuumlazy.c:940
INFO:  00000: launched 1 parallel vacuum worker for index cleanup (planned: 1)
LOCATION:  do_parallel_vacuum_or_cleanup, vacuumlazy.c:2766
INFO:  00000: table "test": found 0 removable, 79913800 nonremovable row versions in 665949 out of 833334 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 757
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 8.44 s, system: 3.97 s, elapsed: 96.48 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1674
INFO:  00000: vacuuming "pg_toast.pg_toast_16454"
LOCATION:  lazy_scan_heap, vacuumlazy.c:940
INFO:  00000: table "pg_toast_16454": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 758
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1674
INFO:  00000: analyzing "public.test"
LOCATION:  do_analyze_rel, analyze.c:331
INFO:  00000: "test": scanned 30000 of 833334 pages, containing 3600000 live rows and 0 dead rows; 30000 rows in sample, 100000080 estimated total rows
LOCATION:  acquire_sample_rows, analyze.c:1357
VACUUM
Time: 100010.196 ms (01:40.010)



[root@ip-172-31-36-129 ~]# service postgresql-14 restart;
Redirecting to /bin/systemctl restart postgresql-14.service
[root@ip-172-31-36-129 ~]# sync; echo 1 > /proc/sys/vm/drop_caches
[root@ip-172-31-36-129 ~]#



pg-14.4 rw root@db1=# select count(mydata) from test where mynumber1<500000;
 count
--------
 499999
(1 row)

Time: 155.314 ms



[root@ip-172-31-36-129 ~]# service postgresql-14 restart;
Redirecting to /bin/systemctl restart postgresql-14.service
[root@ip-172-31-36-129 ~]# sync; echo 1 > /proc/sys/vm/drop_caches
[root@ip-172-31-36-129 ~]#



pg-14.4 rw root@db1=# select count(mydata) from test where mynumber2<500000;
 count
--------
 499180
(1 row)

Time: 105202.086 ms (01:45.202)



[root@ip-172-31-36-129 ~]# service postgresql-14 restart;
Redirecting to /bin/systemctl restart postgresql-14.service
[root@ip-172-31-36-129 ~]# sync; echo 1 > /proc/sys/vm/drop_caches
[root@ip-172-31-36-129 ~]#


pg-14.4 rw root@db1=# select count(mydata) from test where mynumber1<500000;
 count
--------
 499999
(1 row)

Time: 494.063 ms



[root@ip-172-31-36-129 ~]# service postgresql-14 restart;
Redirecting to /bin/systemctl restart postgresql-14.service
[root@ip-172-31-36-129 ~]# sync; echo 1 > /proc/sys/vm/drop_caches
[root@ip-172-31-36-129 ~]#



pg-14.4 rw root@db1=# select count(mydata) from test where mynumber2<500000;
 count
--------
 499180
(1 row)

Time: 104084.311 ms (01:44.084)

Update 7/18: Here’s the full INSERT statement from my original test:

pg-14.4 rw root@db1=# insert into test
[more] - >   select 'data-XXXXXXXXXXXX'
[more] - >     ,n
[more] - >     ,random()*100000000
[more] - >   from generate_series(1,100000000) as n
[more] - > ;
INSERT 0 100000000
Time: 224101.677 ms (03:44.102)
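
(One simple way to see why random() matters so much here, and not something I ran in my original session: look at the physical location of a few matching rows from each column using the hidden ctid column, which shows the heap page and tuple number of each row. The low values of mynumber1 sit together on the first few heap pages, while equally low values of mynumber2 are scattered across the whole table.)

-- not part of the original session: ctid is (heap page, tuple number) for each row
select ctid, mynumber1 from test where mynumber1 < 10 order by mynumber1;
select ctid, mynumber2 from test where mynumber2 < 10 order by mynumber2;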

The very first person to correctly guess was Franck Pachot via a twitter DM at 06:05 UTC on the 17th. The first person to correctly guess with a public post was donghual@ in a blog comment at 14:33 UTC on the 17th.

At 14:48 UTC on the 17th, RustProof Labs on twitter guessed the idea behind what I did, although the explanation didn’t include an expression for me to test and measure. At 00:14 UTC on the 18th, Denish Patel made a guess on twitter that technically didn’t match, but I’d still give him credit because his explanation suggests it was just a typo. I didn’t run the test, but I would expect similar behavior.

Two people also guessed other expressions that showed similar behavior, though at less than 100 seconds per execution. At 06:38 UTC on the 17th, Andy Sayer on twitter guessed mod(n, 200)*500000, which gave about 50 seconds per execution. At 16:26 UTC on the 18th, Joe Wildish in a blog comment guessed case n % 120 when 1 then n / 120 else 500000 + n end, which gave about 30 seconds per execution. Joe was building on a blog comment from Matthis at 09:15 UTC on the 17th, which also deserves credit, because Matthis’s explanation shows he had the right idea and his expression just needed a little tweaking to avoid the integer out of range error.

Obviously I made this puzzle way too easy! 😀

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider

Discussion

15 thoughts on “PostgreSQL Performance Puzzle”

  1. I think the problem is the cardinality of the values. One possible solution could be:

    insert into test
      select 'data-XXXXXXXXXXXX'
        ,n
        ,1
      from generate_series(1,100000000) as n;
    


    Posted by Stefan Pastrilov | July 17, 2022, 12:16 am
    • Good guess! However it wouldn’t process the same number of rows: with that data the second query (where mynumber2<500000) would match 100 million rows instead of around 500 thousand.


      Posted by Jeremy | July 18, 2022, 9:34 am
  2. First expression:

    n

    Second expression:

    10000000 * (n % 10000000) + n / 10000000

    With this, the numbers in the first column are 0, 1, 2, 3, …
    In the second column we have 0, 10000000, 20000000, …, 1, 10000001, 20000001, …, 3, 10000003, …

    The indices will be of the default type, btree. When running the simple query, the btree index is used.

    Since all our numbers in the first column are in order, we have a naturally clustered index. This means that using (scanning) the index, we first read all entries from the first data page, then all entries from the second data page etc. All in all we read 0.5% of all data pages.

    But since all numbers in the second column are semi-shuffled, we have to read all data pages. Just like with the first column, we use the index to find the data page for the first number (0). It’s on the first data page, so we load that page. Then we continue to the second number (1) on the index. Sadly, 1 is 10000000 records after 0, quite surely on another data page. So we now have to load again. And again when reading 3, and 4, …

    This did not happen when querying the first column. There we looked up 0 on the index, loaded the data page containing 0, continued to 1 on the index, saw that it’s on the same page as 0, so no loading necessary, continued to 2 on the index, again no loading necessary, …

    Here is also a visual explanation: https://imgur.com/a/W0IFJqZ

    So during the fast first query, we read 0.5% of all data pages and during the second, slow query of the distributed/shuffled data we read all pages.


    Posted by Matthis | July 17, 2022, 2:15 am
    • I think a few people mentioned it on twitter before the timestamp of this message, but I’m pretty sure you have the distinction of being the first person to mention clustering factor in a blog comment – nice! I reached the same conclusion that this is part of the story (but not the whole story).

      So I think the idea behind your second expression is right; however, there was one small problem when I ran the test:

      pg-14.4 rw root@db1=# insert into test
      [more] - >    select 'data-XXXXXXXXXXXX'
      [more] - >      ,n
      [more] - >      ,10000000 * (n % 10000000) + n / 10000000
      [more] - >    from generate_series(1,500000) as n
      [more] - > ;
      ERROR:  22003: integer out of range
      LOCATION:  int4mul, int.c:807
      Time: 45.037 ms
      


      Posted by Jeremy | July 18, 2022, 9:44 am
  3. Interesting question. My guess is your first expression is “n”, as there are exactly 499999 rows returned, and your second expression is based on randomisation, perhaps “(100000000*random())::bigint”, as 499180 rows are returned. With randomisation, the index might need to scan more pages in the 2nd case, and likely use a parallel table scan instead of an index scan.


    Posted by donghual@ | July 17, 2022, 5:14 am
  4. A quiz is always welcome.

    I guess between the first (fast) field and the second (slow) field there are one or more fields like text or bytea that make the second column be stored outside the table row. Because “count(*)” doesn’t use an index if it has to read all the table rows and thus all the data that is stored outside the table row.

    Nailed?


    Posted by Rama Lang | July 17, 2022, 7:33 am
  5. Or the second slow column is a text column that is toasted and has no index on it. A column with implicit cast.


    Posted by Rama Lang | July 17, 2022, 7:36 am
  6. So, Jeremy, what’s the quiz’s answer?


    Posted by Rama Lang | July 17, 2022, 2:46 pm
  7. First expression: n
    Second expression: case n % 120 when 1 then n / 120 else 500000 + n end
    This is, I think, similar to Matthis’s answer. Essentially, with the second expression, each value between 0 and 500,000 is placed onto its own page in the heap, whereas the first expression places the values contiguously in the heap, so they occupy far fewer pages.
    I chose a value of 120 by seeing how many rows “fit” into a page via inspecting CTID.
    I see c.5000 pages being required for the faster read, and c.500000 for the slower read.
    My tests locally show an order of magnitude (ish) difference in timings too, which I guess is what you would expect given that’s the case with the number of pages.
    40ms for the closely correlated one, 642ms for the other.
    I’ve got a fairly beefy system for testing. But still, maybe you are doing something else to see an even larger difference on EC2?


    Posted by Joe Wildish | July 18, 2022, 9:26 am
    • Also in my test I wasn’t clearing the filesystem cache so that may account for the difference.


      Posted by Joe Wildish | July 18, 2022, 9:29 am
      • I ran the test using your expressions on my test system where I generated the original transcript, including the step of clearing DB and filesystem caches. I got about 30 seconds per execution on the second numeric column. I didn’t give specific criteria in my challenge, but I think this counts. Nice!!

        pg-14.4 rw root@db1=# insert into test
        [more] - >    select 'data-XXXXXXXXXXXX'
        [more] - >      ,n
        [more] - >      ,case n % 120 when 1 then n / 120 else 500000 + n end
        [more] - >    from generate_series(1,100000000) as n
        [more] - > ;
        INSERT 0 100000000
        Time: 208370.887 ms (03:28.371)
        
        pg-14.4 rw root@db1=# select count(mydata) from test where mynumber1<500000;
         count
        --------
         499999
        (1 row)
        Time: 327.972 ms
        
        pg-14.4 rw root@db1=# select count(mydata) from test where mynumber2<500000;
         count
        --------
         500000
        (1 row)
        Time: 30008.500 ms (00:30.008)
        
        pg-14.4 rw root@db1=# select count(mydata) from test where mynumber1<500000;
         count
        --------
         499999
        (1 row)
        Time: 524.307 ms
        
        pg-14.4 rw root@db1=# select count(mydata) from test where mynumber2<500000;
         count
        --------
         500000
        (1 row)
        Time: 30146.506 ms (00:30.147)
        


        Posted by Jeremy | July 18, 2022, 10:13 am

Trackbacks/Pingbacks

  1. Pingback: Researching the Performance Puzzle | Ardent Performance Computing - July 31, 2022

