
Postgres Indexes, Partitioning and LWLock:LockManager Scalability

I have decided that – in Postgres circles – I shall henceforth refer to 2023 as THE YEAR OF THE LOCK MANAGER’S REVENGE.

Let me explain.

Let’s start with Bruce Momjian. He has an in-depth presentation about locking in PostgreSQL called “Unlocking the Postgres Lock Manager”. I see online that he’s been giving this talk at least as far back as the Postgres Open 2011 conference, and the slide deck says it was last updated in Feb 2023. There is a video recording available (Bruce’s site above has the YouTube link).

Bruce’s talk is not about the wait event. The wait event represents an in-memory “lightweight lock” – split into 16 partitions, aka the LockManager tranche – that protects the Lock Manager’s shared memory structures during concurrent access. Bruce’s talk is important because it describes what the Lock Manager does… and that’s the thing this lightweight lock is protecting. Without it, concurrent memory access would cause corruption. So it gives a beginning sense of what factors might lead to contention on this in-memory lightweight lock. The Lock Manager itself is in charge of managing heavyweight locks – the SQL/application-level explicit locks like table locks, row locks, etc. The word “lock” means a bunch of different things, which is really confusing in this context!
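If you want to see both layers side by side, here’s a minimal sketch of my own (not from Bruce’s talk) that works on recent versions; note that the LWLock wait event is spelled lock_manager before v13 and LockManager from v13 onward.

  -- Backends currently waiting on the Lock Manager's lightweight lock
  SELECT pid, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE wait_event_type = 'LWLock'
    AND wait_event IN ('LockManager', 'lock_manager');

  -- The heavyweight locks that the Lock Manager is tracking right now
  SELECT pid, locktype, relation::regclass, mode, granted, fastpath
  FROM pg_locks
  ORDER BY pid;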

In 2017, just before the Postgres version of Aurora was launched, I joined RDS and plunged myself into the PostgreSQL world. Having been interested in Oracle performance for many years (but not an expert by any means), one of the first areas I dove into was PostgreSQL Wait Events. This was brand new and hot off the press at that time.

Meanwhile, Kyle Hailey had been working with the RDS Performance Insights team and brought years of industry experience building DB performance tools. As a result, RDS PostgreSQL and Aurora PostgreSQL provided exceptional visibility into database performance very early on. Aurora even backported the v10 Postgres wait event code into its v9.6-compatible launch product, so that it could have the Average Active Sessions dashboard with query drilldown available on launch day. Consequently, many engineers at AWS started using wait events for Postgres performance issues early on.

The year 2017 was – I think – perhaps one of the most important for PostgreSQL because of the v10 release. Of course this was when they changed from three-part to two-part version numbers and broke every DBA script in existence. 😂 (But seriously, I’m happy with that change overall.) I already mentioned that v10 was the first version which fully implemented wait events. And another major headline feature in v10 was declarative partitioning. While there was previously a way to do partitioning in Postgres with table inheritance, I personally believe that adding support for “CREATE TABLE ... PARTITION BY ...” likely opened the floodgates and – combined with increasing general adoption – it meant that Postgres was about to get a lot more partitioning usage very quickly.

In May 2020 there was a Twitter thread with Fujii Masao and Amit Langote about the LWLock:LockManager wait event. Both of them had seen and knew about the lightweight lock contention that can happen at very high scale, especially in the context of partitioning. Amit mentioned that David Rowley had started a thread on hackers in Dec 2018 related to this issue, so it seems David had also seen this. David was looking at enhancing the query executor to reduce locking in some situations. By this time I had seen the problem myself; in fact I included a slide in my 2019 wait events talk with a screenshot of a system encountering it. My first reflex had been to look at NUM_LOCK_PARTITIONS as a possible mitigation.

RDS Documentation

Now for a short digression.

In 2018, a bunch of amazing and tenured Postgres people who made up “OpenSCG” became part of AWS. In January 2020, David Rader (one of those amazing people) became the manager of the team I was on. Among other things, that team was involved in some support-related investigations. As a result, engineers (who already knew a lot) learned even more about using wait events to diagnose performance problems. Dave noticed this. On January 6, 2021 he added a task for the team: “Create detailed write-ups on common PostgreSQL wait events that can be used in the documentation. Explain what the wait event means, known causes, troubleshooting steps.”

I’m pretty sure everybody on the team agreed this was a great idea. Everybody also had a hundred demands on their time, and we all had to prioritize carefully. In my opinion, writing a documentation page about a wait event sounds a bit like writing and publishing a great technical book with an animal on the cover… I have the utmost respect (and I haven’t ruled out authoring a book someday)… but I’m pretty sure nobody writes technical books “for the money” or “because it’s easy and doesn’t take much time”.

I think Dave deserves so much credit for persistence on this. He knew it could benefit Postgres users. Dave spent basically two years chasing engineers about this doc idea, amongst everything else he was doing. Gently, strongly, passively, actively, jokingly, seriously… always friendly and respectful. We made a spreadsheet and distributed specific wait events amongst different people on the team, and he remembered to ask at every weekly team meeting whether anyone had made progress… those pages are hard to write!

According to GitHub, the new section of the RDS PostgreSQL documentation was published on Dec 2, 2022. I contributed some. A lot of people contributed in various ways. This included engineers (like Vinny Schmidt, Sean Massey, Sami Imseih, Mohamed Ali and Jim Nasby) and doc writers and – importantly – managers like Dave.

This brings us to the documentation page for the LWLock:LockManager wait event. Some time in 2021 Luiz Verona took on the task of assembling the bulk of the draft. Luiz dug through Postgres docs and code and comments, read about fast path locking, and created a repro based on a single empty table with hundreds or thousands of indexes (no partitioning). And as always, we build on the work of others: he also spent time reading a lot of prior work – PostgreSQL mailing list threads and public blogs, old customer support cases, and internal documents that other engineers had cobbled together over time, in various states of disorganization.

One example was Kyle Hailey’s doc simply titled “LWLock:lock_manager”, which he had started in January 2020. He had collected and dumped together all sorts of references, quotes, and bits of information from other engineers over time. Some of the people I saw cited include Robins Tharakan, Emmanuel Devillard, Jignesh Shah, Scott Mead, Jim Mlodgenski and Dave Rader. In fact Kyle’s doc linked to another doc that Robins wrote in February 2020 with a LWLock:LockManager contention reproduction based on joining multiple empty tables. Luiz reviewed all of this – and then after drafting a doc page he got further review from Bertrand Drouvot and Shawn McCoy (I think Shawn added the query to check for fast path locks).

The doc page that has been published is relatively short and it’s packed with valuable insights drawn from many people.

GitLab

Meanwhile, in a galaxy far far away…

It was early 2023 and GitLab needed to upgrade the hardware under their databases. They run on GCP compute instances and wanted to upgrade from n1-generation to n2-generation hardware. To perform this upgrade, they added new read replicas to their Patroni cluster on n2-generation hardware and began to shift production traffic while carefully monitoring. Once they confirmed that everything looked good, they gave the green light to shift the remaining traffic and decommission the old n1-generation read replicas.

On April 5 they executed the change to remove the n1-generation hardware – and immediately experienced a severe production performance degradation. The change was rolled back; the total impact lasted 2 hours and 15 minutes. The GitLab team opened an investigation to study what led to this performance degradation and, after quite a bit of work, they determined it to be centered around PostgreSQL’s “LockManager” lightweight lock. I’m sure there were more people involved, but I saw contributions from Rafael Henchen, Alexander Sosna, Andrew Newdigate, and Matt Smiley. Matt used bpf-based tooling to gather extensive data during a number of experiments. (And I think he may have even done a little careful data collection on their production systems?) Matt identified the factor of slow vs fastpath locking, which was also discussed on the RDS doc page. The fastpath column has been in the pg_locks view all along, even though Bruce’s old presentation didn’t mention it! As a result of the GitLab investigation, Nikolay Samokhvalov (who has consistently continued to work on this with the GitLab team) started a discussion about FP_LOCK_SLOTS_PER_BACKEND on the hackers list.
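As a quick way to see the fast-path mechanism for yourself, here’s a hedged sketch of my own (not necessarily the query from the RDS doc or the GitLab issue). Each backend has only FP_LOCK_SLOTS_PER_BACKEND (16) fast-path slots, so – roughly speaking, and only for the eligible weak lock modes – once a transaction holds relation locks on more than 16 relations, the overflow goes through the shared lock manager.

  -- Fast-path vs. slow-path (shared lock table) relation locks per backend
  SELECT pid, fastpath, count(*) AS lock_count
  FROM pg_locks
  WHERE locktype = 'relation'
  GROUP BY pid, fastpath
  ORDER BY pid, fastpath;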

BTW, Lukas Fittl also covered the GitLab incident in depth on the Nov 16 episode of 5mins of Postgres – worth a listen if you’re interested in an actually-7-minute-overview. 😉

Image: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301

GitLab has seen a few more incidents, none as severe as the first one. Rafael created an epic in October to collect information about the multiple incidents and about the ongoing work toward improved reliability. From my read, the incidents on July 5 and September 20 breached alarm/notification thresholds but likely weren’t severe enough to be noticed by many users.

One important note: I might be mistaken, but from my read, GitLab’s contention might be driven by index count more than partitioning. For example, their projects table contains 57 indexes, so a transaction needs 58 locks to query it. And that’s without any joins, and assuming a toast table and index aren’t adding more locking requirements. From a workload perspective, the query rate per active replica peaked at 60K queries/second on April 5th.
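To get a feel for the same thing in your own schema, a rough sketch like this (mine, not from the GitLab issue) lists the most heavily indexed tables – remember that even a simple single-table query locks the table plus every one of its indexes:

  -- Tables with the most indexes (each index adds one more relation lock per query)
  SELECT schemaname, relname, count(*) AS index_count
  FROM pg_stat_user_indexes
  GROUP BY schemaname, relname
  ORDER BY index_count DESC
  LIMIT 20;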

Matt Smiley’s investigation summary is gold. The most important take-aways in my opinion:

  • Adding capacity is the simplest remedy (i.e. read replicas)
  • Cache at app tier when practical to reduce query rate
  • Search for indexes to drop from frequently queried tables (see the sketch just after this list)
  • Ensure partitioned tables are queried in a way that is compatible with partition pruning
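For the third item, a hedged starting point (verify candidates carefully – an index with zero scans might still be backing a constraint, or be needed on a replica) is the standard unused-index check:

  -- Indexes never scanned since statistics were last reset
  SELECT schemaname, relname, indexrelname,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY pg_relation_size(indexrelid) DESC;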

Update 4-Mar-2024: I forgot to mention Nikolay Samokhvalov’s very good November 11 Postgres Marathon writeup about learnings from the GitLab incident.

Importantly, Nikolay pointed out that in the case of contention driven by over-indexing, lwlock pressure might also be lessened by using prepared statements to reduce planning. However, if pressure is driven by partitioning then it’s more complicated. Custom plans need to be re-planned on each execution but can eliminate partitions at plan time. Generic plans may avoid re-planning and see some benefit, but the cached plan references every partition – so even when run-time pruning avoids scanning most of them, the executor still has to lock all of them.
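A small sketch of the difference, using the partitioned pgbench_accounts table from Fujii’s repro later in this post (so the names below assume you’ve run that initialization step):

  -- Prepared statement against a table with 100 partitions
  PREPARE get_balance (int) AS
    SELECT abalance FROM pgbench_accounts WHERE aid = $1;

  SET plan_cache_mode = force_custom_plan;
  EXPLAIN (COSTS OFF) EXECUTE get_balance(1);   -- re-planned each call, pruned to one partition

  SET plan_cache_mode = force_generic_plan;
  EXPLAIN (COSTS OFF) EXECUTE get_balance(1);   -- plan reused, but it references all 100 partitions

(The repro section below shows what that difference does to LWLock:LockManager under high concurrency.)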

Also, it’s worth mentioning a thread in the GitLab discussion which developed this year after Adrien Nayrat pointed out Andres Freund’s 2022 lwlock optimization patch (finally committed on Jan 18 this year by Michael Paquier with Bharath Rupireddy listed as the reviewer… and so many heavy-hitting PG contributors on the hackers thread I can’t begin to list them). Vitaliy Kukharik shared a number of test results to see how this patch impacted the over-indexed case of LWLock:LockManager contention, with Dmitry Fomin confirming that he’d seen similar results in his own tests. So many different people working in this area! I missed these important contributions when I first published, and I hope you’ll leave a comment if you know more people I’ve missed.

Image: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2353

Midjourney

But GitLab was not the only company who talked publicly about running into LWLock:LockManager in 2023.

By this time Kyle Hailey (yes, the same one) was working at Midjourney, and the load on their Postgres database was growing while interest in AI boomed. Yes – I’m talking to all you geeks who fiddle with Midjourney and make cool art on the weekend – someone has to be hired to operate and manage a Postgres database so you can play! Specifically, they had a database with outbound logical replication and an 8TB table handling 8 to 10 thousand queries per second, which was quickly becoming unmanageable. Adding a single index (concurrently) took 12 hours. Dropping an unused index required a lock that essentially stopped all queries. Kyle decided to implement PostgreSQL partitioning, configured with time-based ranges and one partition per day.
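For readers who haven’t set this up before, here’s a hypothetical sketch of the kind of layout described – not Midjourney’s actual DDL, and all names are made up:

  -- A large table range-partitioned by time, one partition per day
  CREATE TABLE orders (
      id          bigint      NOT NULL,
      created_at  timestamptz NOT NULL,
      status      text        NOT NULL,
      payload     jsonb
  ) PARTITION BY RANGE (created_at);

  CREATE TABLE orders_2023_04_01 PARTITION OF orders
      FOR VALUES FROM ('2023-04-01') TO ('2023-04-02');
  CREATE TABLE orders_2023_04_02 PARTITION OF orders
      FOR VALUES FROM ('2023-04-02') TO ('2023-04-03');
  -- ...and so on, typically pre-created by a scheduled job

Every partition (and every partition index) that a query has to lock adds to the per-query lock count – which is exactly where this story is headed.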

Like the team at GitLab, Kyle carefully planned this maintenance operation and did significant testing. Unsurprisingly, with all the attention on AI these days, if Midjourney is down for more than 15 minutes then they could get a phone call from Forbes asking what’s happening. Migrating to partitioning (especially online) is not a simple one-command operation in PostgreSQL. (Chelsea Dole did a great talk on this topic recently at SEAPUG!) Kyle had to do quite a bit of research and work to plan this.

Saturday April 1, 2023 was the day Kyle ran the migration.

The migration was flawless. Seconds: that’s all it took to run the conversion script. A minute or two to bounce the database. Starting and stopping the application was what took the most time – 20 minutes down, 20 minutes up – end to end the process took less than an hour.

Everything indicated a smooth, successful migration.

The first problem didn’t come until a few days later. A minor mistake in the script had left a constraint in place which triggered errors when a future date range was no longer in the future… easy quick fix, but 30 minutes of downtime. At this point they got the call from Forbes.

📞 “All good! Don’t worry!” 📞 … Kyle’s blog post walking through that migration is worth reading.

But the BIG problem didn’t come until four weeks later. This single Postgres database is handling 8 to 10 thousand queries per second, 10 million new records daily, each receiving an average of three updates. The main orders table, serving as a queue for new work requests and retaining historical records, is now partitioned by day.

Saturday May 6, 2023 was the fateful day of great suffering and tragedy. The primary function of this database – the work queue operation – started to stall. The performance view (Average Active Sessions) immediately suggested that the culprit was contention on LWLock:LockManager.

Kyle’s blog says the issue started Saturday morning. He included two pictures, and from those pictures I’m guessing that some level of errors started around midnight and proceeded for many hours – but the most severe incident seemed to last only about 5 minutes, with hundreds of database connections stuck on LWLock:LockManager and very severe upstream application impact. In Kyle’s blog he mentions later that in this first incident the database lightweight lock contention subsided on its own (similar to the first LWLock:LockManager incident I ever saw).


Images: https://www.kylehailey.com/post/postgres-partition-pains-lockmanager-waits

In Kyle’s blog post, he first mentions finding the RDS Documentation which had just been published in December. (That is some good Karma, seeing how Luiz had referenced Kyle’s internal doc while drafting it!) Secondly, Kyle found the GitLab issue which had been published literally weeks before. Between reading these, some hackers mailing list threads, and a few other things on the internet, Kyle was able to determine that one significant contributor was the number of partitions, and he identified a mitigation strategy of doing an expedited migration from daily to weekly partitions. He started work on migrating the first 7 daily partitions to a single weekly partition. It took 2 hours to copy the first daily partition. After 14 hours spent migrating 7 days of data into the first weekly partition, he attached it to the main table, made sure things looked good, and finally drove home from the office.

The frantic phone call came before he even got home. When PostgreSQL logical replication and partitioning are used together, you have to be careful, and things often require more manual steps. Kyle had created the new partition using a SQL command that picked up properties from the parent table. It turns out this didn’t include the replication settings. After the new weekly partition was attached to the main table, the database log soon started to emit replication errors while application orders started to flat-out fail. Kyle got home as quickly as he could and detached the new partition so that replication could resume. Now a week of data was missing from the main table, but the application errors subsided. Second try: re-create the new weekly partition with slightly different SQL syntax to pick up all properties from a child partition. Another 14 hours building. I’ve been in situations like this before. It’s unclear to me whether Kyle slept at this point.
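I don’t know exactly which property bit him, so treat this as a hedged sketch of the general pitfall rather than Kyle’s actual commands (all names are hypothetical): CREATE TABLE ... (LIKE ... INCLUDING ALL) copies indexes, defaults and constraints, but not everything – so before attaching, it’s worth comparing things like replica identity and publication membership against an existing, known-good child partition.

  -- Build the new weekly table from an existing daily child rather than the parent
  CREATE TABLE orders_2023_w18 (LIKE orders_2023_05_01 INCLUDING ALL);

  -- Compare replica identity ('d'=default, 'f'=full, 'i'=index, 'n'=nothing),
  -- which LIKE does not copy
  SELECT relname, relreplident
  FROM pg_class
  WHERE relname IN ('orders_2023_05_01', 'orders_2023_w18');

  -- And check which tables logical replication will actually publish
  SELECT * FROM pg_publication_tables WHERE tablename LIKE 'orders%';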

Finally, after what sounds like 30-40 hours of straight work (and ruined weekend plans), they had performed an initial reduction in partition count. They judged that the system had been pulled back from the edge, and Kyle finally took a break.

Wednesday May 10, 2023 – everyone was taken by surprise when they hit the LWLock:LockManager wall again – they had thought there was more headroom than this. This time, Kyle was armed with more information and was able to mitigate the problem by detaching partitions. But detaching partitions could have turned a brown-out into a black-out, because the detach operation gets in line for a lock on the entire table, so Kyle wrote a script (available on his blog) to proactively kill any sessions holding locks that the detach operation was waiting for. This strategy successfully and quickly mitigated the May 10 incident after only one hour of critically elevated error rates.
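Kyle’s real script is on his blog; as a rough illustration of the idea only – 12345 below is a placeholder for the pid of the session running the ALTER TABLE ... DETACH PARTITION – the core of it looks something like this:

  -- Terminate whatever is ahead of the DETACH in the lock queue
  SELECT blocker_pid, pg_terminate_backend(blocker_pid)
  FROM unnest(pg_blocking_pids(12345)) AS blocker_pid;

In a real script you’d loop this, filter it so you only kill the sessions you expect (such as short application queries), and log what was killed.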

Image: https://www.kylehailey.com/post/postgres-partition-pains-lockmanager-waits

After this, Kyle finished migrating all remaining daily partitions to weekly partitions, significantly decreasing the locking overhead.

If you’re a fan of AI and Midjourney, make sure you buy Kyle a beverage of his choice next time you see him!

And – with both Midjourney and GitLab – I know there were many other talented engineers involved in these incidents too. Thank you: to all of you who responded and helped get these services back to normal operation quickly and safely.

Reproductions

This past January, Fujii Masao posted a great repro on Twitter for LWLock:LockManager contention. It’s very easy if you want to test it and play around with this wait event:

1. pgbench -i -s 100 --partitions=100
2. psql -c "alter user $PGUSER set plan_cache_mode=force_generic_plan"
3. pgbench -c 80 -S -M prepared -T 60

Fujii’s reproduction relies on forcing generic plans with prepared statements. Note that you have to override plan_cache_mode to a non-default setting, because otherwise the plan cache is smart enough to stick with custom plans for good performance. I ran this test on an r6i.16xlarge (Xeon Ice Lake, 1 socket/processor, 32 cores, 64 threads) and I did see high LWLock:LockManager contention (70% of 64 AAS) while processing 75,000 tps and using 16% total cpu. Increasing client count did not increase tps; this was around the peak tps.

Running the same test without forcing generic plans, I peaked at 245,000 tps and 37% total cpu, with 512 clients and 8 jobs (-c 512 -j 8). Increasing the client count resulted in cpu continuing to elevate and tps on a slight downward slope.

Running the same test without forcing generic plans at a fixed rate of 75,000 tps (-j 8 -c 512 -R 75000) resulted in 10% total cpu and 0.30ms client latency. The original test with generic plans was processing the same number of transactions per second with 1.05ms client latency and 16% total cpu. As you can see from the screenshot, I used RDS Postgres for this test. Since pgbench was connecting from a separate EC2 instance, the client latency includes network time.


We can also tweak Fujii’s reproduction to use default settings, if we use a query that doesn’t include the partition key (pgbench partitions pgbench_accounts by aid, so a filter on bid can’t be pruned):

1. pgbench -i -s 100 --partitions=100
2. echo "select abalance from pgbench_accounts where bid=1 limit 1;" >bquery.sql
3. pgbench -n -c 320 -T 60 -f bquery.sql

And of course it’s also possible to reproduce this with indexes and without partitions.
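For example, a quick sketch along the lines of the index-based repros mentioned earlier (Luiz’s and Robins’) – a single table with far more indexes than a backend has fast-path slots, hammered by concurrent single-row selects:

  -- One table with 100 indexes (well past the 16 fast-path lock slots per backend)
  CREATE TABLE overindexed (id int PRIMARY KEY, val int);
  DO $$
  BEGIN
    FOR i IN 1..100 LOOP
      EXECUTE format('CREATE INDEX overindexed_idx_%s ON overindexed (val)', i);
    END LOOP;
  END $$;

  -- Then from the shell, hit it with many concurrent clients:
  --   echo "SELECT * FROM overindexed WHERE id = 1;" > iquery.sql
  --   pgbench -n -c 320 -T 60 -f iquery.sql

On a machine with enough cores and enough concurrent clients, LWLock:LockManager should start showing up in the wait events – even though the table is completely empty.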

Takeaways

First of all, let’s look at the numbers. People are successfully operating PostgreSQL on data sets of terabytes and transaction/query rates of tens of thousands per second. If your database is gigabytes and your transaction rate is hundreds per second, you probably don’t need to worry about LWLock:LockManager contention. Postgres adoption is still surging and it has now had declarative partitioning for over 5 years. Don’t let a few incidents fool you: this lightweight lock is working very well on the whole – a large number of people are using and scaling Postgres with success.

Second, of course, using the cloud doesn’t make every problem go away. Once you begin to operate with a data set and workload at large scale, you’re eventually going to need to learn enough about your platform to understand how your application and schema and workload interact with it. Regardless of whether their title is Software Engineer, DevOps Engineer, Site Reliability Engineer, Database Engineer, Database Administrator, Programmer, SysAdmin or Architect: someone in your employment – not the vendor’s employment – still needs to do this work.

Third, production outages – especially performance induced brownouts – are very stressful for everyone involved. I’ve been doing this long enough that I’ve had my share of long, grueling days. (Including in my Oracle days.) Practice Blameless Post-Mortems and get some sleep and focus on moving forward to become more robust and resilient, both technologically and organizationally.

Fourth, is “NoSQL” the answer any time you hit a scalability problem in Postgres? Certainly not at first – most likely you need to first read the happiness hints, grab the low hanging fruit, and just clean up your app to follow basic best practices. But if you’re one of the few organizations that truly runs Postgres at very high scale? Then maybe. Or maybe not. Maybe you have 200 unrelated schemas in a single database and you can move them to multiple databases without too much effort. If a single schema outgrows partitioning and needs to shard across multiple Postgres databases, that will require some application changes. It’s a reasonable time to consider whether it makes sense to pull a table or a module out for a simpler overall architecture, and whether a different kind of data store makes sense for those modules. It seems to me that the tradeoff with many NoSQL data stores is that you get a little less flexibility on things like level of SQL syntax support, joins, schema, transactions, isolation, integrity constraints, client driver flexibility & choice, 3rd-party management tool compatibility & availability, and COTS support – in exchange for more promises about behavior? Each additional system means spending an innovation token to learn all the quirks, like how it differently handles upgrades and security patching and backups and HA and pricing and performance and paid support. The complexity of your architecture will grow as your business grows… the art is in making good choices, at good times.


As I wrap this up, I would be remiss if I didn’t mention the many specific happiness hints that are directly related to aspects of this article:

  • Use Connection Pooling. (Centralized pooling eases use of read replicas, and all pooling helps reduce overall load on the database server.)
  • Limit or avoid dynamic pool growth when possible. (Unbounded max pool size would cause a connection storm in all of these scenarios, making the problem worse.)
  • Measure relation size in GB, not TB. (Nobody wants those 12 hour index builds.)
  • Measure indexes per table in single digits, not double digits. (If possible… often you need the indexes but each index does mean a little more locking is needed.)
  • To scale workloads, shard across instances or carefully partition tables. (Note my VERY INTENTIONAL use of the word “carefully” 😀)
  • Use Active Session Monitoring. (In a crisis, it gets you to the problem faster than anything else!)
  • Save key statistics including unused indexes. (Best way to find them!)
  • Alarm on Average Active Sessions. (In a crisis, you’ll be the first to know something is wrong.)

PostgreSQL is a great database, and thanks to contributors of many kinds – from large companies to individuals in countries all around the world – it keeps getting better. Ideas for improving LockManager scalability are being discussed: for example, having the executor lock fewer partitions, increasing the number of lock manager partitions, or increasing the number of fast-path slots. I think there’s plenty of room for improvement before we’re limited by the laws of physics, and I’ll be continuing to watch this area.

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider
