Are you in the Pacific Northwest?
- This Thursday Sep 4 at 6pm we have special guest Nikolay Samokhvalov visiting Seattle and speaking at the Seattle Postgres User Group! Nik is the founder of PostgresAI and co-host of the long-running weekly podcast postgres.fm – we are very lucky to have him as a guest and I hope you can come!
- Seattle Systems meetup needs meetup locations to host us. If you are a company in the Seattle area with space for an in-depth technical talk and 50-100 people, then please reach out to Sitesh (I can help get you in touch). Huge thank you to Arzhang and Momento for helping with the first two meetups!
Want to learn more about topics related to this blog? At 3:15pm on Thursday, Nov 13 at KubeCon Atlanta, I'll be speaking with Leonardo Cecchi about distributed systems theory applied to standard open source Postgres cluster reconfigurations.
Jepsen is a testing framework for distributed systems that verifies safety guarantees by subjecting clusters to faults (e.g., network partitions, crashes, and failovers) and checking for consistency violations. In this lab exercise, we will use Jepsen to illustrate data loss when synchronous replication is disabled and cluster reconfigurations involving Postgres failover occur.
Postgres synchronous replication ensures that transactions are only committed once changes are written to both the primary and a synchronous standby, which is crucial for preventing data loss during automated failovers by guaranteeing no acknowledged transaction is lost if the primary crashes.
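For orientation, here is a minimal sketch of the standard Postgres settings that sit underneath this guarantee. The standby names are illustrative, and in CloudNativePG the operator manages these values for you – you do not set them by hand in the lab:

```sql
-- Illustrative only: the core Postgres settings behind synchronous replication.
-- In CloudNativePG the operator manages these; do not set them manually in the lab.
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 ("pg-eu-2", "pg-eu-3")';
SELECT pg_reload_conf();
```

With synchronous_standby_names set, a COMMIT does not return to the client until at least one listed standby has confirmed the WAL was flushed.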
In addition to crashes, some flavor of synchronous replication is (and always has been) the only protection against data loss during network partitions or split brains with Postgres – across all Postgres HA frameworks. There was a little confusion recently around Patroni’s “failsafe” feature. Failsafe is a useful feature but it does not prevent data loss during network partitions – Patroni relies on postgres sync replication to prevent data loss during network partitions, similar to CloudNativePG.
This lab exercise will focus on crashes. The test doesn’t detect data loss on every run. In a 12-hour run with a 2-instance cluster on September 9 2025, there were 22 failures out of 93 iterations total. You may not see failure on the first attempt; be sure to try several times. (This paragraph was updated Sep 9.)
I think the easiest way to go through this lab exercise is with the CloudNativePG LAB: a ready-to-use, batteries-included, runs-anywhere Virtual Machine with the CNPG Playground and a few Lab Exercises. It can run directly on your laptop or it can run in the cloud. Under the hood, the CNPG LAB is just a robust post-install bootstrap script that transforms a clean Ubuntu 25.04 server into a fully functional virtual desktop lab environment. Using this lab, many people should be able to start from scratch and complete the exercises to see data loss with Jepsen in a few hours. You're welcome to learn Clojure, but it's not required for these exercises!
https://github.com/ardentperf/cnpg-playground/tree/main/lab
Note: Exercise 1 is actually the walkthrough of creating a LAB virtual machine. Start there. 🙂

Exercise 3 runs the Jepsen “append” workload against the pg-eu CloudNativePG cluster and induces rapid primary failures to stress the system.
Reading and Understanding Jepsen Test Results
A good in-depth explanation of the Jepsen workload is available from CMU's quarantine tech talk series, starting with the discussion of Atul Adya's work to define isolation levels in terms of dependency cycles.
For completing this lab exercise, we only need a simplified description of how the test works.
Jepsen creates a table and updates rows in that table. Essentially, the only thing it does is append strings to individual rows of its table.
The Jepsen test client keeps a log of every value it has observed, and it analyzes that history after the test completes. For our test, as we walk forward through time, if a chunk of a string suddenly disappears, that tells us we lost some data.
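To make that concrete, here is a hypothetical sketch of the shape of the workload. The table and column names are mine, not Jepsen's actual schema (the real client is written in Clojure):

```sql
-- Hypothetical sketch of an "append" workload; not Jepsen's actual schema.
CREATE TABLE IF NOT EXISTS append_test (k INT PRIMARY KEY, v TEXT);

-- Each transaction appends a unique value to one row's string...
INSERT INTO append_test (k, v) VALUES (15, '7')
ON CONFLICT (k) DO UPDATE SET v = append_test.v || ',7';

-- ...and reads rows back, logging every value it sees.
SELECT k, v FROM append_test WHERE k = 15;
```

If a later read of a row ever returns a string that is missing values an earlier committed read already contained, an acknowledged transaction has been lost.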
When the Jepsen test finishes, you will see something like this:
0 successes
0 unknown
0 crashed
1 failures
A result of "1 failures" means a failure (such as data loss) was detected. "1 successes" means no failures were observed. "crashed" or "unknown" means the run was inconclusive; re-run the test.
In this CNPG LAB exercise, full detailed Jepsen results are automatically uploaded to an object store bucket when the test completes. There’s a bookmark in Firefox, which automatically opens when you first connect to the desktop.
First, open the file latency-raw.png – this shows your overall timeline. You can get a general sense of how long failovers took from the client's perspective. Here's an example of what this file looks like. The vertical blue bars are periods of successful transactions and the blank spaces with sparse red squares are periods of unavailability:

For this lab exercise, the folder elle/incompatible-order has the data loss details. Each file represents the timeline/history for one specific row in the table (by primary key).

When you open one of these files, you will see the timeline for that row. On the left side is a list of timestamps. At each timestamp, the value in that row gets longer, one committed transaction at a time. Then you will hit a timestamp where the row suddenly and inexplicably gets shorter – a chunk of values in the middle of the string is missing – and from there it continues appending but never gets the missing data back.
Below, I've opened 15.html, which shows all reads of the row with primary key 15. Remember that in this test, Jepsen only ever appends values to the string. After scrolling to the second page, I can see that we lost about 3 seconds' worth of committed, durable data. The read at time 40.66 has lost all the updates between time 24.57 and time 27.42.


Postgres Synchronous Replication
Next, the lab exercise walks through enabling Postgres synchronous replication and repeating the test. Once synchronous replication is enabled, you should no longer see data loss when killing the primary pod.
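If you want to confirm from inside Postgres that the change took effect, a quick illustrative check from the primary looks like the following; the exact standby names and counts depend on your cluster:

```sql
-- Run on the primary after enabling synchronous replication.
SHOW synchronous_standby_names;

-- sync_state should report 'sync' or 'quorum' for at least one standby.
SELECT application_name, state, sync_state
  FROM pg_stat_replication;
```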
Here’s an example screenshot after a test with synchronous replication enabled:

Drilling down into the test results, we can see from latency-raw.png that our periods of unavailability were slightly longer. This makes sense because we don't want Postgres to accept writes if there are no healthy replicas.
Under the directory elle/sccs (“Strongly Connected Components” or SCCs), we can see a few graphs Jepsen has generated for dependency graph cycles that it detected during this test. The screenshot below shows two transactions:
- A transaction APPENDS the value 9 to record 71, before it READS record 185 (value 1)
- A transaction APPENDS the value 2 to record 185, before it READS record 71 (values 1-8)
Logically, this does not make sense: which transaction happened before the other? This kind of cycle anomaly is one of the differences between read-committed isolation and serializable isolation. Right now we have Postgres configured in read-committed mode, so Jepsen simply documents the cycles while considering the test successful.
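As an aside, you can reproduce this shape of cycle by hand with two psql sessions. The sketch below uses a hypothetical table t (not Jepsen's schema) at the default read-committed isolation level; under serializable isolation, Postgres would typically abort one of the two transactions with a serialization failure instead:

```sql
-- Hypothetical two-session interleaving (table "t" is illustrative, not Jepsen's).
-- Run the statements in this order, both sessions at READ COMMITTED.

-- session A
BEGIN;
UPDATE t SET v = v || ',9' WHERE k = 71;

-- session B
BEGIN;
UPDATE t SET v = v || ',2' WHERE k = 185;

-- session A: B has not committed, so A does not see the ',2' on record 185
SELECT v FROM t WHERE k = 185;

-- session B: A has not committed, so B does not see the ',9' on record 71
SELECT v FROM t WHERE k = 71;

-- finally, COMMIT in session A and then in session B; both succeed
```

Each transaction misses the other's append, which is exactly the kind of write-read cycle Jepsen draws in the SCC graphs.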

Takeaways
Most importantly, this lab illustrates the role that synchronous replication plays with Postgres (and CloudNativePG).
But I hope that providing this lab also encourages more people to start playing with Jepsen and CNPG! Core open source Postgres is decades old, and yet there are plenty of interesting places where improvements can still be made (like the Canceled Transaction Problem that Alexander Kukushkin illustrated in his POSETTE talk this year… where even WITH synchronous replication we can still lose client-committed data after a failover). There are also lots of opportunities to contribute directly to CloudNativePG for engineers who are interested in databases and distributed systems. Start digging into the source code and come chat on Slack (join the CNCF Slack Workspace)!
Frequently Asked Questions about the CloudNativePG LAB
https://github.com/ardentperf/cnpg-playground/tree/main/lab
What's the reasoning behind the hardware specs? After some experimentation, it seemed that running with 2 CPUs and 8 GB of memory could result in a system that was well over 50% utilized even before starting a monitoring stack or workload. At present, it seems like 4 CPUs and 16 GB of memory should be able to support a full CloudNativePG distributed topology for learning, including two full Kubernetes clusters with twelve nodes total, data replication between them, monitoring stacks on both, and a demo workload – all running on just your single personal machine or a single cloud instance.
If you are running this in a Virtual Machine on your Windows or Mac laptop, the Virtual Machine you create needs to match the recommended specs, and you will need to leave enough resources for everything else running on your laptop. When configuring the VM, if you are asked to set a CPU count, assign at least 4 CPUs to the VM. As with cloud environments, you are assigning virtual CPUs – not physical cores – to your virtual machine. VirtualBox, UTM (on Mac) and Hyper-V (on Windows) should all work for installing and running Ubuntu in a VM.
What's the vCPU/core rule of thumb? What's the reasoning behind it? Rule of thumb for "what is a CPU": in cloud environments, count vCPUs. On your own hardware, count physical cores – not SMT threads or operating system CPUs. Those of us who choose cloud environments will be using smaller instances rather than whole servers. While noisy neighbors do sometimes happen, my theory is that cloud providers generally don't run their physical hardware at high enough CPU utilization that SMT would have a noticeable adverse impact on individual small tenants. (Wild guess with zero inside info… someone should really test this and publish their findings.) I might be wrong about this – but at only 4 vCPUs, I'm hoping they will generally behave like full cores even if the underlying hardware has SMT or hyperthreading enabled.
Why a virtual desktop instead of just a server? Monitoring and dashboarding systems like Grafana are essential day-two operations for any database. While it’s possible to forward ports and use a browser elsewhere, having a desktop environment simplifies things and provides a more consistent experience. We can more easily share demos and screenshots and experiments when we minimize the differences in how we’re doing things. It minimizes variation and makes the lab more accessible to beginners. It also makes it easier to build training curriculums on this foundation, which can be used in formal classes.
Why a virtual desktop instead of Ubuntu’s official Desktop Edition? Ubuntu’s desktop edition is geared toward specific hardware. There’s no easy way to convert a server installation into a desktop installation via package managers because much of the desktop setup code lives only in Ubuntu’s installer. By standardizing on a virtual desktop via RDP (even when running in a local VM), we can provide a single consistent and universal experience.
Why the Cinnamon desktop environment instead of something like GNOME or KDE? Through a lot of trial and error, we learned that with multiple installation methods on both 24.04 and 25.04, GNOME has problems interoperating with xRDP. We experienced crashes and unresponsiveness at startup. XFCE and KDE are stable, but neither supports color emojis in the terminal, which have become somewhat common in command line tooling recently. Cinnamon was the only environment that both supported color emojis and also seemed to work reliably with xRDP.
Why Ubuntu version 25.04 rather than an LTS release? Because it ships a new enough version of the Nix package manager in the distro repositories to work with the CNPG playground's Nix devshell. This could have gone either way – Docker seemed to work on 24.04, and we could have installed bleeding-edge versions of Nix directly from upstream – but for now we decided to stick with Ubuntu-packaged versions of Nix in favor of more stability in the lab environment. We will likely refresh the lab environment for Ubuntu 26.04 after it is released.
Why would someone need proxies and custom CAs? There are a wide variety of ways internet connectivity and traffic are managed in different places. For example, the official Docker documentation includes a guide for using Docker in corporate environments where network traffic is intercepted and monitored with HTTPS proxies like Zscaler. It's certainly possible to run cnpg-playground in these environments too, and these Ubuntu automation scripts will handle it if needed.


