
KubeCon 2025: Bookmarks on Memory and Postgres

Just got home from KubeCon.

One of my big goals for the trip was to make progress in a few areas of Postgres and Kubernetes – primarily around allowing more flexible use of the Linux page cache and avoiding OOM kills with less hardware overprovisioning. When I look at Postgres on Kubernetes, I think the current deployment models – which generally use the Guaranteed QoS class – leave idle resources (both memory and CPU) on the table.
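
As a quick refresher for anyone following along: a pod's QoS class falls out of its requests and limits, and setting memory request equal to limit – the usual pattern for Postgres operators – lands a pod in Guaranteed. Here's a minimal sketch of that classification as I understand it (simplified and illustrative only; the real kubelet logic is more involved):

    # Simplified sketch of how a pod's QoS class is derived from its resources.
    # Illustrative only; not the actual kubelet code.

    def qos_class(containers):
        """containers: list of dicts like {"requests": {...}, "limits": {...}}"""
        requests_set = any(c.get("requests") for c in containers)
        limits_set = any(c.get("limits") for c in containers)
        if not requests_set and not limits_set:
            return "BestEffort"
        if all(c.get("limits") and c.get("requests") == c.get("limits") for c in containers):
            return "Guaranteed"
        return "Burstable"

    # The typical Postgres pod today: requests == limits, so it is Guaranteed.
    pg = [{"requests": {"cpu": "2", "memory": "8Gi"},
           "limits":   {"cpu": "2", "memory": "8Gi"}}]
    print(qos_class(pg))  # Guaranteed

With requests equal to limits there is no headroom for a pod to burst into otherwise-idle node memory.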

Ultimately this is about cost savings: I think we can run more databases on less hardware without compromising the availability or reliability of our database services.

The trip was a success, because I came home with lots of reading material and homework!

Putting a few bookmarks here, mostly for myself to come back to later:

I still have a lot of catching up to do. I sketched out the diagram below, but please take it with a large grain of salt – both this aspect of Kubernetes and Linux memory management are complex:

I tried to summarize some thoughts in a comment on the long-running GitHub issue, but this might be wrong – it’s just what I’ve managed to piece together so far.


My “user story” is that (1) I’d like a higher limit and more memory over-commit for page cache specifically – letting Linux use available/unused memory as needed for page cache – and (2) I’d like a lower request, so scheduling tracks actual anonymous memory needs more closely. I’m running Postgres. In the current state, I have to simultaneously set an artificially low limit on per-pod page cache (to avoid eviction) and an artificially high request for per-pod anonymous memory (to avoid OOM kills, by getting a favorable oom_score_adj). I’d like individual pods to be able to burst anonymous memory usage (eg. an unexpected SQL query that hogs memory) by stealing from the page cache of other pods that are beyond their request – avoiding OOM kills. The Linux kernel can do this; I think it should be possible with the right cgroup settings?
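
To make the oom_score_adj point concrete: as I understand it, the kubelet derives a Burstable container's oom_score_adj from its memory request relative to node capacity, which is why inflating the request is the main lever for surviving an OOM event. A rough sketch (the exact constants and clamping in the kubelet may differ):

    # Rough approximation of the kubelet's OOM score adjustment for Burstable pods.
    # Larger memory requests produce a lower oom_score_adj, making the OOM killer
    # less likely to pick the process. Constants are approximate, not authoritative.

    def burstable_oom_score_adj(memory_request_bytes: int, node_capacity_bytes: int) -> int:
        adj = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
        return max(2, min(adj, 999))  # Guaranteed pods get ~-997, BestEffort gets 1000

    node = 64 * 2**30  # 64 GiB node
    print(burstable_oom_score_adj(8 * 2**30, node))   # request 8Gi  -> 875
    print(burstable_oom_score_adj(32 * 2**30, node))  # request 32Gi -> 500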

It seems like the new Memory QoS feature might assign a static, calculated value to memory.high – but for page cache usage, I wonder whether we actually want Kubernetes to eventually adjust memory.high dynamically, as low as the request, in an attempt to reclaim node-level resources – before evicting end-user pods – once the memory.available eviction signal has crossed the threshold?
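
For reference, my reading of the Memory QoS proposal (KEP-2570) is that memory.high is computed once from the request, the limit (or node allocatable), and a throttling factor – roughly as sketched below – rather than being adjusted in response to node memory pressure. Treat the details here as my assumption; the KEP is the authoritative source.

    # Hedged sketch of how (as I understand it) the Memory QoS feature computes a
    # static memory.high for a container. It is not adjusted later under node pressure.

    PAGE_SIZE = 4096

    def memory_high(request_bytes: int, limit_or_allocatable_bytes: int,
                    throttling_factor: float = 0.9) -> int:  # default factor is approximate
        high = request_bytes + throttling_factor * (limit_or_allocatable_bytes - request_bytes)
        return int(high // PAGE_SIZE) * PAGE_SIZE  # rounded down to a page boundary

    # Example: request 4Gi, limit 16Gi -> memory.high of roughly 14.8Gi, fixed at start.
    print(memory_high(4 * 2**30, 16 * 2**30))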

Anyway, it’s also worth pointing out that these Postgres problems are likely accentuated by higher concentrations of Postgres pods per node; spreading databases across large multi-tenant clusters likely mitigates things a bit.

Edit 11/29: Alexey Demidov replied on the GitHub issue and pointed out a problem: the Linux kernel throttles the CPU of processes when memory.high is used, so this probably makes my idea above ineffective.
