a curated list of database news from authoritative sources

December 30, 2025

Performance for RocksDB 9.8 through 10.10 on 8-core and 48-core servers

This post has results for RocksDB performance using db_bench on 8-core and 48-core servers. I previously shared results for RocksDB performance using gcc and clang, and then for RocksDB on a small Arm server.

tl;dr

  • RocksDB is boring: there are few performance regressions.
  • There was a regression in write-heavy workloads with RocksDB 10.6.2. See bug 13996 for details. That has been fixed.
  • I will repeat these tests in a few weeks.

Software

I used RocksDB versions 9.8 through 10.10.

I compiled each version with clang version 18.3.1 with link-time optimization (LTO) enabled. The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

Hardware

I used servers with 8 and 48 cores, both running Ubuntu 22.04:

  • 8-core
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM.
    • storage is one NVMe SSD with discard enabled and ext4
    • benchmarks are run with 1 client, 20M KV pairs for byrx and 400M KV pairs for iobuf and iodir
  • 48-core
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM
    • storage is 2 SSDs with RAID 1 (3.8T each) and ext4.
    • benchmarks are run with 36 clients, 200M KV pairs for byrx and 2B KV pairs for iobuf and iodir

Benchmark

Overviews on how I use db_bench are here and here.

Most benchmark steps were run for 1800 seconds and all used the LRU block cache. I try to use the Hyperclock block cache on large servers but forgot to this time.

Tests were run for three workloads:

  • byrx - database cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB used buffered IO
  • iodir - database is larger than RAM and RocksDB used O_DIRECT

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys and wait for compaction to stop at test end
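
As a rough illustration of how two of these steps map to db_bench, the sketch below shows standalone invocations for fillseq and overwrite. This is only a sketch under assumptions: the real runs use my benchmark wrapper scripts, and the database path, key counts and tuning flags shown here are illustrative rather than the exact ones used for these results.

# load the database in key order with 1 thread (byrx sizing on the 8-core server)
./db_bench --benchmarks=fillseq --num=20000000 --threads=1 --db=/path/to/db

# overwrite random keys for a fixed duration, reusing the loaded database
./db_bench --benchmarks=overwrite --use_existing_db=1 --num=20000000 \
    --threads=1 --duration=1800 --db=/path/to/db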

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 9.8)

The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than RocksDB 9.8. When it is < 1.0 then there might be a performance regression or there might just be noise.

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: cached database (byrx)

From 1 client on the 8-core server

  • Results are stable except for the overwrite test, where there might be a regression. After repeating that test two more times I think it is noise: the base case (the result from 9.8) was an outlier. I will revisit this.

From 36 clients on the 48-core server

  • Results are stable

Results: IO-bound with buffered IO (iobuf)

From 1 client on the 8-core server

  • Results are stable except for the overwrite test, where there might be a large improvement. But I wonder if this is noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
  • The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

  • Results are stable except for the overwrite test, where there might be a large improvement. But I wonder if this is noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
  • The regression in fillseq with 10.6.2 is from bug 13996

Results: IO-bound with O_DIRECT (iodir)

From 1 client on the 8-core server

  • Results are stable
  • The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

  • Results are stable
  • The regression in fillseq with 10.6.2 is from bug 13996


Migrate to Freedom: Choosing a Truly Open Source PostgreSQL Operator

Open Source Isn’t What It Used to Be The landscape of open source has undergone significant changes in recent years, and selecting the right operator and tooling for PostgreSQL clusters in Kubernetes has never been more crucial. MinIO, for example, was a widely used open source S3-compatible storage backend. Over the past few years, it has: […]

IO-bound sysbench vs Postgres on a 48-core server

This post has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr - for Postgres 18.1 relative to 12.22

  • QPS for IO-bound point-query tests is similar while there is a large improvement for the one CPU-bound test (hot-points)
  • QPS for range queries without aggregation is similar
  • QPS for range queries with aggregation is between 1.05X and 1.25X larger in 18.1
  • QPS for writes shows there might be a few large regressions in 18.1

tl;dr - for Postgres 18.1 using different values for the io_method option
  • for tests that do long range queries without aggregation
    • the best QPS is from io_method=io_uring
    • the second best QPS is from io_method=worker with a large value for io_workers
  • for tests that do long range queries with aggregation
    • when using io_method=worker a larger value for io_workers hurts QPS, in contrast to the result for range queries without aggregation
    • for most tests the best QPS is from io_method=io_uring
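
As a concrete illustration of these settings: io_method and io_workers are set in postgresql.conf (io_method only takes effect at server start). The values below are only a sketch of the two variants compared in this post, not my exact configs.

# x10d-style config: asynchronous IO via io_uring
io_method = io_uring

# x10cw16-style config: asynchronous IO via background IO workers
io_method = worker
io_workers = 16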

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.23, 14.20, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Configuration files for the big server
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Unlike the CPU-bound benchmark, here the database is larger than memory so the workload is IO-bound.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for the base version)
When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.

I provide two comparisons and each uses a different base version. They are:
  • base version is Postgres 12.22
    • compare 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1
    • the goal for this is to see how performance changes over time
    • per-test results from vmstat and iostat are here
  • base version is Postgres 18.1
    • compare 18.1 using the x10b_c32r128, x10c_c32r128, x10cw8_c32r128, x10cw16_c32r128, x10cw32_c32r128 and x10d_c32r128 configs
    • the goal for this is to understand the impact of the io_method option
    • per-test results from vmstat and iostat are here
The per-test results from vmstat and iostat can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: Postgres 12.22 through 18.1

All charts except the first have the y-axis start at 0.7 rather than 0.0 to improve readability.

There are two charts for point queries. The second truncates the y-axis to improve readability.
  • a large improvement for the hot-points test arrives in 17.x. While most tests are IO-bound, this test is CPU-bound because all queries fetch the same N rows.
  • for other tests there are small changes, both improvements and regressions, and the regressions are too small to investigate
For range queries without aggregation:
  • QPS for Postgres 18.1 is within 5% of 12.22, sometimes better and sometimes worse
  • for Postgres 17.7 there might be a large regression on the scan test, and that also occurs with 17.6 (not shown). But the scan test can be prone to variance, especially with Postgres, and I don't expect to spend time debugging this. Note that the config I use for 18.1 here uses io_method=sync, which is similar to what Postgres does in releases prior to 18.x. From the vmstat and iostat metrics what I see is:
    • a small reduction in CPU overhead (cpu/o) in 18.1
    • a large reduction in the context switch rate (cs/o) in 18.1
    • small reductions in read IO (r/o and rKB/o) in 18.1
For range queries with aggregation:
  • QPS for 18.1 is between 1.05X and 1.25X better than for 12.22
For write-heavy tests
  • there might be large regressions for several tests: read-write, update-zipf and write-only. The read-write tests do all of the writes done by write-only and then add read-only statements.
  • from the vmstat and iostat results for the read-write tests I see
    • CPU (cpu/o) is up by ~1.2X in PG 16.x through 18.x
    • storage reads per query (r/o) have been increasing from PG 16.x through 18.x and are up by ~1.1X in PG 18.1
    • increases in storage KB read per query (rKB/o) started in PG 16.x and it is 1.44X and 1.16X larger in PG 18.x
  • from the vmstat and iostat results for the update-zipf test
    • results are similar to the read-write tests above
  • from the vmstat and iostat results for the write-only test
    • results are similar to the read-write tests above

Results: Postgres 18.1 and io_method

For point queries
  • results are similar for all configurations and this is expected
For range queries without aggregation
  • there are two charts, the y-axis is truncated in the second to improve readability
  • all configs get similar QPS for all tests except scan
  • for the scan test
    • the x10c_c32r128 config has the worst result. This is expected given there are 40 concurrent connections and it used the default for io_workers (=3)
    • QPS improves for io_method=worker with larger values for io_workers
    • io_method=io_uring has the best QPS (the x10d_c32r128 config)
For range queries with aggregation
  • when using io_method=worker a larger value for io_workers hurts QPS, in contrast to the result for range queries without aggregation
  • io_method=io_uring gets the best QPS on all tests except for the read-only tests with range=10 and 10,000. There isn't an obvious problem based on the vmstat and iostat results. From the r_await column in iostat output (not shown) the differences are mostly explained by a change in IO latency. Perhaps variance in storage latency is the issue.
For writes
  • the best QPS occurs with the x10b_c32r128 config (io_method=sync). I am not sure if that option matters here and perhaps there is too much noise in the results.

How to Fix Kafka to ClickHouse® Performance Bottlenecks

Learn how to optimize your Kafka to ClickHouse pipeline performance with schema optimization, Materialized View tuning, partition distribution strategies and throughput best practices.

December 29, 2025

Rethinking the Cost of Distributed Caches for Datacenter Services

This paper (HOTNETS'25) re-teaches a familiar systems lesson: caching is not just about reducing latency, it is also about saving CPU! The paper makes this point concrete by focusing on the second-order effect that often dominates in practice: the monetary cost of computation. The paper shows that caching --even after accounting for the cost of DRAM you use for caching-- still yields 3–4x better cost efficiency thanks to the reduction in CPU usage. In today's cloud pricing model, that CPU cost dominates. DRAM is cheap. Well, was cheap... I guess the joke is on them now, since right after this paper got presented, the DRAM prices jumped by 3-4x! Damn Machine Learning ruining everything since 2018!

Anyways, let's ignore that point conveniently to get back to the paper. Ok, so caches do help, but when do they help the most? Many database-centric or storage-side cache designs miss this point. Even when data is cached at the storage/database cache, an application read still needs to travel there, pay for RPCs, query planning, serialization, and coordination checks. 

The paper advocates for moving the caches as close to the application as possible to cut costs for CPU. The key argument is that application-level linked caches deliver far better cost savings than storage-layer caches. By caching fully materialized application objects and bypassing the storage/database read path entirely, linked caches eliminate query amplification and coordination overhead. Across production workloads, this yields 3–4x better cost efficiency than storage-layer caching, easily offsetting the additional DRAM cost. Remote caches help, but still burn CPU on RPCs and serialization. Storage-layer caches save disk I/O but leave most of the query and coordination path intact, delivering the weakest cost savings. The results are consistent across different access skews and read intensities, reinforcing that cache placement dominates cache size.

So that is the gist of the paper. The paper makes two adjacent points. Special cases of this observation, if you will. And let's cover them for completeness.

The first point is rich-object workloads, which is where the most striking evaluation results come from. For services where a single logical read expands into many database queries (e.g., metadata services and control planes), caching fully materialized objects at the application level avoids query amplification entirely. And this yields up to an order-of-magnitude cost reduction versus uncached reads and roughly 2x improvement over caching denormalized key-value representations.

The second result, a negative result, is also important. Adding even lightweight freshness or version checks largely erases these gains, because the check itself traverses most of the database stack. The experiments make clear that strong consistency remains fundamentally at odds with the cost benefits of application-level caching. The paper leaves this as an open challenge, saying that we still lack a clean, low-cost way to combine strong consistency with the economic benefits of application-level caching. I think it is possible to employ leases to trade off an increase in update latency with cost efficiency, and alleviate this problem. Or we could just say: Cache coherence is hard, let's go shopping for CXL!


Discussion

Overall, the paper quantifies something many practitioners intuit but rarely measure. If you care about cost (also monetary cost), move caching up the stack, cache rich objects, and trade memory against CPU burn.  

As usual Aleksey and I did a live-reading of the paper. And as usual we had a lot to argue and gripe about. Above is a recording of our discussion, and this links to my annotated paper.

Of course, Aleksey zeroed in on the metastability implications right from the abstract. And yes the metastability implications remained unaddressed in the paper. If you cut costs and operate at lower CPU provisioning (thanks to this cache assist), you are making yourself prone to failure by operating at maximum utilization, without any slack. That means, the moment the cache fails or becomes inaccessible, your application will also get overwhelmed by 2-3x more traffic than it can handle and suffer unavailability or metastability.

I had some reservations about application-level caches. They are undeniably effective, but they lack the reusability and black-box nature of storage-layer caching. Storage-side caching is largely free, transparent, and naturally shared across nodes and applications. Application-level caching, by contrast, requires careful design and nontrivial development effort. It also sacrifices reuse and sharing, since each application must manage its own cache semantics and lifecycle. I wish the paper could discuss these costs and tradeoffs.

Writing, after the introduction section, was repetitive and subpar. Sections 2 and 3 largely repeated the introduction and wasted space. Then we got only two paragraphs of the Theoretical Analysis section, which we had actually looked forward to reading. That section is effectively cropped out of the paper, even though it carries the core of the paper's argument.

The paper's subtitle (see the headers on pages 3, 5, and 7) is a copy-paste error from the authors' HotNets 2024 paper. There did not seem to be any camera-ready time checks on the paper. To motivate strong consistency, the paper drops several citations in Section 2.3, calling them recent work; only 2 out of 6 of these are from after 2014. The figures were sloppy as well. Did you notice Figure 6 above? The y-axes do not cover the same ranges, which makes it very hard to compare the subfigures. The y-axis in Figure 5 uses relative costs, which is also not of much use. It may be that in 2025 most people use LLMs to read papers, but one should still write papers as if humans will read them, past the introduction section and line by line, to understand and check the work.


Finally, here is an interesting question to ponder on. Does this paper conflict with storage disaggregation trend?

At first glance, the paper appears to push against the storage disaggregation trend by arguing for tighter coupling between computation and cached data to meet real-time freshness constraints. In reality, it does not reject disaggregation but warns that disaggregated designs require additional caching above the storage layer: storage-side caching alone would not suffice from either a latency or a cost perspective. The paper also points to a hidden cost: freshness guarantees degrade when cache coherence is treated as a best-effort side effect of an eventually consistent pipeline. The paper's message is that disaggregation needs explicit freshness semantics and coordination mechanisms. So maybe a corollary here is that we should expect disaggregated systems to inevitably grow "stateful edges" over time in order to recover performance and control.

Update Request! New PostgreSQL RPMs Released to Disable Debug Assertions

We recently identified that a batch of our Percona Server for PostgreSQL RPM packages were inadvertently compiled with the debug assertion flag (--enable-cassert) enabled. While these assertions are invaluable for our developers during the testing phase, they are not intended for production use. We have since rebuilt the packages and strongly recommend that all users […]

Percona Operator for MongoDB in 2025: Making Distributed MongoDB More Predictable on Kubernetes

In 2025, the Percona Operator for MongoDB focused on the hardest parts of running MongoDB in Kubernetes: reliable backups and restores, clearer behavior during elections and restores, better observability at scale, and safer defaults as MongoDB 8.0 became mainstream. The year included real course corrections, such as addressing PBM connection leaks and being explicit about […]

December 28, 2025

Randomer Things

I aspire to get bored in the new year

I've realized that chess has been eating my downtime. Because it lives on my phone (Lichess), it is frictionless to start a bullet game, and get a quick dopamine hit.

The problem is that I no longer get bored. That is bad. I need to get bored so I can start to imagine, daydream, think, self-reflect, plan, or even get mentally prepared for things (like the Stoics talked about). I badly need that empty space back.

So bye chess. Nothing personal. I will play only when teaching/playing with my daughters. I may occasionally cheat and play a bullet game on my wife's phone. But no more chess apps on my phone.

While I was at it, I installed the Website Blocker extension for Chrome. I noticed my hands typing reddit or twitter at the first hint of boredom. The blocker is easy to disable, but that is fine. I only need that slight friction to catch myself before opening the site on autopilot.


I am disappointed by online discourse

In 2008, Reddit had a popular thread on the Attacking Generals problem with about 300 confident and largely misguided comments. I shared it with two senior distributed systems researchers, with the subject line "Proof that distributed reasoning is really tricky".

One asked: "Who are these people and where do they come from?" The other mentioned this book: "The Dumbest Generation: How the Digital Age Stupefies Young Americans and Jeopardizes Our Future"

Today, Hacker News is at the same threshold. People with little understanding trade opinions back and forth with other uninformed people.

I am aware this comes across as cranky. I am not angry, and I am not trying to gatekeep or anything. I am glad to see interest in these topics. I just want to see a higher signal-to-noise ratio in these discussions. Right now, maybe less than 10% of the comments say something useful or substantial.

Marc Brooker has a quadrant model to explain why this discourse happens. As he also admits, we can do better, and the forums can design incentives that reward insight instead of noise.


I like Trader Joe’s

A Trader Joe's opened nearby. Now we eat simit for breakfast. Trader Joe's simit is spot-on authentic. As a Turk, this is an amazing treat. Simit is deeply nostalgic. Every kid loves it. Adults too. It's simple, inexpensive, and perfect. 

They also had frozen baklava. It was very fresh, authentic, delicious, and inexpensive as well. Past tense because, it turns out, this was seasonal.

Acquired did an episode on Trader Joe's. Worth a listen. I am always impressed by how helpful the Trader Joe's associates are. Shopping should feel like this. Hey Trader Joe's, after this shout-out, you should sponsor me with unlimited simit. Or at least make baklava year-round. Please.


I also like Qamaria

A Yemeni coffee chain opened nearby. This appears to be the year of pistachio. First Dubai chocolate, now this. We tried pistachio lattes, hot and cold. The cold one was better, and looked better too. Both were very sweet. I had to return the hot one and ask for a replacement with less pistachio paste, so I wouldn't die of sweet pistachio overdose on the spot. The caramel tres leches was excellent. Really light, soft, and moist.

Yemeni coffee tastes better. I am done with Starbucks. The Starbucks branch near us is consistently awful. Some locations are better, but overall, I am finished with it. I hear good things about Luckin Coffee. That is next on my list.


I have been watching things

Zootopia 1 was better.

What happened to scriptwriting? The industry no longer seems to try. No creativity. Not even basic logic. In most movies, there are plot holes large enough to drive a garbage truck through.

Wake Up Dead Man: A Knives Out Mystery was good, but not as engaging as the first.

I did like The Great Flood, a Korean movie on Netflix. At least it is an engaging machine learning movie.

No good Christmas movie this year? Time to watch Die Hard again.

Man vs Baby: Rowan Atkinson is brilliant. Unfortunately, I have a strong stress reaction to watching stupidity unfold slowly and inevitably. Atkinson is exceptionally good at this, which is exactly why I cannot bring myself to watch it.

MongoDB Read/Write vs. PostgreSQL Synchronous Replication

For users accustomed to SQL databases, MongoDB’s consistency model can be confusing. Typical SQL systems layer read replicas on top of data recovery, not into the query layer. MongoDB instead builds replication into its read and write paths, extending ACID guarantees across a horizontally scalable cluster.

The SQL world has similar misconceptions, especially the belief that read replicas can safely scale horizontally—until data is lost. This article follows the six myths from Alexander Kukushkin’s PGConf.DE talk, Myths and Truths about Synchronous Replication in PostgreSQL (slides). For each myth, I relate the idea to MongoDB’s write concern, read concern, and majority commit point.

MongoDB Write Concern Behaviour

In MongoDB, writeConcern: { w: 1 } acknowledges a write as soon as the primary accepts it, before replication. With writeConcern: { w: "majority" } — the default for most replica set configurations — the write is acknowledged only after it has been replicated to a majority of voting nodes.

PostgreSQL’s synchronous commit is similar to MongoDB’s majority write concern, except that in PostgreSQL the primary pushes changes directly to the replicas and waits for synchronous network confirmation from each selected standby. MongoDB replicas pull changes from any peer and send a commit progress report back to the primary.
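
A minimal mongosh sketch of the two acknowledgment levels (the orders collection and its documents are hypothetical):

// acknowledged once the primary has accepted the write; it can still be rolled back on failover
db.orders.insertOne({ _id: 1, status: "new" }, { writeConcern: { w: 1 } });

// acknowledged only after a majority of voting members have replicated the write
db.orders.insertOne({ _id: 2, status: "new" }, { writeConcern: { w: "majority", wtimeout: 5000 } });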

Myth №1 – “PostgreSQL transaction is committed after confirmation from synchronous standby”

🐘 In PostgreSQL, commits are always local first: XLogFlush(XactLastRecEnd) runs before waiting for synchronous replication via SyncRepWaitForLSN(XactLastRecEnd, true).

The transaction is not visible until the synchronous standby has received, flushed, or applied the WAL because locks remain held. If that wait is interrupted — by cancellation, connection loss, or restart — locks may be released early. The transaction can then be visible but not yet durable and may be lost on failover.

These are “dirty reads” in terms of durability, not isolation — they read only committed local data, but that data can still be rolled back in case of failure.

🍃 MongoDB behaves similarly with { w: "majority" } (default in most configurations): it waits for acknowledgment after writing locally. But MongoDB does not hold locks. Visibility is controlled entirely by the read concern. With the default readConcern: "local", you see the change before it is committed to the quorum, and it is therefore subject to rollback on failover. 🌱 With readConcern: "majority", you only read from a point in time where all changes are quorum‑durable.

Myth №2 – “PostgreSQL synchronous replication guarantees Zero RPO / no data loss”

🐘 In PostgreSQL, if synchronous commit wait is cancelled — by query cancel, TCP connection reset, or server restart — the transaction becomes visible immediately. The application receives a warning:

The transaction has already committed locally, but might not have been replicated to the standby.

If the primary fails before standby confirmation, the promoted standby may miss the transaction, resulting in data loss if the application ignored the warning.

If the TCP connection between client and server is interrupted at the wrong moment, the application may not know whether the transaction committed after disconnect — and it will not have received a warning. To verify, you must query data back, or get the transaction ID (txid_current()) before committing, and check it after reconnect using txid_status().

🍃 MongoDB is similar: if the client loses its connection after sending the write but before receiving the acknowledgment for w: "majority", the commit status is unclear. 🌱 However, the driver can handle this with retryable writes — specifying retryWrites: true — for idempotent operations, and for writes with deterministic keys such as a driver‑generated ObjectId. Retrying with the same _id will either match the existing document or insert it once.
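
A hedged mongosh sketch of that pattern (collection name and fields are made up): the client chooses the _id up front, so a retry after an unclear outcome either inserts the document once or fails with a duplicate-key error that can be treated as success.

// retryable writes are on by default in recent drivers, or explicit in the URI:
//   mongodb://host1,host2,host3/app?replicaSet=rs0&retryWrites=true
const orderId = new ObjectId();               // deterministic key chosen client-side
db.orders.insertOne(
  { _id: orderId, total: 42 },
  { writeConcern: { w: "majority" } }         // a retry with the same _id is idempotent
);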

MongoDB Read Concern Behaviour

MongoDB uses MVCC and reads from a recent state of the database that excludes uncommitted changes. By default, readConcern: "local" returns the most recent node‑local commit, durable to that node’s own disk (fdatasync). A write with { w: "majority"} may be visible before it is acknowledged by the quorum, and can be rolled back in a failover.

To extend ACID beyond a single node, readConcern: "majority" guarantees that what you read has been acknowledged by a majority of voting members and is durable cluster‑wide. It does so by reading from the commit snapshot corresponding to the majority commit point.
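
A small sketch of the difference using the find command directly (collection and filter are hypothetical):

// most recent locally committed data; may be rolled back after a failover
db.runCommand({ find: "orders", filter: { status: "new" }, readConcern: { level: "local" } });

// only data that is durable on a majority of voting members
db.runCommand({ find: "orders", filter: { status: "new" }, readConcern: { level: "majority" } });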

Myth №3 – “Reading from PostgreSQL synchronous standby nodes is like reading from the primary”

🐘 In PostgreSQL, a standby can show a transaction before the primary finishes waiting for other standbys. The ACID properties in PostgreSQL are not automatically extended to read replicas.

🍃 In MongoDB you can read from any replica with readConcern: "majority" and guarantee that it reads the same durable state as the primary. Drivers can automatically distribute reads with readPreference: "secondary" or "secondaryPreferred". 🌱 Adding readConcern: "majority" ensures that all those reads see the majority‑committed snapshot. Replicas can lag a little, but that works in an MVCC database where reads do not acquire locks.
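
In mongosh, that combination looks roughly like this (a sketch; the collection is hypothetical):

db.getMongo().setReadPref("secondaryPreferred");  // route reads to a secondary when one is available
db.orders.aggregate(
  [ { $match: { status: "new" } } ],
  { readConcern: { level: "majority" } }           // still read the majority-committed snapshot
);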

MongoDB Majority Commit Point

In MongoDB replica sets, the majority commit point is a logical timestamp indicating that all operations up to that point have been replicated and made durable on a majority of members. The primary advances this point when it observes a higher point on a majority of nodes. With majority write concern, MongoDB acknowledges a write only when it is at or before this point, guaranteeing survival across elections and failovers.

Myth №4 – “We just need to promote PostgreSQL synchronous replica to avoid data loss”

🐘 In PostgreSQL, setting a node to synchronous doesn’t synchronise it instantly. The standby must first reach the "streaming" state, with zero lag, before it can serve as a synchronous standby. Until then, there is a catch‑up window during which promotion can result in missing transactions.

🍃 MongoDB uses the majority commit point in elections. 🌱 If writes use w: "majority", the elected node always has those writes acknowledged, avoiding the catch‑up hazard.

Myth №5 – “With PostgreSQL synchronous replication we don’t need pg_rewind”

🐘 Even in synchronous mode, a PostgreSQL standby can miss certain WAL changes not generated by client transactions — for example, VACUUM — leading to divergence after failover. This cannot be fixed by simple roll‑forward. pg_rewind must identify and copy the differing blocks from the new primary to reinstate the old primary as a standby.

🍃 In MongoDB, rollback when a node rejoins can happen if the oplog has diverged, but normally only with w: 1. 🌱 With writeConcern: "majority" — default in most deployments — writes wait for the commit point to advance and are protected from rollback in the most common cases.

MongoDB Replication Performance Implications

With w: 1, MongoDB only performs a local disk write, so latency is minimal. With w: "majority", it waits for replication to enough nodes and their disk commits (j: true is the default), so each write can incur multiple intra‑cluster and cross‑region RTTs.

For reads, readConcern: "local" avoids extra RTTs by reading the latest local commit. readConcern: "majority" also does not require extra network hops. It reads from the local snapshot corresponding to the majority commit point.

Linearizable reads in MongoDB have the highest cost: they require a no-op write and wait for majority replication before returning, adding a full RTT to the read.
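
A sketch of such a read (the filter must uniquely identify a single document and the read must target the primary; maxTimeMS is recommended because the read waits on majority replication):

db.runCommand({
  find: "accounts",
  filter: { _id: 1 },
  readConcern: { level: "linearizable" },
  maxTimeMS: 10000
});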

Myth №6 – “PostgreSQL Synchronous replication adds no noticeable latency”

🐘 In PostgreSQL, synchronous replication increases latency proportional to RTT between nodes. The push‑based approach also increases primary CPU and network overhead to transmit WAL to all standbys.

🍃 MongoDB behaves similarly with w: "majority", but the pull‑based replication puts less pressure on the primary. 🌱 Secondaries can fetch changes from other secondaries, reducing primary load.

Conclusion

Synchronous replication in PostgreSQL and majority write concern in MongoDB are aimed at the same goal: protecting committed data against failover. Both commit locally first, then wait for quorum confirmation, but PostgreSQL’s locking model delays visibility, whereas MongoDB lets you choose visibility via read concern.

Warnings in PostgreSQL about cancelled sync waits are critical to avoid silent data loss. In MongoDB, retryable writes (retryWrites: true) and idempotent operations solve similar problems of uncertainty after a disconnect.

Read replicas in PostgreSQL do not automatically carry the same durability as the primary. In MongoDB, majority read concern enforces that. PostgreSQL requires manual lag checks before safe promotion. MongoDB elections ensure the majority commit point is present. PostgreSQL can still diverge on non‑transactional WAL and require pg_rewind after a failover, while MongoDB avoids rollback for majority writes.

Performance costs rise in both systems when ensuring strongest guarantees, with PostgreSQL’s push model loading the primary and MongoDB’s pull model distributing replication work.

Knowing exactly what these guarantees mean—and when they don’t apply—is essential to using them safely. In short, MongoDB by default offers similar protection, along with additional operational benefits from its built-in orchestration and cluster-aware drivers. 🌱 By using a {w: "majority"} write concern, MongoDB can extend ACID properties to a horizontally scalable cluster.

December 25, 2025

No Foreign Keys in MongoDB: Rethinking Referential Integrity

In SQL databases, foreign keys act as immediate constraints that verify the correctness of relationships between tables before accepting a write. This was designed for scenarios in which end users can submit random queries directly to the database. As a result, the database is responsible for protecting the data model using normalization, integrity constraints, stored procedures, and triggers, rather than relying on validation performed before the application interacts with the database. When relational integrity is violated, an error occurs, preventing the user from making the changes, and the application rolls back and raises an exception.

MongoDB’s NoSQL approach differs from relational databases as it was designed for application developers. It relies on application code to enforce these rules. Use cases are clearly defined, validation occurs at the application level, and business logic takes precedence over foreign key verification. Eliminating the need for additional serializable reads associated with foreign keys can significantly boost write performance and scalability.

Referential integrity can be verified asynchronously. Instead of raising an exception—an unexpected event the application might not be ready for—MongoDB allows the write to proceed and offers tools like the aggregation framework and change streams to detect and log errors. This approach enables error analysis, data correction, and application fixes without affecting the application's availability, and it can incorporate the business logic.

Let's go through a traditional example of departments and employees, where all employees must belong to a department.

Two collections with reference

Strong relationships, including one-to-many, don't necessarily require multiple collections with references, especially if they share the same lifecycle. Depending on the domain's context, we can embed a list of employees within each department document to ensure referential integrity and prevent orphans. Alternatively, we might embed department information into each employee's document, particularly when department updates are infrequent—such as a simple multi-document change to a department description—or when department changes are usually part of larger enterprise reorganizations.

When both entities are not always accessed together, have unbounded cardinality, or are updated independently, you can choose to reference another document instead of embedding all details. For example, store a "deptno" for each employee and maintain a separate collection of departments, each with a unique "deptno". I insert such data:

// Reset
db.departments.drop();
db.employees.drop();

// Departments
db.departments.createIndex(
  { deptno: 1 },   // deptno will be used as the referenced key
  { unique: true } // it must be unique for many-to-one relationships
);
db.departments.insertMany([
  { deptno: 10, dname: "ACCOUNTING", loc: "NEW YORK" },
  { deptno: 20, dname: "RESEARCH", loc: "DALLAS" },
  { deptno: 30, dname: "SALES", loc: "CHICAGO" },
  { deptno: 40, dname: "OPERATIONS", loc: "BOSTON" }
]);

// Index on the reference for fast navigation from departments to employees
db.employees.createIndex(
  { deptno: 1 }    // reference to departments
);

// Employees in departments 10, 20, and 30
db.employees.insertMany([
  { empno: 7839, ename: "KING",    job: "PRESIDENT", deptno: 10 },
  { empno: 7698, ename: "BLAKE",   job: "MANAGER",   deptno: 30 },
  { empno: 7782, ename: "CLARK",   job: "MANAGER",   deptno: 10 },
  { empno: 7566, ename: "JONES",   job: "MANAGER",   deptno: 20 },
  { empno: 7788, ename: "SCOTT",   job: "ANALYST",   deptno: 20 },
  { empno: 7902, ename: "FORD",    job: "ANALYST",   deptno: 20 },
  { empno: 7844, ename: "TURNER",  job: "SALESMAN",  deptno: 30 },
  { empno: 7900, ename: "JAMES",   job: "CLERK",     deptno: 30 },
  { empno: 7654, ename: "MARTIN",  job: "SALESMAN",  deptno: 30 },
  { empno: 7499, ename: "ALLEN",   job: "SALESMAN",  deptno: 30 },
  { empno: 7521, ename: "WARD",    job: "SALESMAN",  deptno: 30 },
  { empno: 7934, ename: "MILLER",  job: "CLERK",     deptno: 10 },
  { empno: 7369, ename: "SMITH",   job: "CLERK",     deptno: 20 },
  { empno: 7876, ename: "ADAMS",   job: "CLERK",     deptno: 20 }
]);

I didn't declare the schema upfront, as it will come as-is from the application. For performance reasons, I declare indexes on both sides to enable fast navigation between employees and departments, and between departments and employees.

Query examples

This schema supports all cardinalities, including millions of employees per department—something you wouldn't embed—and is normalized to ensure that updates affect only a single document. It also allows for bidirectional querying.

Here's an example of a query that joins all department information to employees as if it were embedded, but evaluated at read time:

db.employees.aggregate([  
  {  
    $lookup: {                 // add all department info in an array  
      from: "departments",  
      localField: "deptno",  
      foreignField: "deptno",  // fast access by index on departments 
      as: "department"  
    }  
  },  
  {  
    $set: {  // get first (and only) match (guaranteed by unique index)  
      department: { $arrayElemAt: ["$department", 0] }  
    }  
  }  
]);

Here's an example of a query that joins all employee information to departments as if it were duplicated and embedded, but evaluated at read time:

db.departments.aggregate([  
  {  
    $lookup: {                 // add all employee info in an array
      from: "employees",          
      localField: "deptno",      
      foreignField: "deptno",  // fast access by index on employees
      as: "employees"             
    }  
  }  
]);

From a performance standpoint, performing a lookup is more costly than reading from a single embedded collection. However, this overhead isn't significant when browsing through tens or hundreds of documents. When choosing this model, because a department might have a million employees, you don't retrieve all the data at once. Instead, a $match will filter documents before the $lookup in the first query, or a filter will be applied within the $lookup pipeline in the second query.

I have covered those variations in a previous post.

What about referential integrity for these queries? If an employee is inserted with a deptno that does not exist in departments, the lookup finds no match. The first query omits the department information, and the second query doesn't show the new employee because it lists only the known department. This is expected behaviour for an application that didn't insert the referenced department.

Relational database administrators often overstate how serious this is, and even call it data corruption. Because SQL defaults to inner joins, that employee would be missing from the result of the first query, but with outer joins like $lookup in MongoDB, this does not happen. It’s more like a NULL in SQL: the information is not yet known, so it isn’t shown. You can add the department later, and the queries will reflect the information as it becomes available.

You may still want to detect when referenced items are not inserted after some time, for example, due to an application bug.

Foreign key definition as a $lookup stage

I define referential integrity using two stages: a lookup stage and a match stage that verify whether the referenced document exists:

const lookupStage = {
  $lookup: {
    from: "departments",
    localField: "deptno",
    foreignField: "deptno",
    as: "dept"
  }
};

const matchStage = { $match: { dept: { $size: 0 } } }
;

The definition is simple and similar to an SQL foreign key. In practice, it can be more complex and precise. Document databases are well-suited to well-understood applications where business logic extends beyond what can be defined by a foreign key. For example, some employees may temporarily have no department—such as new hires—or may belong to two departments during a transition. MongoDB’s flexible schema supports these cases, and you define referential integrity rules accordingly. You aren’t constrained to an application-unaware model as with SQL schemas. I'll keep it simple for this example.
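
To illustrate that flexibility, here is a hedged variant of the match stage that tolerates employees who legitimately have no department yet and flags only documents that reference a department that does not exist:

const matchStageWithOptionalDept = {       // hypothetical variant of the rule above
  $match: {
    deptno: { $exists: true, $ne: null },  // employees without a department are allowed
    dept: { $size: 0 }                     // but a referenced department must exist
  }
};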

One-time validation with an aggregation pipeline

I insert a new employee, Eliot, into dept 42, which doesn’t exist yet:

db.employees.insertOne({
  empno: 9002,
  ename: "Eliot",
  job: "CTO",
  deptno: 42       // Missing department
});

This doesn’t raise any errors. In all queries, the employee is visible only by department number, with no other information about that department.

If you decide that such a situation should not persist and must be detected, you can use an aggregation pipeline to list the violations, with the lookup and match stages defined earlier:

db.employees.aggregate([ lookupStage, matchStage ])
;

This shows the employees referencing a department that doesn't exist:

[
  {
    _id: ObjectId('694d8b6cd0e5c67212d4b14f'),
    empno: 9002,
    ename: 'Eliot',
    job: 'CTO',
    deptno: 42,
    dept: []
  }
]

We’ve caught the violation asynchronously and can decide what to do. Maybe the "deptno" was wrong, maybe we failed to insert the department, or someone deleted it, or we missed a business scenario where employees can be assigned to a department number without more information.

In SQL databases, the rules are basic and not driven by use cases. They check only for anomalies arising from normalization, and any violation is treated as an error without further consideration. However, in MongoDB, where you build a database for a known application, the integrity is part of the business logic.

Whether you should run this validation depends on your database's size and the risk of integrity issues. After major data refactoring, run it as an extra check. To avoid production impact, run it on a read replica—an advantage of asynchronous verification. You don't need a high isolation level, as, at worst, concurrent transactions may trigger a false warning, which can be checked later. If you restore backups for disaster recovery testing, it’s wise to run the validation on that copy to verify both the restore process and data integrity of the primary database.

Real-time watcher with change streams

You may also decide to perform validations in near real time, checking the changes shortly after they occur.

I start a change stream on employees, looking for inserts and updates, and apply the same $lookup + $match to just the changed document:

const cs = db.employees.watch([
  { $match: { operationType: { $in: ["insert", "update", "replace"] } } }
]);
print("👀 Watching employees for referential integrity violations...");

while (cs.hasNext()) {  
  const change = cs.next(); // Get the next change event  
  if (["insert", "update", "replace"].includes(change.operationType)) {  
    const result = db.employees.aggregate([  
      { $match: { _id: change.documentKey._id } }, // check the new document
      lookupStage, // lookup dept info by deptno  
      matchStage   // keep only docs with NO matching dept  
    ]).toArray();  

    if (result.length > 0) {  
      print("\n⚠ Real-time Referential Integrity Violation Detected:");  
      printjson(result[0]);  
    }  
  }  
}  

In another session, I insert another employee, Dwight, in department 42, which is still missing:

db.employees.insertOne({
  empno: 9001,
  ename: "Dwight",
  job: "CEO",
  deptno: 42 // 🚨 Still missing
});

The loop on the change stream gets notified of the insert, applies the aggregation pipeline verification, and returns the following output:

⚠ Real-time Referential Integrity Violation Detected:
{
  _id: ObjectId('694da3aa8cd2fa3fe4d4b0c2'),
  empno: 9001,
  ename: 'Dwight',
  job: 'CEO',
  deptno: 42,
  dept: []
}

Rather than raising an error that blocks the application, this is handled like any other application alert. The application can either create a department automatically or have a user analyze the situation.

Fixing the Violations

I add the missing department:

db.departments.insertOne({
  deptno: 42,
  dname: "DEVELOPER EXPERIENCE",
  loc: "REMOTE"
});

I re-run the batch check:

db.employees.aggregate([ lookupStage, matchStage ]);

This time the pipeline returns no documents: the violation has been resolved.

Conclusion

In SQL databases, foreign key constraints require reading the parent record before inserting or updating a child and applying locks if necessary. When deleting or updating referenced keys, the database checks for existing children and waits if a child is being inserted. If users manually perform these operations on production or if the database administrator doubts the development team, using a SQL database with declared foreign keys is advisable.

However, these operations generally originate from an application that has already conducted the necessary checks: it reads the referenced table to get the key, verifies no children exist before deleting a parent, and often prefers logical over physical deletions. Additional validations can identify bugs, but they can run asynchronously and integrate with business logic and alert systems.
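
As a sketch of that application-side check in mongosh (using the departments and employees collections from above), deleting a department only when it no longer has employees could look like this. The check is not serialized with concurrent inserts, which is why the asynchronous validation described earlier remains useful as a safety net.

// check for children before removing the parent
if (db.employees.countDocuments({ deptno: 40 }) === 0) {
  db.departments.deleteOne({ deptno: 40 });
} else {
  print("department 40 still has employees, not deleting");
}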

In MongoDB, data integrity is implemented by developers across the various development stages, in a DevOps manner, rather than solely during write operations. This strategy relies on not altering the production database outside of structured development practices like peer reviews and testing. However, if the database administrator (DBA) doesn't control who accesses the database or doesn't trust the developers, they will want all verifications to be carried out within the database before each commit.

December 24, 2025

Kubernetes Operators Compared: The Key to Scalable, Cost-Efficient Databases

Choosing the right Kubernetes operator is one of those quiet decisions that ultimately defines your database strategy, affecting everything from how easily you automate backups and scaling to how much control you maintain over long-term costs and architecture. But while most operators look similar at first glance, their underlying models yield vastly different outcomes. Some […]

Adaptive Join in Amazon Aurora PostgreSQL

Slow joins in PostgreSQL often result from a nested loop join chosen by the query planner, which estimates a few rows but processes hundreds of thousands. System metrics like buffer cache hit ratio are all green, but it reads more data than necessary and takes longer. This can happen suddenly because the join method decision is sensitive: a single row difference can trigger a shift from a hash join to a nested loop.

Although cost-based optimizers are common, alternative strategies exist. For example, MongoDB's multi-planner postpones index selection until execution, testing all options and switching to the best after a short trial. Likewise, Oracle Database can delay selecting the join method or parallel query distribution by buffering rows before determining the plan for the remaining data. Amazon Aurora implements a similar approach called adaptive join, which defers the decision between nested loop and hash join until execution.

Here is an example of Amazon Aurora adaptive plans. I used the same tables as in the previous post with two additional indexes:

CREATE INDEX ON outer_table (a,b,id);
CREATE INDEX ON inner_table (id,b);

I executed the following query:

explain (analyze)
SELECT o.b,i.b 
 FROM outer_table o 
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;

                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.57..124.76 rows=4 width=8) (actual time=0.020..0.119 rows=9 loops=1)
   ->  Index Only Scan using outer_table_a_b_id_idx on outer_table o  (cost=0.29..58.71 rows=20 width=8) (actual time=0.011..0.061 rows=28 loops=1)
         Index Cond: ((a < 10) AND (b < 10))
         Heap Fetches: 0
   ->  Index Only Scan using inner_table_id_b_idx on inner_table i  (cost=0.29..3.30 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=28)
         Index Cond: (id = o.id)
         Heap Fetches: 0
 Planning Time: 0.321 ms
 Execution Time: 0.152 ms
(9 rows)

Because I had optimal indexes and not too many rows in the outer table (estimated rows=20), the query planner chose a nested loop join. During execution, there were more rows than estimated (actual rows=28). It might still be an effective join method with not too many inner loops (loops=28). However, what happens if the actual number of rows is much higher?

For instance, increasing the range on "o.b" causes the query planner to swap the join inputs and select a hash join.

explain (analyze)
SELECT o.b,i.b 
 FROM outer_table o 
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<1000
;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=108.48..289.73 rows=444 width=8) (actual time=0.656..2.426 rows=900 loops=1)
   Hash Cond: (i.id = o.id)
   ->  Seq Scan on inner_table i  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.006..0.697 rows=10000 loops=1)
   ->  Hash  (cost=80.72..80.72 rows=2221 width=8) (actual time=0.624..0.624 rows=2250 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 120kB
         ->  Index Only Scan using outer_table_a_b_id_idx on outer_table o  (cost=0.29..80.72 rows=2221 width=8) (actual time=0.019..0.315 rows=2250 loops=1)
               Index Cond: ((a < 10) AND (b < 1000))
               Heap Fetches: 0
 Planning Time: 0.901 ms
 Execution Time: 2.522 ms
(10 rows)

Instead of starting a nested loop from "outer_table", this approach loads the qualifying rows from "outer_table" into a hash build table and probes it with a scan of "inner_table". Although this initial step takes longer because it builds the hash table, it avoids running 2,250 inner loops, as I can verify by disabling all other join methods.

set enable_hashjoin  to off;
set enable_mergejoin to off;

explain (analyze)
SELECT o.b,i.b 
 FROM outer_table o 
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<1000
;

                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.57..872.57 rows=444 width=8) (actual time=0.610..4.871 rows=900 loops=1)
   ->  Index Only Scan using outer_table_a_b_id_idx on outer_table o  (cost=0.29..80.72 rows=2221 width=8) (actual time=0.039..0.334 rows=2250 loops=1)
         Index Cond: ((a < 10) AND (b < 1000))
         Heap Fetches: 0
   ->  Index Only Scan using inner_table_id_b_idx on inner_table i  (cost=0.29..0.36 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=2250)
         Index Cond: (id = o.id)
         Heap Fetches: 0
 Planning Time: 4.991 ms
 Execution Time: 5.670 ms
(9 rows)

With accurate cardinality estimates, the query planner can select the optimal join method — but during execution, these estimates can sometimes be significantly inaccurate. This is where an adaptive plan can assist—not necessarily to find the perfect plan, but to prevent the worst-case scenarios.

I re-enable all join methods, activate the adaptive plan, and rerun my initial query, which retrieves 28 rows from the outer table.

set enable_hashjoin  to on;
set enable_mergejoin to on;

set apg_adaptive_join_crossover_multiplier to 1;
set apg_enable_parameterized_adaptive_join to on;

explain (analyze)
SELECT o.b,i.b 
 FROM outer_table o 
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop (Adaptive)  (cost=0.57..124.76 rows=4 width=8) (actual time=4.559..4.609 rows=9 loops=1)
   Adaptive Crossover: rows=74
   ->  Index Only Scan using outer_table_a_b_id_idx on outer_table o  (cost=0.29..58.71 rows=20 width=8) (actual time=1.999..3.261 rows=28 loops=1)
         Index Cond: ((a < 10) AND (b < 10))
         Heap Fetches: 0
   ->  Index Only Scan using inner_table_id_b_idx on inner_table i  (cost=0.29..3.30 rows=1 width=8) (actual time=0.047..0.047 rows=0 loops=28)
         Index Cond: (id = o.id)
         Heap Fetches: 0
 Planning Time: 2.107 ms
 Execution Time: 4.648 ms
(10 rows)

It's still a nested loop with the same cost estimate as before, but it is now flagged as (Adaptive) and shows an extra detail: Adaptive Crossover: rows=74.

This indicates that the query planner found a nested loop to be cheaper than a hash join for the originally estimated number of iterations (rows=20). At planning time, it also computed the cost for higher row counts and identified a crossover point at rows=74, beyond which a hash join would have been cheaper and therefore selected. In other words, the planner pre-calculated an inflection point at which it would prefer a hash join and deferred the final choice to execution time.

At runtime, the rows read from outer_table are counted and buffered. Because the row count never reached the crossover/inflection point, the plan continued using the nested loop.

To see how the plan changes with more qualifying rows, I updated my data so that more rows satisfy the predicate a < 10 AND b < 10:

UPDATE outer_table SET b=0 WHERE a<10 AND b BETWEEN 10 AND 40;

UPDATE 47

I ran my query again. It’s still (Adaptive) with Adaptive Crossover: rows=74, but now it shows a Hash Join:

explain (analyze)
SELECT o.b,i.b 
 FROM outer_table o 
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;

                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join (Adaptive)  (cost=58.96..240.21 rows=4 width=8) (actual time=2.531..3.801 rows=28 loops=1)
   Output: o.b, i.b
   Inner Unique: true
   Adaptive Crossover: rows=74
   Hash Cond: (i.id = o.id)
   ->  Seq Scan on public.inner_table i  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.007..0.628 rows=10000 loops=1)
         Output: i.id, i.a, i.b
   ->  Hash  (cost=58.71..58.71 rows=20 width=8) (actual time=2.470..2.470 rows=75 loops=1)
         Output: o.b, o.id
         ->  Index Only Scan using outer_table_a_b_id_idx on public.outer_table o  (cost=0.29..58.71 rows=20 width=8) (actual time=1.103..1.280 rows=75 loops=1)
               Output: o.b, o.id
               Index Cond: ((o.a < 10) AND (o.b < 10))
               Heap Fetches: 57
 Query Identifier: 8990309245261094611
 Planning Time: 1.674 ms
 Execution Time: 3.861 ms
(16 rows)

At planning time, the decision remained the same as before because the statistics had not changed (the estimate was still cost=0.29..58.71 rows=20). In reality, though, more than 74 rows were read from outer_table (actual rows=75), and instead of being used for a nested loop, the buffered rows were used as the build table of a hash join.

I then analyzed the table to see what would happen with fresh statistics, and was surprised to find the plan reverted to a nested loop:

analyze outer_table ;

explain (analyze)
SELECT o.b,i.b
 FROM outer_table o
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop (Adaptive)  (cost=0.57..145.25 rows=5 width=8) (actual time=0.124..0.248 rows=28 loops=1)
   Adaptive Crossover: rows=78
   ->  Index Only Scan using outer_table_a_b_id_idx on outer_table o  (cost=0.29..70.29 rows=23 width=8) (actual time=0.026..0.104 rows=75 loops=1)
         Index Cond: ((a < 10) AND (b < 10))
         Heap Fetches: 57
   ->  Index Only Scan using inner_table_id_b_idx on inner_table i  (cost=0.29..3.26 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=75)
         Index Cond: (id = o.id)
         Heap Fetches: 0
 Planning Time: 1.025 ms
 Execution Time: 0.287 ms
(10 rows)

The reason is that even with a freshly analyzed table, the optimizer's estimate is worse than before: it predicts fewer rows (rows=23) when there are actually more (rows=75). This can happen because a compound predicate such as a < 10 AND b < 10 is hard for the cost-based optimizer, especially now that the updates have made the two columns correlated, which per-column statistics cannot capture. With the refreshed statistics, the inflection point also moved slightly higher (rows=78), so at runtime the 75 rows read from outer_table stayed below the crossover and the plan kept the nested loop.
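One thing to try in this situation, as a sketch that I did not test in this post, is multivariate statistics on the two columns. The MCV list can sometimes capture the correlation that per-column statistics miss, although whether it improves this particular range-predicate estimate depends on the data:

-- Hypothetical: multivariate statistics on (a, b) of outer_table.
-- MCV statistics can sometimes improve estimates for correlated predicates
-- such as a < 10 AND b < 10; functional dependencies only help equality clauses.
CREATE STATISTICS outer_table_a_b_stats (mcv) ON a, b FROM outer_table;
ANALYZE outer_table;
-- then re-check the estimated row count for the predicate
EXPLAIN SELECT o.b FROM outer_table o WHERE o.a < 10 AND o.b < 10;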

What I like about this feature—which I’ve been familiar with since it was implemented in Oracle Database years ago—is that it doesn’t try to find the absolute best plan. Instead, it focuses on avoiding the worst plans (for example, nested loops on tens or hundreds of thousands of rows) and switching to a plan that is simply good enough. Amazon Aurora is a black box with limited tracing, so it’s difficult to know exactly how it works, but it probably behaves similarly to Oracle adaptive plans. I wrote an older blog post about how Oracle determines the inflection point:

Oracle 12c Adaptive Plan & inflection point (dbi-services.com)

As this requires more work at execution time, Aurora triggers it only when the cost of the first join method is higher than a threshold, by default 100:

postgres=> \dconfig apg*adaptive_join*

        List of configuration parameters
               Parameter                | Value
----------------------------------------+-------
 apg_adaptive_join_cost_threshold       | 100
 apg_adaptive_join_crossover_multiplier | 1
 apg_enable_parameterized_adaptive_join | on
(3 rows)

In my examples, the cost was higher than apg_adaptive_join_cost_threshold (Nested Loop (cost=0.57..124.76)). I had to enable apg_enable_parameterized_adaptive_join because the join predicate is pushed down as a parameter (Index Cond: (id = o.id)), which is the main advantage of a nested loop join: it enables index access to the inner table. I set apg_adaptive_join_crossover_multiplier to 1 to enable the feature; setting a higher value raises the inflection point by multiplying the crossover value, which reduces the likelihood of the adaptive switch being triggered.
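As a quick illustration, assuming the crossover reported in the plan scales with the multiplier, raising it in a session would push the switch point further out:

-- Hypothetical session-level tuning: with Adaptive Crossover: rows=74 at multiplier 1,
-- a multiplier of 10 should move the switch point to roughly rows=740,
-- so the plan would keep the nested loop unless the misestimate is much larger.
set apg_adaptive_join_crossover_multiplier to 10;
-- back to the value used in this post
set apg_adaptive_join_crossover_multiplier to 1;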

To test it further, I modified my data so that the outer table returns 50,000 rows and ran the query again:

postgres=> UPDATE outer_table SET a=0 , b=0;

UPDATE 50000

postgres=> explain (analyze)
SELECT o.b,i.b
 FROM outer_table o
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join (Adaptive)  (cost=180.21..361.46 rows=5 width=8) (actual time=231.245..234.250 rows=10000 loops=1)
   Adaptive Crossover: rows=139
   Hash Cond: (i.id = o.id)
   ->  Seq Scan on inner_table i  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.014..0.633 rows=10000 loops=1)
   ->  Hash  (cost=179.65..179.65 rows=45 width=8) (actual time=231.240..231.240 rows=50000 loops=1)
         ->  Index Only Scan using outer_table_a_b_id_idx on outer_table o  (cost=0.42..179.65 rows=45 width=8) (actual time=1.641..90.167 rows=50000 loops=1)
               Index Cond: ((a < 10) AND (b < 10))
               Heap Fetches: 50075
 Planning Time: 1.041 ms
 Execution Time: 298.893 ms
(10 rows)

The adaptive plan avoided a nested loop join that would have required 50,000 iterations. With accurate statistics, the optimizer probably would have selected a merge join instead, since it needs to read all tables and I have indexes on the join keys. In that case, a merge join would have been faster than a hash join. That said, even with stale statistics, the hash join was still much better — or at least less risky — than using a nested loop join.
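One way to check that hypothesis, which I did not do in this post, is to disable the other join methods, as was done earlier to isolate the nested loop, and compare the plan and timing:

-- Hypothetical check: leave only the merge join available to the planner
set enable_hashjoin to off;
set enable_nestloop to off;
explain (analyze)
SELECT o.b,i.b
 FROM outer_table o
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;
-- re-enable them for the rest of the session
set enable_hashjoin to on;
set enable_nestloop to on;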

Without this feature in PostgreSQL, you still have options. Ensure statistics are accurate (this remains true even with adaptive plans, which rely on estimates) and use extended statistics where they help. Make sure you have the right indexes so that cost differences are clear and the planner does not hesitate between two bad plans. You can use pg_hint_plan to force a specific join method, though it often needs more hints than expected (see Predictable plans with pg_hint_plan full hinting). Some people tweak random_page_cost, which affects index costs and thus join choices, but I have my own ideas about that.

Because joins are central in SQL databases due to relational normalization, a poor join method can make them seem slow and unpredictable, so it is crucial to understand join strategies and review execution plans carefully. This feature in Aurora helps prevent some runaway queries, so I think it is a good idea to enable it by default, especially given that you can set a crossover multiplier to have it kick in only to avoid the worst cases.
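For the pg_hint_plan option mentioned above, a minimal sketch (assuming the extension is installed and loaded) would look like this; in practice, fully pinning a plan usually needs more hints, as discussed in the linked post:

/*+ Leading((o i)) HashJoin(o i) */  -- hypothetical hint: fix the join order and force a hash join
SELECT o.b, i.b
 FROM outer_table o
 JOIN inner_table i USING(id)
 WHERE o.a<10 AND o.b<10
;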

December 23, 2025

Unnesting Scalar Subqueries into Left Outer Joins in SQL

Relational databases treat your query as a declarative description of the desired result and select the most efficient execution plan. They may rewrite the query—for example, by transforming subqueries into joins and vice versa—so the database, not the developer, manages this complexity.

Historically, PostgreSQL has had fewer planner transformations than many commercial databases. As an open-source project, it favors simpler code that promotes contributions and good SQL design, while commercial vendors can justify more complex planning logic when it helps revenue-generating customers reduce their optimization effort. PostgreSQL does not maintain a global shared plan cache, so most queries are planned at each execution, which encourages keeping planning overhead low. The only exceptions are prepared statements, functions, and other situations where a plan can be generic, retained, and reused. Transformations open more access paths, potentially improving execution plans, but at the cost of higher planning time.
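As a small illustration of that plan-reuse exception, here is a generic sketch, not specific to Aurora, of how a prepared statement retains a plan across executions:

-- Hypothetical illustration of plan reuse with a prepared statement
CREATE TEMP TABLE t AS SELECT g AS a FROM generate_series(1, 10000) g;
PREPARE q(int) AS SELECT count(*) FROM t WHERE a < $1;
-- by default PostgreSQL plans the first executions with the actual parameter value,
-- then may switch to a cached generic plan; this setting forces the generic plan
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE q(10);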

AWS recently added these kinds of transformations to Amazon Aurora PostgreSQL (APG) to optimize queries with a correlated subquery in the projection:

Without this transformation, projection expressions are evaluated per row, which at best yields nested-loop–like performance. By pushing the subquery down so it runs before the projection, more efficient join algorithms can be used. Amazon Aurora is a proprietary fork of PostgreSQL, and its improvements are not contributed upstream, so this feature is not available in PostgreSQL. Nonetheless, examining the transformation Aurora implements can inspire similar manual rewrites of queries. I’ll use the example from the AWS blog post to show how to rewrite the query more efficiently. It is important to understand the transformation carefully to ensure it does not change the results.

Here is the DDL and DML to create the same example as AWS's article:

-- Clean up from previous runs

DROP TABLE IF EXISTS outer_table;
DROP TABLE IF EXISTS inner_table;

-- Outer table (like customers)

CREATE TABLE outer_table (
    id  SERIAL PRIMARY KEY,
    a   INT,
    b   INT
);

-- Inner table (like orders)

CREATE TABLE inner_table (
    id  SERIAL PRIMARY KEY,
    a   INT,
    b   INT
);

-- Insert data into inner_table:
--   10,000 rows, 'a' cycles from 1..100, 'b' random 0–999

INSERT INTO inner_table (a, b)
SELECT
    1 + mod(gs - 1, 100),
    floor(random() * 1000)::int
FROM generate_series(1, 10000) AS gs;

-- Insert outer_table:
--   First 25K rows: a = 1..100 repeated

INSERT INTO outer_table (a, b)
SELECT
    1 + mod(gs - 1, 100),
    floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
--   Next 25K rows: unique a = 101..25100
INSERT INTO outer_table (a, b)
SELECT
    gs + 100,
    floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
-- Gather statistics
ANALYZE outer_table;
ANALYZE inner_table;

I've set up an Aurora Serverless database to test it by enabling the transformation parameter and running the query, displaying the execution plan:

SET apg_enable_correlated_scalar_transform = on;

explain (analyze , verbose, costs off)
SELECT outer_table.a, outer_table.b
       , (SELECT AVG(inner_table.b)
          FROM inner_table
          WHERE inner_table.a = outer_table.a
) FROM outer_table
;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=4.904..15.740 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, (avg(inner_table.b))
   Inner Unique: true
   Hash Cond: (outer_table.a = inner_table.a)
   ->  Seq Scan on public.outer_table (actual time=0.016..2.968 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=2.985..2.986 rows=100 loops=1)
         Output: (avg(inner_table.b)), inner_table.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  HashAggregate (actual time=2.930..2.960 rows=100 loops=1)
               Output: avg(inner_table.b), inner_table.a
               Group Key: inner_table.a
               Batches: 1  Memory Usage: 32kB
               ->  Seq Scan on public.inner_table (actual time=0.016..0.637 rows=10000 loops=1)
                     Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: -2382945993278526738
 Planning Time: 2.439 ms
 Execution Time: 23.322 ms
(18 rows)

I've added the verbose option to explain because it is important to see all elements of the rewritten query. Here, with the transformation, (SELECT AVG(inner_table.b) FROM ... WHERE ... ) has been transformed into (avg(inner_table.b)) taken from a Hash Left Join to the inner table aggregated by the HashAggregate (one row per value of inner_table.a).

It is the equivalent of:

explain (analyze , verbose, costs off )
SELECT outer_table.a,  outer_table.b
       , agg.avg_b
FROM outer_table
LEFT JOIN (
    SELECT a, AVG(b) AS avg_b
    FROM inner_table
    GROUP BY a
) AS agg
ON outer_table.a = agg.a
;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=4.469..16.534 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, agg.avg_b
   Inner Unique: true
   Hash Cond: (outer_table.a = agg.a)
   ->  Seq Scan on public.outer_table (actual time=0.011..3.124 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=3.804..3.806 rows=100 loops=1)
         Output: agg.avg_b, agg.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Subquery Scan on agg (actual time=3.733..3.778 rows=100 loops=1)
               Output: agg.avg_b, agg.a
               ->  HashAggregate (actual time=3.732..3.765 rows=100 loops=1)
                     Output: inner_table.a, avg(inner_table.b)
                     Group Key: inner_table.a
                     Batches: 1  Memory Usage: 32kB
                     ->  Seq Scan on public.inner_table (actual time=0.004..0.668 rows=10000 loops=1)
                           Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: -3523129028670016640
 Planning Time: 1.361 ms
 Execution Time: 19.674 ms
(20 rows)

This looks simple, but SQL can be tricky: not all aggregate functions have the same semantics in the absence of rows. If no inner rows match, AVG() returns NULL, which is exactly what the correlated subquery returns, so the left-join rewrite is equivalent. With COUNT() it is different: the correlated subquery returns 0 when no rows match, whereas the left join to the grouped inner table produces NULL for outer rows that have no matching group.
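A quick way to see the difference with the tables above (inner_table.a only holds values 1 to 100, so a = -1 matches no rows):

-- Scalar aggregates over an empty result: AVG() yields NULL, COUNT() yields 0
SELECT (SELECT AVG(b)   FROM inner_table WHERE a = -1) AS avg_no_rows,
       (SELECT COUNT(b) FROM inner_table WHERE a = -1) AS count_no_rows;
-- returns NULL (shown empty in psql) for avg_no_rows and 0 for count_no_rows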

I test the automatic transformation on Aurora with a COUNT():

explain (analyze , verbose, costs off)
SELECT outer_table.a, outer_table.b
       , (SELECT COUNT(inner_table.b)
          FROM inner_table
          WHERE inner_table.a = outer_table.a
) FROM outer_table
;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=2.319..13.332 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, COALESCE((count(inner_table.b)), 0)
   Inner Unique: true
   Hash Cond: (outer_table.a = inner_table.a)
   ->  Seq Scan on public.outer_table (actual time=0.012..3.003 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=2.302..2.304 rows=100 loops=1)
         Output: (count(inner_table.b)), inner_table.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  HashAggregate (actual time=2.255..2.268 rows=100 loops=1)
               Output: count(inner_table.b), inner_table.a
               Group Key: inner_table.a
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.inner_table (actual time=0.003..0.640 rows=10000 loops=1)
                     Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: 6903753335662751945
 Planning Time: 1.267 ms
 Execution Time: 15.219 ms
(18 rows)

Now the VERBOSE option shows COALESCE((count(inner_table.b)), 0), which turns the NULL produced by the left join into a zero, in order to match the COUNT() semantics.

Here is the equivalent query if you want to do the transformation manually:

explain (analyze , verbose )
SELECT outer_table.a,  outer_table.b
       , COALESCE(agg.cnt_b, 0) AS cnt_b
FROM outer_table
LEFT JOIN (
    SELECT a, COUNT(b) AS cnt_b
    FROM inner_table
    GROUP BY a
) AS agg
ON outer_table.a = agg.a;

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=208.25..1110.53 rows=50000 width=16) (actual time=2.556..77.785 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, COALESCE(agg.cnt_b, '0'::bigint)
   Inner Unique: true
   Hash Cond: (outer_table.a = agg.a)
   ->  Seq Scan on public.outer_table  (cost=0.00..771.00 rows=50000 width=8) (actual time=0.012..3.183 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash  (cost=207.00..207.00 rows=100 width=12) (actual time=2.532..2.534 rows=100 loops=1)
         Output: agg.cnt_b, agg.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Subquery Scan on agg  (cost=205.00..207.00 rows=100 width=12) (actual time=2.485..2.509 rows=100 loops=1)
               Output: agg.cnt_b, agg.a
               ->  HashAggregate  (cost=205.00..206.00 rows=100 width=12) (actual time=2.484..2.497 rows=100 loops=1)
                     Output: inner_table.a, count(inner_table.b)
                     Group Key: inner_table.a
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on public.inner_table  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.006..0.679 rows=10000 loops=1)
                           Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: 4982770911819576582
 Planning Time: 0.151 ms
 Execution Time: 80.622 ms
(20 rows)

You need COALESCE with COUNT in the manual join rewrite because of how SQL aggregates behave: unlike other aggregates, COUNT returns 0—not NULL—when there are no rows.

Aurora PostgreSQL’s apg_enable_correlated_scalar_transform shows how the planner can rewrite correlated subqueries in the SELECT list into join-based aggregates, replacing per-row subqueries with set-based aggregation and hash joins for much better performance.

Even if you don’t use Aurora in production, you can spin up a small Aurora Serverless instance to validate your manual rewrites. Enable the parameter and compare the execution plan to your PostgreSQL version. The plans should match, with one structural difference: the manual version’s grouped subquery is wrapped in a Subquery Scan, which is expected in hand-written SQL.

For aggregates like AVG(), the rewrite preserves semantics with no extra changes. For COUNT(), wrap the join output in COALESCE(..., 0) so it behaves like the original correlated subquery when no rows are found.

By understanding these transformations and their semantics, you can reproduce Aurora’s optimization in upstream PostgreSQL and gain similar performance benefits while keeping full control over correctness.

Kubernetes Multi-Cloud Architecture: Building Portable Databases Without Lock-In

Most organizations now run across multiple clouds, pursuing flexibility, better pricing, or regional availability. But while stateless applications move freely, databases often remain stuck. Each cloud provider offers its own managed database service (e.g., RDS, Cloud SQL, Azure Database) with distinct APIs, automation tools, and monitoring layers. Once you commit to one, moving becomes complicated […]

December 22, 2025

Memory Management in MongoDB 8.0: Testing the New TCMalloc

With MongoDB 8.0, the database engine takes another step forward in performance optimization, particularly in how it manages memory. One of the most impactful changes under the hood is the updated version of TCMalloc (Thread-Caching Malloc), which affects how the server allocates, caches, and reuses memory blocks. For workloads with high concurrency, long-running queries, or […]