A curated list of database news from authoritative sources

September 02, 2025

Postgres 18 beta3, large server, sysbench

This post has performance results for Postgres 18 beta3, beta2, beta1, 17.5, and 17.4 using the sysbench benchmark and a large server. The working set is cached and the benchmark is run with high concurrency (40 connections). The goal is to search for CPU and mutex regressions. This work was done by Small Datum LLC and was not sponsored.

tl;dr

  • There might be small regressions (~2%) for several range queries that don't do aggregation. This is similar to what I reported for 18 beta3 on a small server, but here it only occurs for 3 of the 4 microbenchmarks, while on the small server it occurs on all 4. I am still not certain whether this really is a regression.

Builds, configuration and hardware

I compiled Postgres versions 17.4, 17.5, 18 beta1, 18 beta2 and 18 beta3 from source.

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled, and 128G of RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.

The config file for Postgres 17.4 and 17.5 is x10a_c32r128.

The config files for Postgres 18 are:
  • x10b_c32r128 is functionally the same as x10a_c32r128 but adds io_method=sync
  • x10d_c32r128 starts with x10a_c32r128 and adds io_method=io_uring

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks, and most of them test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The tests are run using 8 tables with 10M rows per table. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
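
The exact helper and Lua scripts are the ones linked above; as a rough sketch only (not the author's actual scripts), one microbenchmark step with stock sysbench against a local Postgres would look something like this, assuming a pre-created sbtest database and user:

sysbench oltp_point_select --db-driver=pgsql \
  --pgsql-host=127.0.0.1 --pgsql-db=sbtest --pgsql-user=sbtest --pgsql-password=sbtest \
  --tables=8 --table-size=10000000 --threads=40 prepare

sysbench oltp_point_select --db-driver=pgsql \
  --pgsql-host=127.0.0.1 --pgsql-db=sbtest --pgsql-user=sbtest --pgsql-password=sbtest \
  --tables=8 --table-size=10000000 --threads=40 --time=600 run

The real runs use customized Lua scripts, so workloads and options differ; this only illustrates the shape of an invocation (8 tables of 10M rows, 40 threads, 600 seconds per read-heavy step).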

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.5)
When the relative QPS is > 1 then some version is faster than PG 17.5. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help explain why something is faster or slower because they show how much HW is used per request.

Relative to: pg174_o2nofp.x10a_c32r128
col-1 : pg175_o2nofp.x10a_c32r128
col-2 : pg18beta1_o2nofp.x10b_c32r128
col-3 : pg18beta1_o2nofp.x10d_c32r128
col-4 : pg18beta2_o2nofp.x10d_c32r128
col-5 : pg18beta3_o2nofp.x10d_c32r128

col-1   col-2   col-3   col-4   col-5
0.98    0.99    0.99    1.00    0.99    hot-points_range=100
1.01    1.01    1.00    1.01    1.01    point-query_range=100
1.00    1.00    0.99    1.00    1.00    points-covered-pk
1.00    1.01    1.00    1.02    1.00    points-covered-si
1.00    1.01    1.00    1.00    1.00    points-notcovered-pk
1.00    1.00    1.01    1.02    1.00    points-notcovered-si
1.00    1.00    1.00    1.00    1.00    random-points_range=1000
1.00    1.01    1.00    1.00    1.00    random-points_range=100
1.00    1.00    1.00    1.00    1.00    random-points_range=10
1.00    0.97    0.96    0.98    0.97    range-covered-pk
1.00    0.97    0.97    0.98    0.97    range-covered-si
0.99    0.99    0.99    0.99    0.98    range-notcovered-pk
1.00    1.01    1.01    1.00    1.01    range-notcovered-si
1.00    1.02    1.03    1.03    1.02    read-only-count
1.00    1.00    1.00    1.01    1.01    read-only-distinct
1.00    1.00    1.00    1.00    1.00    read-only-order
1.01    1.01    1.02    1.02    1.01    read-only_range=10000
1.00    0.99    0.99    0.99    1.00    read-only_range=100
1.01    0.99    0.99    1.00    0.99    read-only_range=10
1.00    1.01    1.01    1.01    1.01    read-only-simple
1.00    1.02    1.03    1.03    1.02    read-only-sum
1.00    1.13    1.14    1.02    0.91    scan_range=100
1.00    1.13    1.13    1.02    0.90    scan.warm_range=100
1.00    0.99    0.99    0.99    0.99    delete_range=100
0.99    1.00    1.02    0.99    1.00    insert_range=100
1.01    1.00    1.00    1.00    0.99    read-write_range=100
1.00    0.98    1.00    1.01    0.99    read-write_range=10
0.99    0.99    1.02    0.98    0.96    update-index
1.00    1.01    1.00    1.00    1.01    update-inlist
0.98    0.98    0.99    0.98    0.97    update-nonindex
0.95    0.95    0.94    0.93    0.95    update-one_range=100
0.97    0.98    0.98    0.97    0.95    update-zipf_range=100
0.98    0.99    0.99    0.98    0.98    write-only_range=10000

Asymmetric Linearizable Local Reads

People want data fast. They also want it consistent. Those two wants pull in opposite directions. This VLDB'25 paper takes another crack at this conundrum. Rather than assuming a symmetric network environment where all replicas face similar latencies, the paper emphasizes that in practice some replicas are close to the leader while others are stranded halfway across the globe. By embracing this asymmetry, the authors propose two new algorithms: Pairwise-Leader (PL) and Pairwise-All (PA). Both cut read latency compared to prior approaches. PL can even achieve 50x latency improvements in some cases.

Aleksey and I did our usual thing. We recorded our first blind read of the paper. You can watch it here (link to come soon), if you like seeing two people puzzle through a paper in real time. I also annotated a copy while reading which you can access here.

We liked the ideas, even though the protocols themselves didn't thrill us. I particularly liked finding another good example of the use of synchronized time in distributed database systems. Another study to add to my survey.


Background

At the heart of the problem is linearizability (see my explanation for a primer), a strong consistency condition that ensures operations on a distributed object appear to occur atomically in a single total order consistent with real time. If one operation finishes before another begins, then all replicas must reflect this order. This strong model spares developers from reasoning about concurrency anomalies: if you read after a write, you are guaranteed to observe it. That means there are no stale reads.

While linearizability makes life simple for developers, it makes it harder for the system, which has to make sure every replica behaves like one machine. This is typically enforced by state machine replication (SMR) protocols such as Paxos or Raft, which order all operations through a leader. This works fine, except for reads. Reads dominate workloads in practice, and forcing every read to consult the leader or coordinate with a quorum introduces unnecessary WAN round trips. To improve efficiency, many systems have proposed linearizable local read algorithms, which allow followers to serve reads locally under certain conditions. However, this is tricky ground. Local reads introduce the risk of staleness, because the serving replica/follower may not have applied the latest writes yet.

And here's the rub: WANs exacerbate the problem. Some replicas are close. Others are hopelessly far. The close ones see fresh data. The far ones lag. Prior work noticed this and tried various tricks to smooth out the unfairness. This paper doesn’t smooth it out. It embraces it as we will see in the coming sections.

The key problem addressed by the paper is: How can linearizable local reads be achieved in a system where replicas are asymmetric in their ability to keep up with the updates?


The Blocking Problem

Over the years, many protocols have been proposed for enabling linearizable local reads. The paper reviews these under three broad categories.

Invalidation-based algorithms (e.g., Megastore, PQL, Hermes): Replicas mark themselves invalid when they receive prepares and regain validity only after commits. But this can lead to unbounded blocking: If the leader issues prepares faster than it gathers acknowledgments in a write-heavy workload, the replicas stay perpetually invalid. 

Eager Stamping algorithms (e.g., CHT): Reads are stamped with the latest prepare index, then block until the corresponding commits arrive. This avoids perpetual invalidation thanks to instance-based tracking of prepares, but it still results in blocking proportional to twice the leader's eccentricity.

Delayed Stamping algorithms (e.g., CockroachDB Global Tables): These use synchronized clocks to assign visibility windows. Here, blocking time depends on the clock skew bound Δ, which in the theoretical worst case is bounded by the relative network diameter. This worst-case bound does not apply if GPS-based clock synchronization (e.g., Google's TrueTime or AWS Timesync) is available, so the paper, in order to make its case, assumes GPS-based synchronization is not available (even though it is actually widely available). https://muratbuffalo.blogspot.com/2024/12/utilizing-highly-synchronized-clocks-in.html

Ok, I owe you an explanation of the delayed stamping approach. But first let me set this up using the paper's unifying stop/go events framework. I like this framing: when you name something, you own it. In this model, each index i on each replica has two events:

  • A stop event at which the replica stops assigning reads to indices less than i.
  • A go event at which the replica can safely serve reads at i.

This abstraction ultimately guarantees linearizability by ensuring that, across all replicas, all go events for i occur at or after all stop events for i. The framework is clean and clarifies why blocking occurs.

Now let's walk through Fig. 2 and explain the two approaches using the stop/go framework.

Eager Stamping (Fig. 2a). A follower stamps a read with the highest index i it has seen a prepare for (the stop event). It can only apply the read after it has received all commits up to i (the go event). The leader is special here: since it is always up to date, its stop and go events collapse into one.
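
To make the stop/go framing concrete, here is a minimal Python sketch of Eager Stamping at a follower (my own illustration of the framework above, not code from the paper): the stop event stamps an incoming read with the highest prepared index, and the go event fires once commits catch up to that stamp.

# Toy model of Eager Stamping local reads (illustrative sketch, not from the paper).
class Follower:
    def __init__(self):
        self.highest_prepared = 0   # highest index seen in a prepare
        self.highest_committed = 0  # highest index known to be committed
        self.waiting = []           # reads stamped above highest_committed

    def on_prepare(self, i):
        # "stop" event: new reads will be stamped at index >= i
        self.highest_prepared = max(self.highest_prepared, i)

    def on_commit(self, i):
        # "go" event: reads stamped at or below the committed index can be served
        self.highest_committed = max(self.highest_committed, i)
        ready = [(s, cb) for (s, cb) in self.waiting if s <= self.highest_committed]
        self.waiting = [(s, cb) for (s, cb) in self.waiting if s > self.highest_committed]
        for s, cb in ready:
            cb(s)

    def read(self, cb):
        stamp = self.highest_prepared
        if stamp <= self.highest_committed:
            cb(stamp)                          # served immediately
        else:
            self.waiting.append((stamp, cb))   # blocks until the commit arrives

f = Follower()
f.on_prepare(3)
f.read(lambda i: print("served at index", i))  # blocks: commit for 3 not yet seen
f.on_commit(3)                                  # go event: prints "served at index 3"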

Delayed Stamping (Fig. 2b). This approach uses synchronized clocks to decouple stop and go events from message arrival. When the leader accepts an update, it assigns it a future visibility time t+α, where the visibility delay α is derived from the estimated commit time of the update. Followers stop stamping reads with indices less than i once that visibility time has passed on their clocks. They then apply the read after an additional Δ (the clock skew/uncertainty) has elapsed. Unlike Eager Stamping, the leader does not have special stop and go events; it also follows this visibility rule. I had talked about this earlier when explaining CockroachDB's global transactions and the Aurora Limitless and DSQL future timestamping.
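
A similarly hedged sketch of Delayed Stamping (again my own toy, not the paper's code): the follower learns a visibility time for each index from the leader, stamps a read with the highest index whose visibility time has already passed on its local clock, and then waits one clock-skew bound Δ before serving it.

import time

DELTA = 0.005  # assumed clock-skew bound (5 ms), purely illustrative

class DelayedStampFollower:
    def __init__(self):
        self.visibility = {}  # index -> visibility time assigned by the leader (t + alpha)

    def on_update(self, index, visibility_time):
        self.visibility[index] = visibility_time

    def read(self):
        now = time.time()
        # stop: only indices whose visibility time has passed may be assigned to reads
        visible = [i for i, vt in self.visibility.items() if vt <= now]
        if not visible:
            return 0  # nothing visible yet; serve the initial state
        stamp = max(visible)
        # go: wait an extra DELTA so every replica has also passed the stop event for stamp
        wait_until = self.visibility[stamp] + DELTA
        if wait_until > now:
            time.sleep(wait_until - now)
        return stamp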

Table 2 summarizes the worst-case blocking time at followers for the three categories of algorithms mentioned above and the two new algorithms introduced in this work (which we explain next).

I would be remiss (like the paper) if I did not emphasize a common flaw shared across all these algorithms: the leader in these algorithms requires acknowledgments from all nodes (rather than just a quorum) before it can commit a write! If you want a local linearizable read from a single node, the write protocol is forced into a write-all model to ensure that the one-node read quorum intersects with the write quorum. This design hurts both availability and tail latency for the algorithms in Table 2. In contrast, in our Paxos Quorum Reads (PQR 2019) work we avoided the write-all model: PQR used only LSN tracking and quorum acknowledgments, and no clocks are needed. I discuss PQR at the end of this post.


Pairwise-Leader (PL)

The central idea of PL is to tailor blocking time to each replica's distance from the leader. Nearby replicas get near-zero latency, while distant ones may fare worse than before. Figure 3 provides the stepping stone for explaining the PL algorithm in Figure 4.

To get there, the paper first introduces a stepping-stone algorithm (shown in Figure 3). The trick is to deliberately delay prepare messages so that acknowledgments from all replicas reach the leader at the same time, which shrinks blocking time for followers closer to the leader. Specifically, this ensures that for any replica, the gap between its prepare and commit messages is just its round-trip distance to the leader. That alone already improves over older algorithms that tied blocking to the full network diameter.
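
The timing trick is simple arithmetic, and a small hedged sketch helps (my own rendering of the description above, not the paper's pseudocode): if the leader holds back the prepare to each follower by twice the gap between that follower's one-way delay and the maximum one-way delay, then all acknowledgments land at the leader together, and each follower sees the commit exactly one round trip after its prepare.

def prepare_send_delays(one_way_delay):
    """one_way_delay: follower -> estimated one-way delay from the leader (seconds).
    Returns how long the leader should hold the prepare for each follower so that
    all acknowledgments arrive at the leader at the same time."""
    d_max = max(one_way_delay.values())
    return {f: 2 * (d_max - d) for f, d in one_way_delay.items()}

print(prepare_send_delays({"near": 0.005, "mid": 0.030, "far": 0.080}))
# {'near': 0.15, 'mid': 0.1, 'far': 0.0}
# The nearest follower's prepare is held back the longest; its prepare-to-commit gap is
# only 2 * 0.005 s, while the farthest follower's gap is 2 * 0.080 s.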

PL then builds on this stepping-stone by further decoupling stop and go events, borrowing the spirit of Delayed Stamping but applying it pairwise. Instead of relying on synchronized clocks, PL introduces a new event scheduling primitive that ensures a replica's stop event happens just before a visibility time, and its go event just after. Figure 4 illustrates this: each replica's worst-case blocking time becomes exactly twice the relative message delay between the leader and itself (see Table 1 for notation). In other words, nearby replicas get fast, almost instant reads, while the cost for distant ones reflects only their distance from the leader.

PL introduces a new pairwise event scheduling/synchronization primitive: Instead of requiring global clock synchronization, the leader coordinates stop and go events directly with each follower. This scheduling/synchronization ensures stop/go events happen at predictable real-time offsets relative to the leader's visibility time, while exploiting known lower bounds on delays to followers to maintain correctness. Yes, unfortunately the drawback is that the delays to followers need to be reliable/predictable for correctness to hold. I discuss this problem at the end of the post.


Pairwise-All (PA)

PL optimizes aggressively for leader-adjacent replicas, but it penalizes distant ones. PA extends the pairwise trick to all replicas using all-to-all communication as shown in Figure 6.

PA's stepping-stone in Figure 6 works like PL's but shifts the synchronization target: instead of aligning acknowledgments at the leader, it delays prepare messages so they all arrive at every replica at the same time. Each process also sends acknowledgments to all others, not just the leader. The effect is that every replica can commit once its own eccentricity time has passed since receiving a prepare. As a result, the worst-case read blocking time for each replica is its relative eccentricity.

To further reduce blocking, PA applies the same decoupling (delayed stamping) idea as PL but with all replicas aligned at the visibility time (as shown in Figure 7). The leader schedules stop events near this visibility point, and each process waits for stopped events from others before issuing its own go. Since a go event is the maximum of all stopped events, correctness holds regardless of scheduling accuracy. This ensures that every replica's worst-case blocking time is bounded by its eccentricity rather than its leader-distance. In practice, that means nearby replicas don't get PL's extreme gains, but distant ones aren't punished.

That's it, that's the story. Asymmetry is real. PL exploits it ruthlessly. PA makes it fair. If you're close to the leader, PL is your friend. If you're far, PA keeps you from suffering.


Discussion

Both PL and PA assume stable latencies and non-faulty processes. The paper sketches how to tolerate failures, but variance and reconfiguration remain open issues. The funny thing is that, to justify predictable network latencies, the authors cite Aleksey's paper: "Cloudy Forecast: How Predictable is Communication Latency in the Cloud?" Ironically, that paper shows the opposite: the variance can be massive, up to 10x the median in a WAN setup and 3000x in a same-AZ setup! So either they didn't read it carefully, or they cited it for sport. Cloud networks aren't that tame and predictable for tenants.

The treatment of clock skew Δ is also odd. The paper insists Δ must be proportional to the network diameter, but that's a theoretical result, and I don't know how much it applies even to NTP-based synchronization. Moreover, in practice, GPS clocks exist, and AWS Timesync provides 50 microsecond clock uncertainty. Why not use these? The paper explicitly disallows GPS clocks to make the results from PL and PA look more favorable. A comparison against synchronized clocks would have been valuable. With clocks, blocking could be under a millisecond (as we designed in AWS DSQL) with just delayed timestamping, and that would not only be a lot simpler but would also beat anything from PL and PA significantly.

Our PQR work (2019) also tackled linearizable non-leader reads. You can also frame it as local reads, though PQR used multiple nodes. The key idea in PQR is to involve the client: The client contacts a quorum of nodes, usually gets a linearizable read in one shot, and in the rare case of an ongoing update, waits briefly and completes with a callback. PQR required no synchronized clocks and worked in a fully asynchronous model using only LSNs. It fits naturally in this space.
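
For contrast, here is a bare-bones sketch of the PQR read path (a simplification in my own words and code, not the pseudocode from the 2019 paper; the replica methods accepted_slot and committed are made-up names for illustration). The client first learns the highest accepted slot from a read quorum, then returns as soon as any quorum member reports that slot committed, briefly polling (the "rinse" phase) if an update is still in flight.

import time

def pqr_read(quorum, key, poll_interval=0.001, timeout=1.0):
    # Phase 1: barrier = highest accepted (possibly uncommitted) slot seen in the quorum
    barrier = max(r.accepted_slot(key) for r in quorum)
    # Phase 2: done once some quorum member has committed up to the barrier
    deadline = time.time() + timeout
    while time.time() < deadline:
        for r in quorum:
            slot, value = r.committed(key)
            if slot >= barrier:
                return value       # fast path: usually satisfied on the first pass
        time.sleep(poll_interval)  # slow path: an update to the key was in flight
    raise TimeoutError("no quorum member committed up to the barrier slot")

No synchronized clocks and no write-all requirement: writes still commit with a plain majority quorum, and the read quorum intersects it.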

Postgres 18 beta3, small server, sysbench

This post has performance results for Postgres 18 beta3, beta2, beta1, and 17.6 using the sysbench benchmark and a small server. The working set is cached and the benchmark is run with low concurrency (1 connection). The goal is to search for CPU regressions. This work was done by Small Datum LLC and was not sponsored.

tl;dr

  • There might be small regressions (~2%) for several range queries that don't do aggregation. This is similar to what I reported for 18 beta1.
  • Vacuum continues to be a problem for me and I had to repeat the benchmark a few times to get a stable result. It appears to be a big source of non-deterministic behavior leading to false alarms for CPU regressions in read-heavy tests that run after vacuum. In some ways, RocksDB compaction causes similar problems. Fortunately, InnoDB MVCC GC (purge) does not cause such problems.

Builds, configuration and hardware

I compiled Postgres versions 17.6, 18 beta1, 18 beta2 and 18 beta3 from source.

The server is a Beelink SER7 with a Ryzen 7 7840HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04, and an NVMe device with discard enabled and ext4 for the database.

The config file for Postgres 17.6 is x10a_c8r32.

The config files for Postgres 18 are:
  • x10b_c8r32 is functionally the same as x10a_c8r32 but adds io_method=sync
  • x10b1_c8r32 starts with x10b_c8r32 and adds vacuum_max_eager_freeze_failure_rate=0
  • x10b2_c8r32 starts with x10b_c8r32 and adds vacuum_max_eager_freeze_failure_rate=0.99

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks, and most of them test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
When the relative QPS is > 1 then some version is faster than PG 17.6. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help explain why something is faster or slower because they show how much HW is used per request.

The numbers highlighted in yellow below might be from a small regression for range queries that don't do aggregation. But note that this does not reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.
  • the mapping from microbenchmark name to Lua script is here
  • the range query without aggregation microbenchmarks use oltp_range_covered.lua with various flags set and the SQL statements it uses are here. All of these return 100 rows.
  • the scan microbenchmark uses oltp_scan.lua which is a SELECT with a WHERE clause that filters all rows (empty result set)
Relative to: x.pg176_o2nofp.x10a_c8r32.pk1
col-1 : x.pg18beta1_o2nofp.x10b_c8r32.pk1
col-2 : x.pg18beta2_o2nofp.x10b_c8r32.pk1
col-3 : x.pg18beta3_o2nofp.x10b_c8r32.pk1
col-4 : x.pg18beta3_o2nofp.x10b1_c8r32.pk1
col-5 : x.pg18beta3_o2nofp.x10b2_c8r32.pk1

col-1   col-2   col-3   col-4   col-5 -> point queries
1.00    1.00    0.98    0.99    0.99    hot-points_range=100
1.00    1.01    1.00    1.00    0.99    point-query_range=100
1.00    1.02    1.01    1.01    1.01    points-covered-pk
1.00    1.00    1.00    1.00    1.00    points-covered-si
1.01    1.01    1.00    1.00    1.00    points-notcovered-pk
1.01    1.00    1.00    1.00    1.00    points-notcovered-si
0.99    1.00    0.99    1.00    1.00    random-points_range=1000
1.01    1.00    1.00    1.00    1.00    random-points_range=100
1.01    1.01    1.00    1.00    0.99    random-points_range=10

col-1   col-2   col-3   col-4   col-5 -> range queries w/o agg
0.98    0.99    0.97    0.98    0.96    range-covered-pk_range=100
0.98    0.99    0.96    0.98    0.97    range-covered-si_range=100
0.98    0.98    0.98    0.97    0.98    range-notcovered-pk
0.99    0.99    0.98    0.98    0.98    range-notcovered-si
1.01    1.02    1.00    1.00    1.00    scan

col-1   col-2   col-3   col-4   col-5 -> range queries with agg
1.02    1.01    1.02    1.01    0.98    read-only-count_range=1000
0.98    1.01    1.01    1.00    1.03    read-only-distinct
0.99    0.99    0.99    0.99    0.99    read-only-order_range=1000
1.00    1.00    1.01    1.00    1.01    read-only_range=10000
0.99    0.99    0.99    0.99    0.99    read-only_range=100
0.99    0.99    0.99    0.98    0.99    read-only_range=10
1.01    1.00    1.00    1.00    1.01    read-only-simple
1.01    1.00    1.01    1.00    1.00    read-only-sum_range=1000

col-1   col-2   col-3   col-4   col-5 -> writes
0.99    1.00    0.98    0.98    0.98    delete_range=100
0.99    0.98    0.98    1.00    0.98    insert_range=100
0.99    0.99    0.99    0.98    0.99    read-write_range=100
0.98    0.99    0.99    0.98    0.99    read-write_range=10
1.00    0.99    0.98    0.97    0.99    update-index_range=100
1.01    1.00    0.99    1.01    1.00    update-inlist_range=100
1.00    1.00    0.99    0.96    0.99    update-nonindex_range=100
1.01    1.01    0.99    0.97    0.99    update-one_range=100
1.00    1.00    0.99    0.98    0.99    update-zipf_range=100
1.00    0.99    0.98    0.98    1.00    write-only_range=10000

September 01, 2025

DocumentDB: Comparing Emulation Internals with MongoDB

MongoDB is the leading database for document data modeling, with its Atlas service available on AWS, Azure, and Google Cloud. Its popularity has led to the development of compatible APIs by other vendors, like Amazon DocumentDB (with MongoDB compatibility), highlighting MongoDB's importance in modern applications. Microsoft did the same for CosmosDB and developed a MongoDB emulation on PostgreSQL called DocumentDB, now part of the Linux Foundation.
AWS has joined the project. While today Amazon DocumentDB uses its own Aurora‑based proprietary engine, AWS’s participation opens the possibility that, in the future, the managed service could leverage this PostgreSQL‑based extension.

An emulation cannot truly replace MongoDB, which was designed to store, index, and process documents with flexible schemas natively instead of using fixed-size blocks and relational tables, but it may help users in their transition. This article tests a simple query across three options: native MongoDB, PostgreSQL with the DocumentDB extension, and Oracle Database's emulation, another emulation on top of another RDBMS. Both emulations encounter similar challenges: implementing document semantics on top of a row-based engine. The aim is to demonstrate an evaluation method, including execution plans, to assess the pros and cons of each platform in relation to relevant application patterns, rather than specific use cases.

MongoDB

I create a simple collection with one field "nnn", indexed, and insert random values between 0 and 100:

db.franck.drop();

db.franck.createIndex({nnn:1});

void db.franck.insertMany(
 Array.from({ length: 100000 }, () => (
  { nnn: (Math.random() * 100)}
 ))
);

I query values between 20 and 80, displaying the first five for pagination:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } }
).sort({ nnn: 1 }).limit(5)

[
  { _id: ObjectId('68b37f883c2e2550c0d51c0c'), nnn: 20.00087217392812 },
  { _id: ObjectId('68b37f883c2e2550c0d5dd3c'), nnn: 20.000927538131542 },
  { _id: ObjectId('68b37f883c2e2550c0d5f1e7'), nnn: 20.000979995906974 },
  { _id: ObjectId('68b37f883c2e2550c0d59dc4'), nnn: 20.001754428025208 },
  { _id: ObjectId('68b37f883c2e2550c0d66c4f'), nnn: 20.002357317589414 }
]

Here is the execution plan with execution statistics:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } }
).sort({ nnn: 1 }).limit(5).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 5,
  executionTimeMillis: 0,
  totalKeysExamined: 5,
  totalDocsExamined: 5,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 5,
    executionTimeMillisEstimate: 0,
    works: 6,
    advanced: 5,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 5,
    inputStage: {
      stage: 'FETCH',
      nReturned: 5,
      executionTimeMillisEstimate: 0,
      works: 5,
      advanced: 5,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      docsExamined: 5,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 5,
        executionTimeMillisEstimate: 0,
        works: 5,
        advanced: 5,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: { nnn: 1 },
        indexName: 'nnn_1',
        isMultiKey: false,
        multiKeyPaths: { nnn: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { nnn: [ '[20, 80)' ] },
        keysExamined: 5,
        seeks: 1,
        dupsTested: 0,
        dupsDropped: 0
      }
    }
  }
}

MongoDB scanned the index (stage: 'IXSCAN') on "nnn" (keyPattern: { nnn: 1 }) for values between 20 and 80 (indexBounds: { nnn: [ '[20, 80)' ] }). It examined 5 index entries (keysExamined: 5) and fetched the corresponding documents, resulting in 5 documents read (docsExamined: 5). It stopped (LIMIT) after returning the documents for the result (nReturned: 5).

We achieved exactly what we needed without any unnecessary work, so no further tuning is required. We could go further, like with a covering index to avoid the FETCH stage, but it's not needed as the number of documents fetched is low and bounded.

PostgreSQL with DocumentDB

To gain a comprehensive understanding of the emulation, I examine both the execution plan from the emulation and the execution plan in the underlying database.

I start a container for my lab using the DocumentDB image from the Microsoft repo, which will later move to the Linux Foundation, and I use the default ports.

docker run -d -p 10260:10260 -p 9712:9712 --name pgddb \
ghcr.io/microsoft/documentdb/documentdb-local:latest  \
--username ddb --password ddb

I add auto-explain to show all execution plans for my lab:

## add auto-explain extension to be loaded
docker exec -it pgddb sed -e '/shared_preload_libraries/s/,/, auto_explain,/' -i /home/documentdb/postgresql/data/postgresql.conf

## bounce the instance
docker restart -t 5 pgddb

# set auto-explain by default for the emulation gateway
psql -e 'postgres://documentdb@localhost:9712/postgres' <<'SQL'
\dconfig shared_preload_libraries
alter user ddb set auto_explain.log_analyze=on;
alter user ddb set auto_explain.log_buffers=on;
alter user ddb set auto_explain.log_format=text;
alter user ddb set auto_explain.log_min_duration=0;
alter user ddb set auto_explain.log_nested_statements=on;
alter user ddb set auto_explain.log_settings=on;
alter user ddb set auto_explain.log_timing=on;
alter user ddb set auto_explain.log_triggers=on;
alter user ddb set auto_explain.log_verbose=on;
alter user ddb set auto_explain.log_wal=on;
SQL

# tail the PostgreSQL log in the background to see the execution plan
docker exec -it pgddb tail -f /home/documentdb/postgresql/data/pglog.log | grep -v " LOG: cron job" &

# connect to the MongoDB emulation gateway
mongosh 'mongodb://ddb:ddb@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true'

I create the same collection and index as in my MongoDB test, run the same query, and check the execution plan:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } }
).sort({ nnn: 1 }).limit(5).explain("executionStats").executionStats

{
  nReturned: Long('5'),
  executionTimeMillis: Long('154'),
  totalDocsExamined: Long('5'),
  totalKeysExamined: Long('5'),
  executionStages: {
    stage: 'LIMIT',
    nReturned: Long('5'),
    executionTimeMillis: Long('154'),
    totalKeysExamined: Long('5'),
    totalDocsExamined: 5,
    numBlocksFromCache: 4415,
    numBlocksFromDisk: 0,
    inputStage: {
      stage: 'SORT',
      nReturned: Long('5'),
      executionTimeMillis: Long('154'),
      totalKeysExamined: Long('5'),
      totalDocsExamined: 5,
      sortMethod: 'top-N heapsort',
      totalDataSizeSortedBytesEstimate: 26,
      numBlocksFromCache: 4415,
      numBlocksFromDisk: 0,
      inputStage: {
        stage: 'FETCH',
        nReturned: Long('59935'),
        executionTimeMillis: Long('133'),
        totalKeysExamined: Long('59935'),
        indexName: 'nnn_1',
        totalDocsRemovedByIndexRechecks: 0,
        numBlocksFromCache: 4415,
        numBlocksFromDisk: 0,
        inputStage: {
          stage: 'IXSCAN',
          nReturned: Long('59935'),
          executionTimeMillis: Long('133'),
          totalKeysExamined: Long('59935'),
          indexName: 'nnn_1',
          totalDocsRemovedByIndexRechecks: 0,
          numBlocksFromCache: 4415,
          numBlocksFromDisk: 0
        }
      }
    }
  }
}

DocumentDB scanned the index (stage: 'IXSCAN'), identifying the index name but lacking details on key patterns or index bounds. It appears to have read the correct range (from 20 to 80) but did not apply pagination efficiently, as evidenced by the high volume of index entries read (totalKeysExamined: Long('59935'), nReturned: Long('59935')). All documents were retrieved (stage: 'FETCH', nReturned: Long('59935')) and sorted for pagination (stage: 'SORT', sortMethod: 'top-N heapsort'). Ultimately, this process returned the final result of 5 documents (stage: 'LIMIT', nReturned: Long('5')), discarding the thousands of documents read.

While the query and result are similar to MongoDB, the execution differs significantly. MongoDB avoids reading all documents and sorting them because its index not only helps find a range but also returns results in order.

To grasp the underlying reasons for this difference, we need more than just the execution plan of the emulation. I installed auto-explain in my lab to analyze the execution plan in PostgreSQL:

2025-08-31 17:20:47.765 UTC [416] LOG:  duration: 160.621 ms  plan:
        Query Text: EXPLAIN (FORMAT JSON, ANALYZE True, VERBOSE True, BUFFERS True, TIMING True) SELECT document FROM documentdb_api_catalog.bson_aggregation_find($1, $2)
        Query Parameters: $1 = 'test', $2 = '\x5f0000000266696e6400070000006672616e636b000366696c7465720022000000036e6e6e00180000001024677465001400000010246c740050000000000003736f7274000e000000106e6e6e000100000000106c696d6974000500000000'
        Limit  (cost=517.87..517.88 rows=5 width=68) (actual time=160.617..160.618 rows=5 loops=1)
          Output: document, (bson_orderby(document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson))
          Buffers: shared hit=4447, temp read=1860 written=2002
          ->  Sort  (cost=517.87..580.37 rows=25000 width=68) (actual time=160.615..160.616 rows=5 loops=1)
                Output: document, (bson_orderby(document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson))
                Sort Key: (bson_orderby(collection.document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson)) NULLS FIRST
                Sort Method: top-N heapsort  Memory: 26kB
                Buffers: shared hit=4447, temp read=1860 written=2002
                ->  Index Scan using nnn_1 on documentdb_data.documents_2 collection  (cost=0.00..102.62 rows=25000 width=68) (actual time=98.698..138.723 rows=59973 loops=1)
                      Output: document, bson_orderby(document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson)
                      Index Cond: (collection.document @<> 'BSONHEX3f000000036e6e6e0035000000106d696e0014000000106d61780050000000086d696e496e636c75736976650001086d6178496e636c757369766500000000'::documentdb_core.bson)
                      Buffers: shared hit=4439, temp read=1860 written=2002
        Settings: search_path = 'documentdb_api_catalog, documentdb_api, public'

The steps are similar but more detailed at the PostgreSQL level. The index access uses the @<> operator, which checks whether a BSON value is within a specified range via a RUM index (DocumentDB uses an extended version of the RUM index that provides more metadata than a GIN index). In this case, the index was applied solely for filtering, while a separate Sort step managed the final ordering. This method requires reading all matching documents before they can be ordered. Although the example is simple, indexed fields may contain arrays, which means a forward scan must return entries in order based on the smallest element in the array. This behavior is native to MongoDB but not to PostgreSQL, and it likely explains why entries can't be retrieved in the desired order in the current version of DocumentDB.

Note that you might see a Bitmap Scan before auto-vacuum runs, but it's important to recognize that an Index Scan is also possible, which is a key distinction from GIN indexes.

The definition of the table and index is visible from PostgreSQL:

postgres=# \d documentdb_data.documents_2

                     Table "documentdb_data.documents_2"
     Column      |           Type           | Collation | Nullable | Default
-----------------+--------------------------+-----------+----------+---------
 shard_key_value | bigint                   |           | not null |
 object_id       | documentdb_core.bson     |           | not null |
 document        | documentdb_core.bson     |           | not null |
 creation_time   | timestamp with time zone |           |          |
Indexes:
    "collection_pk_2" PRIMARY KEY, btree (shard_key_value, object_id)
    "documents_rum_index_3" documentdb_rum (document documentdb_api_catalog.bson_rum_single_path_ops (path=nnn, tl='2699'))
Check constraints:
    "shard_key_value_check" CHECK (shard_key_value = '2'::bigint)

DocumentDB uses Citus for sharding, and an extended version of RUM indexes, documentdb_rum, for indexing:

postgres=# select indexdef, tablename, indexname 
           from pg_indexes 
           where schemaname='documentdb_data'
;
                                                                                indexdef                                                                                 |  tablename  |       indexname
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------------------
 CREATE UNIQUE INDEX collection_pk_1 ON documentdb_data.documents_1 USING btree (shard_key_value, object_id)                                                             | documents_1 | collection_pk_1
 CREATE UNIQUE INDEX retry_1_pkey ON documentdb_data.retry_1 USING btree (shard_key_value, transaction_id)                                                               | retry_1     | retry_1_pkey
 CREATE INDEX retry_1_object_id_idx ON documentdb_data.retry_1 USING btree (object_id)                                                                                   | retry_1     | retry_1_object_id_idx
 CREATE UNIQUE INDEX collection_pk_2 ON documentdb_data.documents_2 USING btree (shard_key_value, object_id)                                                             | documents_2 | collection_pk_2
 CREATE UNIQUE INDEX retry_2_pkey ON documentdb_data.retry_2 USING btree (shard_key_value, transaction_id)                                                               | retry_2     | retry_2_pkey
 CREATE INDEX retry_2_object_id_idx ON documentdb_data.retry_2 USING btree (object_id)                                                                                   | retry_2     | retry_2_object_id_idx
 CREATE INDEX documents_rum_index_3 ON documentdb_data.documents_2 USING documentdb_rum (document documentdb_api_catalog.bson_rum_single_path_ops (path=nnn, tl='2699')) | documents_2 | documents_rum_index_3
(7 rows)

Here is the list of extensions installed:

postgres=# \dx
                                    List of installed extensions
      Name       | Version |   Schema   |                        Description
-----------------+---------+------------+------------------------------------------------------------
 documentdb      | 0.104-0 | public     | API surface for DocumentDB for PostgreSQL
 documentdb_core | 0.104-0 | public     | Core API surface for DocumentDB on PostgreSQL
 pg_cron         | 1.6     | pg_catalog | Job scheduler for PostgreSQL
 plpgsql         | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgis         | 3.5.3   | public     | PostGIS geometry and geography spatial types and functions
 rum             | 1.3     | public     | RUM index access method
 tsm_system_rows | 1.0     | public     | TABLESAMPLE method which accepts number of rows as a limit
 vector          | 0.8.0   | public     | vector data type and ivfflat and hnsw access methods
(8 rows)

In my example, DocumentDB performed more work than MongoDB because the sort operation was not pushed down to the index scan. Achieving ordered results from a multi-key index is challenging due to multiple index entries per document. The scan must deduplicate these entries and arrange them correctly: the lowest array value for a forward scan and the greatest for a backward scan. MongoDB implemented this functionality from the get-go in its multi-key indexes. Emulation on top of SQL databases still requires further development to match that performance and scalability, since RDBMS were not designed for multi-key indexes: one-to-many relationships are typically managed in separate tables according to the first normal form. This remains a TODO in the code for DocumentDB's RUM index access method.
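
To see the multi-key ordering rule in isolation, here is a small hypothetical mongosh illustration (not from the article, and the arr collection is made up for this demo): with an array field, a forward sort orders documents by their smallest element, which is exactly what the index scan has to reproduce after deduplicating the per-element entries.

db.arr.drop();
db.arr.insertMany([ { nnn: [ 5, 90 ] }, { nnn: [ 10, 20 ] } ]);
db.arr.createIndex({ nnn: 1 });
db.arr.find({}, { _id: 0 }).sort({ nnn: 1 });
// returns { nnn: [ 5, 90 ] } before { nnn: [ 10, 20 ] } because 5 < 10,
// even though the other elements compare the other way (90 > 20)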

The order-by pushdown is not the only limitation. If you cannot read just the five index entries needed for the query, you should at least try to avoid fetching thousands of documents. In MongoDB, using a covering index will replace the FETCH stage with a PROJECTION_COVERED stage. I attempted the same in DocumentDB by omitting the "_id" from the projection:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } },
  { _id: 0, nnn: 1 }
).sort({ nnn: 1 }).limit(5).hint({nnn:1, _id:1 }).explain("executionStats").executionStats

{
  nReturned: Long('5'),
  executionTimeMillis: Long('170'),
  totalDocsExamined: Long('5'),
  totalKeysExamined: Long('5'),
  executionStages: {
    stage: 'LIMIT',
    nReturned: Long('5'),
    executionTimeMillis: Long('170'),
    totalKeysExamined: Long('5'),
    totalDocsExamined: 5,
    numBlocksFromCache: 4607,
    numBlocksFromDisk: 0,
    inputStage: {
      stage: 'PROJECT',
      nReturned: Long('5'),
      executionTimeMillis: Long('170'),
      totalKeysExamined: Long('5'),
      totalDocsExamined: 5,
      numBlocksFromCache: 4607,
      numBlocksFromDisk: 0,
      inputStage: {
        stage: 'SORT',
        nReturned: Long('5'),
        executionTimeMillis: Long('170'),
        totalKeysExamined: Long('5'),
        totalDocsExamined: 5,
        sortMethod: 'top-N heapsort',
        totalDataSizeSortedBytesEstimate: 25,
        numBlocksFromCache: 4607,

August 28, 2025

Updates to the Same Value: MongoDB Optimization

In MongoDB, if you update a field to the same value it already has, the database query layer optimizes the operation by skipping index writes when no indexed fields change, and avoiding rewriting the document when no fields are different. These updates happen more often than you might think — for example, when an ORM updates all fields instead of just the changed ones to minimize the number of statements to parse, when an application saves all data from the UI, even if they come from a previous read, or during periodic syncs from external systems that provide a full snapshot rather than incremental changes.

SQL update to the same value

Before demonstrating this in MongoDB, here is an example on PostgreSQL. I create a simple table with one indexed column, the primary key, and one unindexed column:

create table test as 
select
 generate_series(1,1000) as key,
 'hello world'             as val
;
alter table test add primary key (key)
;
create index on test (val)
;
vacuum analyze test
;

I update the unindexed column to the same value (set val=val) for one row (where key = 42) in a loop (\watch), and look at the number of WAL records generated (explain (analyze, wal)):

explain (analyze, buffers, wal, costs off, summary off)
 update test 
 set val=val
 where key = 42
\watch

Here is the raw output:

After multiple modifications to the same value, it still generates one WAL record for the change to the table row:

 Update on test (actual time=0.042..0.042 rows=0.00 loops=1)
   Buffers: shared hit=5
   WAL: records=1 bytes=68
   ->  Index Scan using test_pkey on test (actual time=0.023..0.024 rows=1.00 loops=1)
         Index Cond: (key = 42)
         Index Searches: 1
         Buffers: shared hit=3

PostgreSQL detects that it is the same value, and avoids updating the index entry. However, the table is updated to the same value. There are reasons for that: in an SQL database, DML (Data Manipulation Language) statements like UPDATE denote an intention to update. Even if the value is the same, the user can expect locks to be acquired, triggers to be raised, and the audit log to record it.

I've run the update multiple times to get the optimized scenario. The first update generated four WAL records: it inserted the new version (with the same values) into a new block, updated the previous version, and modified two index entries to reference this new block. Since the new tuple was appended to the table without any concurrent inserts or updates, the second update found free space within the same block. This allowed it to avoid generating additional WAL records for updating the physical location, utilizing a HOT update optimization. However, it did create one WAL record for the Heap Only Tuple and an additional record during the read phase to clean up the old index entry. The next runs find space in the same block (as the previous versions there can be cleaned up) and do not have to clean up old index entries, so they continue with a single WAL record generated.
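
A hedged way to confirm this from the statistics side (assuming the test table created above; the counters are cumulative and may lag slightly behind the statements):

select n_tup_upd, n_tup_hot_upd
from pg_stat_user_tables
where relname = 'test';

After the repeated same-value updates, n_tup_hot_upd should account for all but the first update, matching the single-WAL-record behavior described above.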

MongoDB $set to the same value

In SQL, statements declare an intention, which affects storage even when updating to the same value, to record the intention (in triggers, audit, transaction manager). In contrast, MongoDB statements declare a state, representing the new version of a document to synchronize transient application objects with the persistent database. Moreover, it's beneficial to use idempotent calls to a resilient database, allowing updates to the same value, as this practice is not uncommon and facilitates retrying in case of failures.

When MongoDB applies an update to a document, it compares the two versions (DiffApplier), not only to optimize for this case but also to record only the changes to save memory in the cache. During this comparison, it skips updating indexes for fields that didn't change between the two versions, and even skips the update when the two versions are identical.

I'm writing this following a question on the MongoDB forum. To answer accurately, I tested in a lab where I can log what is actually updated.

Logging writes in MongoDB

I start a lab that logs the write operations:


# Start MongoDB with db.setLogLevel(1, "write")

docker run --name mg -d -p 27017:27017 mongo \
 --setParameter 'logComponentVerbosity={write:{verbosity:1}}' \
 --replSet rs0

mongosh --eval "rs.initiate()"

# display logs nicely formatted (with jq)

docker logs -f    mg | jq -c ' select(.c=="WRITE") | {
  keysInserted: .attr.keysInserted,
  keysDeleted:  .attr.keysDeleted,
  nMatched:     .attr.nMatched,
  nModified:    .attr.nModified,
  nUpserted:    .attr.nUpserted,
  planSummary:  .attr.planSummary,
  keysExamined: .attr.keysExamined,
  docsExamined: .attr.docsExamined,
  component: .c,
  numYields:    .attr.numYields,
  locks: {
   Global:      .attr.locks.Global.acquireCount.w,
   Database:    .attr.locks.Database.acquireCount.w,
   Collection:  .attr.locks.Collection.acquireCount.w,
   }
}' &

If you don't have jq installed, you can simply run docker logs -f, but I wanted an output that fits well in this blog post.

I connect with mongosh and create a collection with one document, one indexed field and one non-indexed field:

db.test.insertOne({ _id: 42, indexedField: "A", otherField: 101 });
db.test.createIndex({ indexedField: 1 });

I run an update that sets the same value on the indexed field:

db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A"                   } 
})

{"keysInserted":null,"keysDeleted":null,"nMatched":1,"nModified":0,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

The read part is executed, with an index scan on the key (EXPRESS_IXSCAN) finding one key and one document, but there's no document modified ("nModified":0). "keysInserted" and "keysDeleted", related to the index, are not even present in the log.

I run another update that sets the non-indexed field to a different value:

db.test.updateOne(
 { _id: 42 },
  { $set: {                     otherField: 102 } 
})

{"keysInserted":0   ,"keysDeleted":0   ,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

Here, the document was modified ("nModified":1) but index key modification has been skipped ("keysInserted":0 ,"keysDeleted":0) as the modified field is not indexed, so the existing index entry still finds the document.

The write amplification seen in PostgreSQL when the new version must be written to a new location doesn't happen here because in MongoDB the indexes reference a logical RecordId rather than the physical location like PostgreSQL's CTID.

I update the indexed field to a new value:

db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "B"                   } 
})

{"keysInserted":1   ,"keysDeleted":1   ,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

This must modify the document and the index entry. The index entry for the old value is deleted ("keysDeleted":1) and the new one inserted ("keysInserted":1).

Finally, I set back the document to the initial values, changing both fields:

db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A" , otherField: 101 } 
})

{"keysInserted":1   ,"keysDeleted":1   ,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

As at least one field is different, the document is modified ("nModified":1), and as an indexed field has changed, the index entry is updated ("keysInserted":1, "keysDeleted":1).

No write conflict in MongoDB

To validate that an update in MongoDB does not affect write consistency guarantees when it doesn't change the document, I run an update to the same value within a transaction:


// start a transaction
const session = db.getMongo().startSession();
const TX = session.getDatabase(db.getName());
session.startTransaction();

// in the transaction, update the document to the same value
TX.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A" , otherField: 101 }
})

This doesn't modify the document as the values are the same:

{"keysInserted":null,"keysDeleted":null,"nMatched":1,"nModified":0,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

While the transaction is active, I update to a new value in another session:

// concurrently, update the document to a different value
db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A" , otherField: 102 }
})

This updates the document:

{"keysInserted":0,"keysDeleted":0,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":2,"Database":2,"Collection":2}}

That proves that the transaction didn't record any write intention as there is no write conflict. It can commit:

// commit
session.commitTransaction();

If you run the same test but with different values, you will see the transparent retries, thanks to the automatic backoff loop, until the transaction times out (1 minute) and the update can then complete. The number of attempts is visible in the lock acquisition counts:

{"keysInserted":null,"keysDeleted":null,"nMatched":null,"nModified":null,"nUpserted":null,"planSummary":null,"keysExamined":null,"docsExamined":null,"component":"WRITE","numYields":null,"locks":{"Global":null,"Database":null,"Collection":null}}
...
{"keysInserted":null,"keysDeleted":null,"nMatched":null,"nModified":null,"nUpserted":null,"planSummary":null,"keysExamined":null,"docsExamined":null,"component":"WRITE","numYields":null,"locks":{"Global":null,"Database":null,"Collection":null}}
{"keysInserted":0,"keysDeleted":0,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":534,"docsExamined":534,"component":"WRITE","numYields":533,"locks":{"Global":535,"Database":535,"Collection":535}}

Conclusion

I compare PostgreSQL and MongoDB as they are the most popular representatives of relational and document databases, respectively. Beyond performance, understanding the behavior is critical.

In PostgreSQL, an UPDATE statement indicates an intention to perform an operation, and the database executes it even if the stored value remains unchanged. This ensures SQL developers' expectations are met: locks are acquired, triggers activate, and changes are logged. While index modification can be skipped, a new version of the row is still recorded to fulfill the command.

In contrast, MongoDB operations specify the desired final document state rather than an explicit action. When updating, MongoDB computes the new document version and, if unchanged, skips rewriting it and avoids unnecessary index writes. This approach aligns well with idempotent, retry-friendly patterns and reduces write amplification when no actual change occurs.

Updates to the Same Value: MongoDB Optimization

In MongoDB, when you update a field to the same value it already holds, the database optimizes the operation by skipping index writes if no indexed fields have changed and avoids rewriting the document when no fields differ. This behavior contrasts with SQL databases.

Such updates occur more frequently than one might expect—for instance, when an ORM updates all fields instead of just the modified ones to reduce the number of statements to parse; when an application saves all data from the UI, even if it originates from a previous read; or during periodic syncs from external systems that provide a complete snapshot rather than incremental changes.

SQL update to the same value

Before exposing this in MongoDB, here is an example on PostgreSQL. I create a simple table with one indexed column, the primary key, and one unindexed column:

create table test as 
select
 generate_series(1,1000) as key,
 'hello world'             as val
;
alter table test add primary key (key)
;
create index on test (val)
;
vacuum analyze test
;

I update the unindexed column to the same value (set val=val) for one row (where key = 42) in a loop (\watch), and look at the number of WAL records generated (explain (analyze, wal)):

explain (analyze, buffers, wal, costs off, summary off)
 update test 
 set val=val
 where key = 42
\watch

Here is the row output:

After multiple modifications to the same value, it still generates one WAL record for the change to the table row:

 Update on test (actual time=0.042..0.042 rows=0.00 loops=1)
   Buffers: shared hit=5
   WAL: records=1 bytes=68
   ->  Index Scan using test_pkey on test (actual time=0.023..0.024 rows=1.00 loops=1)
         Index Cond: (key = 42)
         Index Searches: 1
         Buffers: shared hit=3

PostgreSQL detects that it is the same value, and avoids updating the index entry. However, the table is updated to the same value. There are reasons for that: In an SQL database, Data Manipulation Language (DML) statements like UPDATE denote an intention to update. Even if the value is the same, the user can expect locks to be acquired, triggers to be raised, and the audit log to record it.

I've run the update multiple times to get the optimized scenario. The first update generated four WAL records: It inserted the new version (with the same values) into a new block, updated the previous version, and modified two index entries to reference this new block. Since the new tuple was appended to the table without any concurrent inserts or updates, the second update found free space within the same block. This allowed it to avoid generating additional WAL records for updating the physical location, utilizing a HOT update optimization. However, it did create one WAL record for the Heap Only Tuple and an additional record during the read phase to clean up the old index entry. The next runs find space in the same block (as the previous versions there can be cleaned up) and do not have to clean up old index entries, so they continue with a single WAL record generated.

MongoDB $set to the same value

In SQL, statements declare an intention, which affects storage even when updating to the same value, to record the intention (in triggers, audit, transaction manager). In contrast, MongoDB statements declare a state, representing the new version of a document to synchronize transient application objects with the persistent database. Moreover, it's beneficial to use idempotent calls to a resilient database, allowing updates to the same value, as this practice is not uncommon and facilitates retrying in case of failures.

When MongoDB applies an update to a document, it compares the two versions (DiffApplier), not only to optimize for this case but also to record only the changes to save memory in the cache. During this comparison, it skips updating indexes for fields that didn't change between the two versions, and even skips the update when the two versions are identical.

I'm writing this following a question on the MongoDB forum. To answer accurately, I tested in a lab where I can log what is actually updated.

Logging writes in MongoDB

I start a lab that logs the write operations:


# Start MongoDB with db.setLogLevel(1, "write")

docker run --name mg -d -p 27017:27017 mongo \
 --setParameter 'logComponentVerbosity={write:{verbosity:1}}' \
 --replSet rs0

mongosh --eval "rs.initiate()"

# display logs nicely formatted (with jq)

docker logs -f    mg | jq -c ' select(.c=="WRITE") | {
  keysInserted: .attr.keysInserted,
  keysDeleted:  .attr.keysDeleted,
  nMatched:     .attr.nMatched,
  nModified:    .attr.nModified,
  nUpserted:    .attr.nUpserted,
  planSummary:  .attr.planSummary,
  keysExamined: .attr.keysExamined,
  docsExamined: .attr.docsExamined,
  component:    .c,
  numYields:    .attr.numYields,
  locks: {
   Global:      .attr.locks.Global.acquireCount.w,
   Database:    .attr.locks.Database.acquireCount.w,
   Collection:  .attr.locks.Collection.acquireCount.w
   }
}' &

If you don't have jq installed, you can simply run docker logs -f, but I wanted an output that fits well in this blog post.
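
As the comment above suggests, the same verbosity can also be raised at runtime from mongosh instead of a startup parameter:

// equivalent to logComponentVerbosity={write:{verbosity:1}}, set on a running instance
db.setLogLevel(1, "write")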

I connect with mongosh and create a collection with one document, one indexed field, and one non-indexed field:

db.test.insertOne({ _id: 42, indexedField: "A", otherField: 101 });
db.test.createIndex({ indexedField: 1 });

I run an update that sets the same value on the indexed field:

db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A"                   } 
})

{"keysInserted":null,"keysDeleted":null,"nMatched":1,"nModified":0,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

The read part is executed, with an index scan on the key (EXPRESS_IXSCAN) finding one key and one document, but there's no document modified ("nModified":0). "keysInserted" and "keysDeleted", related to the index, are not even present in the log.

I run another update that sets the non-indexed field to a different value:

db.test.updateOne(
 { _id: 42 },
  { $set: {                     otherField: 102 } 
})

{"keysInserted":0   ,"keysDeleted":0   ,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

Here, the document was modified ("nModified":1) but index maintenance was skipped ("keysInserted":0, "keysDeleted":0) because the modified field is not indexed.

The write amplification seen in PostgreSQL when the new version must be written to a new location doesn't happen here because in MongoDB, the indexes reference a logical RecordId rather than the physical location like PostgreSQL's CTID.
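
To see the logical RecordId that the index entries point to, mongosh can expose it on the cursor (an illustrative check, not part of the original test):

// showRecordId() adds a $recordId field to each returned document
db.test.find({ _id: 42 }).showRecordId()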

I update the indexed field to a new value:

db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "B"                   } 
})

{"keysInserted":1   ,"keysDeleted":1   ,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

This must modify the document and the index entry. The index entry for the old value is deleted ("keysDeleted":1) and the new one inserted ("keysInserted":1).

Finally, I set back the document to the initial values, changing both fields:

db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A" , otherField: 101 } 
})

{"keysInserted":1   ,"keysDeleted":1   ,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

As at least one field is different, the document is modified ("nModified":1), and as an indexed field has changed, the index entry is updated ("keysInserted":1, "keysDeleted":1).

No write conflict in MongoDB

To validate that an update in MongoDB does not affect write consistency guarantees when it doesn't change the document, I run an update to the same value within a transaction:


// start a transaction
const session = db.getMongo().startSession();
const TX = session.getDatabase(db.getName());
session.startTransaction();

// in the transaction, update the document to the same value
TX.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A" , otherField: 101 }
})

This doesn't modify the document, as the values are the same:

{"keysInserted":null,"keysDeleted":null,"nMatched":1,"nModified":0,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":1,"Database":1,"Collection":1}}

While the transaction is active, I update to a new value in another session:

// concurrently, update the document to a different value
db.test.updateOne(
 { _id: 42 },
 { $set: { indexedField: "A" , otherField: 102 }
})

This updates the document:

{"keysInserted":0,"keysDeleted":0,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":1,"docsExamined":1,"component":"WRITE","numYields":0,"locks":{"Global":2,"Database":2,"Collection":2}}

That proves that the transaction didn't record any write intention as there is no write conflict. It can commit:

// commit
session.commitTransaction();

If you run the same test but with different values, you will see the transparent retries, thanks to the automatic backoff loop, until the transaction times out (one minute) and the update can finally complete. The number of attempts is visible in the number of lightweight lock acquisitions:

{"keysInserted":null,"keysDeleted":null,"nMatched":null,"nModified":null,"nUpserted":null,"planSummary":null,"keysExamined":null,"docsExamined":null,"component":"WRITE","numYields":null,"locks":{"Global":null,"Database":null,"Collection":null}}
...
{"keysInserted":null,"keysDeleted":null,"nMatched":null,"nModified":null,"nUpserted":null,"planSummary":null,"keysExamined":null,"docsExamined":null,"component":"WRITE","numYields":null,"locks":{"Global":null,"Database":null,"Collection":null}}
{"keysInserted":0,"keysDeleted":0,"nMatched":1,"nModified":1,"nUpserted":0,"planSummary":"EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE","keysExamined":534,"docsExamined":534,"component":"WRITE","numYields":533,"locks":{"Global":535,"Database":535,"Collection":535}}

Conclusion

I compare PostgreSQL and MongoDB as they are the most popular representatives of relational and document databases, respectively. Beyond performance, understanding the behavior is critical.

In PostgreSQL, an UPDATE statement indicates an intention to perform an operation, and the database executes it even if the stored value remains unchanged. This ensures SQL developers' expectations are met: Locks are acquired, triggers are activated, and changes are logged. While index modification can be skipped, a new version of the row is still recorded to fulfill the command.

In contrast, MongoDB operations specify the desired final document state rather than an explicit action. When updating, MongoDB computes the new document version and, if unchanged, skips rewriting it and avoids unnecessary index writes. This approach aligns well with idempotent, retry-friendly patterns and reduces write amplification when no actual change occurs.

It’s End of Life for Redis Enterprise 7.2 in Six Months – What Are Your Options?

From a technology perspective, Redis does a great job as a database and data cache. According to 6Sense, Redis is currently the number one product as an in-memory data cache. It just works. However, that position is also a potential challenge. Redis does its job so well that, well, why should you change it? One […]

August 27, 2025

PostgreSQL JSONB Size Limits to Prevent TOAST Slicing

In my previous post Embedding Into JSONB Still Feels Like a JOIN for Large Documents, I examined the behavior of large JSONB documents in PostgreSQL when they exceed 2 KB. This limitation can compromise the data locality objectives inherent in the document model. However, it's worth noting that this limit can be increased up to 32 KB.

Details about TOAST thresholds

PostgreSQL stores table rows in fixed-size pages (BLCKSZ), defaulting to 8 KB but configurable up to 32 KB at compile time. If a row value exceeds the TOAST_TUPLE_THRESHOLD, about a quarter of the page size (≈ 2 KB for 8 KB pages), the TOAST mechanism activates. PostgreSQL first attempts to reduce the row size to the TOAST_TUPLE_TARGET (defaulting to the threshold, but tunable per table with ALTER TABLE … SET (toast_tuple_target = …)) by compressing or relocating large TOASTable columns.

If the value remains too large, it is written to a separate TOAST table in chunks (up to TOAST_MAX_CHUNK_SIZE, also about a quarter of the page size), with each chunk stored as a separate row and retrieved via an index on (chunk_id, chunk_seq).
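
Both constants derive from the page size, which can be checked from SQL (a rough sanity check; the exact values subtract some header overhead at compile time):

SHOW block_size;  -- 8192 on a default build, 32768 in the 32 KB lab below
SELECT current_setting('block_size')::int / 4 AS approx_toast_threshold_and_chunk_size;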

TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET can be adjusted at compile time, while TOAST_MAX_CHUNK_SIZE and page size changes require initdb for a new database.

Storage strategies (PLAIN, MAIN, EXTERNAL, EXTENDED) and compression methods (pglz, lz4, etc.) are configurable in SQL. For large JSONB fields, exceeding the threshold incurs the cost of compression and indexed fetching, akin to an implicit join.
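
For illustration (not part of the original post), both can be set and inspected in plain SQL, here against the orders table used later:

ALTER TABLE orders ALTER COLUMN items SET STORAGE EXTENDED;
ALTER TABLE orders ALTER COLUMN items SET COMPRESSION lz4;
-- attstorage is p/m/e/x (plain/main/external/extended), attcompression is the method
SELECT attname, attstorage, attcompression
FROM pg_attribute
WHERE attrelid = 'orders'::regclass AND attnum > 0;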

To fit the largest document possible in PostgreSQL, we must use the largest block size available, which is 32KB. This requires compiling PostgreSQL and creating a new database. While this is not feasible in a managed service environment, we can perform these actions in a lab setting.

32 KB BLCKSZ PostgreSQL lab

I used the official Dockerfile and added --with-blocksize=32 to the configure command:

diff --git a/18/alpine3.22/Dockerfile b/18/alpine3.22/Dockerfile
index 0a8c650..8ed3d26 100644
--- a/18/alpine3.22/Dockerfile
+++ b/18/alpine3.22/Dockerfile
@@ -148,6 +148,7 @@ RUN set -eux; \
                --with-llvm \
                --with-lz4 \
                --with-zstd \
+               --with-blocksize=32 \
        ; \
        make -j "$(nproc)" world-bin; \
        make install-world-bin; \

I built the image and started a container:

sudo docker build -t pg18-blksz32 .

sudo docker rm -f pg18-blksz32
sudo docker exec -it -e PGUSER=postgres $(
 sudo docker run -d --name pg18-blksz32 -e POSTGRES_PASSWORD=x pg18-blksz32
sleep 5
) psql

increase toast_tuple_target

I created the same table as in the previous post with only one difference: I set the TOAST target to the maximum, the block size minus the tuple header (24 bytes) and TOAST header (8 bytes), at the table level:

ALTER TABLE orders SET (toast_tuple_target = 32736);

Another option would have been to disable external storage at the column level, so that the value only moves out of line when it doesn't fit in the block:

ALTER TABLE orders ALTER COLUMN items SET STORAGE MAIN;

Explain (analyze, buffers, serialize)

After setting the document with the UPDATE statement of the previous post, I checked the number of pages read to get one document:

postgres=# explain (analyze, buffers, serialize text, verbose, costs off)
select *
 from orders
 where ord_id = 42
;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.orders (actual time=0.047..0.048 rows=1.00 loops=1)
   Output: ord_id, ord_dat, items
   Recheck Cond: (orders.ord_id = 42)
   Heap Blocks: exact=2
   Buffers: shared hit=4
   ->  Bitmap Index Scan on orders_pkey (actual time=0.034..0.035 rows=2.00 loops=1)
         Index Cond: (orders.ord_id = 42)
         Index Searches: 1
         Buffers: shared hit=2
 Planning Time: 0.067 ms
 Serialization: time=0.036 ms  output=20kB  format=text
 Execution Time: 0.169 ms

Here the index scan accounts for 2 of the 4 buffer hits and the heap blocks for the other 2. Because the document fits entirely within this 32 KB page, no additional buffers are needed during the Serialization phase, unlike the six buffer hits required in the database with an 8 KB block size.

Larger documents

If the document size increases above the block size, it will be TOASTed:


postgres=# update orders 
 set items = items || items 
 where ord_id=42
;
UPDATE 1

postgres=# SELECT o.ord_id, o.ord_dat, t.chunk_id, t.chunk_seq, t.ctid,
       pg_size_pretty(length(t.chunk_data)::bigint) AS chunk_size
FROM orders o
JOIN pg_toast.pg_toast_16384 t
  ON t.chunk_id = pg_column_toast_chunk_id(o.items)
WHERE o.ord_id = 42
ORDER BY t.chunk_seq
;
 ord_id |            ord_dat            | chunk_id | chunk_seq | ctid  | chunk_size
--------+-------------------------------+----------+-----------+-------+------------
     42 | 2025-08-27 17:21:30.677401+00 |   116406 |         0 | (0,1) | 8140 bytes
     42 | 2025-08-27 17:21:30.677401+00 |   116406 |         1 | (0,2) | 8140 bytes
     42 | 2025-08-27 17:21:30.677401+00 |   116406 |         2 | (0,3) | 8140 bytes
     42 | 2025-08-27 17:21:30.677401+00 |   116406 |         3 | (0,4) | 8140 bytes
     42 | 2025-08-27 17:21:30.677401+00 |   116406 |         4 | (1,1) | 7524 bytes
(5 rows)

When stored inline, the document was 20 KB (output=20kB) and was stored as a single datum. Now that it has doubled to 40 KB, it is stored as five chunks, because the maximum TOAST chunk size is about a quarter of the block size. That matters little, because the chunks are stored on the same page or on contiguous pages, as indicated by the CTID. We have doubled the size, so it makes sense to read two heap blocks instead of one. However, those reads come in addition to the main tuple, and the main overhead is the TOAST index traversal needed to find those pages. There are 6 pages read in total:

postgres=# explain (analyze, buffers, serialize text, verbose, costs off)
select *
 from orders
 where ord_id = 42
;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.orders (actual time=0.028..0.028 rows=1.00 loops=1)
   Output: ord_id, ord_dat, items
   Recheck Cond: (orders.ord_id = 42)
   Heap Blocks: exact=1
   Buffers: shared hit=3
   ->  Bitmap Index Scan on orders_pkey (actual time=0.016..0.016 rows=1.00 loops=1)
         Index Cond: (orders.ord_id = 42)
         Index Searches: 1
         Buffers: shared hit=2
 Planning Time: 0.109 ms
 Serialization: time=0.143 ms  output=40kB  format=text
   Buffers: shared hit=3
 Execution Time: 0.238 ms

Compared to the previous example on an 8 KB block size database, fewer pages are read in total because the indexes benefit from the large block size and need fewer branch levels:

postgres=# create extension if not exists pageinspect;
CREATE EXTENSION

postgres=# select btpo_level from bt_page_stats('public.orders_pkey', (
 select root from bt_metap('public.orders_pkey'))
);

 btpo_level
------------
          1
(1 row)

postgres=# select btpo_level from bt_page_stats('pg_toast.pg_toast_16384_index', (
 select root from bt_metap('pg_toast.pg_toast_16384_index'))
);

 btpo_level
------------
          1
(1 row)

The main advantage of a document database is to offer a data model that natively matches application objects, without the complexity of object–relational mapping (ORM), and to preserve that logical model down to the physical layout. This reduces random I/O, improves cache efficiency, enables transparent sharding, and increases transactional scalability.

Conclusion

In MongoDB, documents are stored in a single block, with size ranging from 32 KB to 16 MB (the maximum BSON size), which ensures strong data locality. Each document typically corresponds to a domain-driven design (DDD) aggregate and can embed large fields, including text, arrays, vectors, extended references, and one-to-many relationships. Common document sizes ranging from 32 KB to 256 KB align with WiredTiger’s minimum I/O size, cloud storage maximum I/O size, and modern CPU cache sizes.

JSONB in PostgreSQL allows some locality for small documents. However, when a document exceeds about 2 KB, it moves to TOAST storage, breaking locality and adding overhead due to an extra index traversal. Increasing the page size to 32 KB allows larger documents to remain in-line, but this still falls short of the sizes typical in document databases. Moreover, achieving these sizes often requires non‑default builds and self-managed deployments, which are usually unavailable in managed services.

PostgreSQL’s TOAST mechanism allows for the storage of very large JSONB values, but it comes with a performance trade-off, impacting locality and access speed in favor of accommodating oversized data in a fixed block size storage engine. In contrast, MongoDB is designed to maintain whole-document locality up to a significantly larger size limit, and preserves the logical model of an application in its physical storage. While both databases can store JSON, they fulfill different purposes: MongoDB is optimized for an application-centric document model preserved down to the storage layer, whereas PostgreSQL is tailored for a database-centric normalized model, enhancing it with JSONB either for small additional attributes or for storing large documents that are accessed infrequently.

August 26, 2025

Astound Supports IPv6 Only in Washington

In the hopes that it saves someone else two hours later: the ISP Astound only supports IPv6 in Washington State. You might find this page which says “Astound supports IPv6 in most locations”. Their tech support agents might tell you that they support v6 on your connection, even if you are not in Washington. “Yes, we do support both DHCPv6 and SLAAC”, they might say, and tell you to use a prefix delegation size of 60. If you are staring at tcpdump and wondering why you’re not seeing anything coming back from your router’s plaintive requests for address information, it is because they do not, in fact, support v6 anywhere but Washington.

Valkey 9.0: Enterprise-Ready, Open Source, and Coming September 15, 2025

Circle September 15 on your calendar! That’s when Valkey 9.0 officially drops, bringing enterprise-grade features that solve real operational headaches without the licensing restrictions or unpredictable costs you face with Redis. If you’ve been following Valkey since it forked from Redis, this release represents a major milestone. The same engineers who built Redis now work […]

MySQL 5.6 thru 9.4: small server, Insert Benchmark

This has results for the Insert Benchmark on a small server with InnoDB from MySQL 5.6 through 9.4. The workload here uses low concurrency (1 client), a small server and a cached database. I run it this way to look for CPU regressions before moving on to IO-bound workloads with high concurrency.

tl;dr

  • good news - there are no large regressions after MySQL 8.0
  • bad news - there are large regressions from MySQL 5.6 to 5.7 to 8.0
    • load in 8.0, 8.4 and 9.4 gets about 60% of the throughput vs 5.6
    • queries in 8.0, 8.4 and 9.4 get between 60% and 70% of the throughput vs 5.6

Builds, configuration and hardware

I compiled MySQL 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0 from source.

The server is an ASUS PN53 with 8 cores, AMD SMT disabled and 32G of RAM. The OS is Ubuntu 24.04. Storage is 1 NVMe device with ext4. More details on it are here.

I used the cz12a_c8r32 config file (my.cnf) which is here for 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0.

The Benchmark

The benchmark is explained here. I recently updated the benchmark client to connect via socket rather than TCP so that I can get non-SSL connections for all versions tested. AFAIK, with TCP I can only get SSL connections for MySQL 8.4 and 9.4.

The workload uses 1 client, 1 table with 30M rows and a cached database.

The benchmark steps are:

  • l.i0
    • insert 30 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40 million rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10 million rows are inserted and deleted per table.
    • Wait for N seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of N is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. The summary section is here.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result for MySQL 5.6.51.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

Results: details

This table is a copy of the second table in the summary section. It lists the relative QPS (rQPS) for each benchmark step where rQPS is explained above.

The benchmark steps are explained above, they are:
  • l.i0 - initial load in PK order
  • l.x - create 3 secondary indexes per table
  • l.i1, l.i2 - random inserts and random deletes
  • qr100, qr500, qr1000 - short range queries with background writes
  • qp100, qp500, qp1000 - point queries with background writes

dbms    l.i0    l.x     l.i1    l.i2    qr100   qp100   qr500   qp500   qr1000  qp1000
5.6.51  1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
5.7.44  0.89    1.52    1.14    1.08    0.83    0.84    0.83    0.84    0.84    0.84
8.0.43  0.60    2.50    1.04    0.86    0.69    0.62    0.69    0.63    0.70    0.62
8.4.6   0.60    2.53    1.03    0.86    0.68    0.61    0.67    0.61    0.68    0.61
9.4.0   0.60    2.53    1.03    0.87    0.70    0.63    0.70    0.63    0.70    0.62



The summary is:
  • l.i0
    • there are large regressions starting in 8.0 and modern MySQL only gets ~60% of the throughput relative to 5.6 because modern MySQL has more CPU overhead
  • l.x
    • I ignore this but there have been improvements
  • l.i1, l.i2
    • there was a large improvement in 5.7 but new CPU overhead since 8.0 reduces that
  • qr100, qr500, qr1000
    • there are large regressions from 5.6 to 5.7 and then again from 5.7 to 8.0
    • throughput in modern MySQL is ~60% to 70% of what it was in 5.6


August 25, 2025

Don’t Trust, Verify: How MyDumper’s Checksums Validates Data Consistency

How do you know if your backup is truly reliable? The last thing you want is to discover your data is corrupted during a critical restore or during a migration. While MyDumper is a powerful tool for logical backups, its -M option takes backup integrity to the next level by creating checksums. This often-overlooked feature […]