a curated list of database news from authoritative sources

January 08, 2026

Unlock Amazon Aurora’s Advanced Features with Standard JDBC Driver using AWS Advanced JDBC Wrapper

In this post, we show how you can enhance your Java application with the cloud-based capabilities of Amazon Aurora by using the JDBC Wrapper. Simple code changes shared in this post can transform a standard JDBC application to use fast failover, read/write splitting, IAM authentication, AWS Secrets Manager integration, and federated authentication.

Postgres vs tproc-c on a small server

This is my first post with results from tproc-c using HammerDB. This post has results for Postgres. 

tl;dr - across 8 workloads (low and medium concurrency, cached database to IO-bound)

  • there might be a regression for Postgres 14.20 and 15.15 in one workload
  • there are improvements, some big, for Postgres 17 and 18 in most workloads

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18, the config file is named conf.diff.cx10a_c8r32. The config files are as similar as possible across versions and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method='sync' which matches behavior in earlier Postgres versions.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
  • vu=6, w=100 - 6 virtual users, 100 warehouses
  • vu=1, w=1000 - 1 virtual user, 1000 warehouses
  • vu=6, w=1000 - 6 virtual users, 1000 warehouses
  • vu=1, w=2000 - 1 virtual user, 2000 warehouses
  • vu=6, w=2000 - 6 virtual users, 2000 warehouses
  • vu=1, w=4000 - 1 virtual user, 4000 warehouses
  • vu=6, w=4000 - 6 virtual users, 4000 warehouses
The benchmark is run by this script which depends on scripts here.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes

Results

All artifacts from the tests are here. A spreadsheet with the charts and numbers is here.

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow, the y-axis starts at 0.9 to improve readability. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0, and an improvement when it is > 1.0. The relative throughput is:
(NOPM for a given version / NOPM for Postgres 12.22)
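
As a small illustration of that calculation, here is a sketch in Python; the NOPM values are placeholders, not the measured results.

# Hypothetical NOPM values -- placeholders only, not measured results.
nopm = {"12.22": 10000, "14.20": 9900, "17.7": 11800, "18.1": 12100}
base = nopm["12.22"]
for version, value in sorted(nopm.items()):
    print(f"Postgres {version}: relative throughput = {value / base:.2f}")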

Results: vu=1, w=100

Summary:
  • no regressions, no improvements
Results: vu=6, w=100

Summary:
  • no regressions, no improvements
Results: vu=1, w=1000

Summary:
  • no regressions, improvements in Postgres 17 and 18
Results: vu=6, w=1000

Summary:
  • no regressions, improvements in Postgres 16, 17 and 18
Results: vu=1, w=2000

Summary:
  • possible regressions in Postgres 14 and 15, improvements in 13, 16, 17 and 18
Results: vu=6, w=2000

Summary:
  • no regressions, improvements in Postgres 13 through 18
Results: vu=1, w=4000

Summary:
  • no regressions, improvements in Postgres 13 through 18
Results: vu=6, w=4000

Summary:
  • no regressions, improvements in Postgres 16 through 18

Introducing TRAE SOLO integration with Supabase

TRAE IDE now integrates with Supabase, enabling developers to manage their entire backend infrastructure without leaving the IDE. Browse databases, manage storage, configure auth, and more.

The Sauna Algorithm: Surviving Asynchrony Without a Clock

While sweating it out in my gym's sauna recently, I found a neat way to illustrate the happened-before relationship in distributed systems. Imagine I suffer from a medical condition called dyschronometria, which makes me unable to perceive time reliably, such that 10 seconds and 10 minutes feel exactly the same to me.

In this scenario, the sauna lacks a visible clock. I'm flying blind here, yet I want to leave after a healthy session. If I stay too short, I get no health benefits. If I stay too long, I risk passing out on the floor. The question becomes: How do I, a distributed node with no local clock, ensure that I operate within a safety window in an asynchronous environment?

Thankfully, the sauna has a uniform arrival of people. Every couple of minutes, a new person walks in. These people don't suffer from dyschronometria and they stay for a healthy session, roughly 10 minutes.

My solution is simple: I identify the first person to enter after me, and I leave when he leaves. I don't measure time at all. Instead, I rely solely on the causal relationship between this person's arrival and departure. The presence of this other person acts as a reference, letting me safely synchronize my action in an entirely asynchronous setting.

Murat's Sauna Algorithm

  1. Enter the sauna.
  2. Identify the very next person to enter (let's call him A).
  3. Wait until A gets up to leave.
  4. Exit immediately after A.

By anchoring my exit to the departure of a person who arrived after me, I guarantee that my stay overlaps entirely with his healthy duration, plus the delta between my arrival and his. I have successfully bounded my sauna/execution time using causal links rather than a physical timer.

Now, for the distributed systems purists who might object to step 4 (arguing that "exiting immediately" implies measuring a time interval of zero): I can mend this. I exit after Person A leaves, but before the next person leaves. That keeps us in the realm of strictly causal reasoning.

Consider what would go wrong if I anchored my departure to someone already in the sauna when I entered. That person might be near the end of their session, causing me to leave in just a few seconds. In distributed systems, this is akin to reading the state of an outdated node to make a decision, leading to inconsistent outcomes. Anchoring only to events that occur after my entry ensures I account for all relevant causality.

There is, of course, a safe way to rely on the people already inside. I could take a snapshot of the room upon entry, identify everyone present, and refuse to leave until the entire set of them has departed. This guarantees a sufficient duration (assuming at least one person recently arrived), but it requires more memory. Instead of tracking a single reference point (the newcomer), I would have to maintain state for everyone in the sauna.

Let's connect back to distributed systems. In an asynchronous system, nodes rely on causal relationships between events to order them. Just as I watch the newcomer to determine my exit, a node can observe the messages sent by other nodes to determine safe ordering. For example, in a distributed database, a transaction may want to commit only after all causally prior transactions have completed, ensuring consistency without requiring synchronized clocks. Reasoning about happened-before relationships allows asynchronous systems to coordinate safely without synchronized clocks. See discussions on logical clocks, vector clocks, and hybrid logical clocks in my previous posts.
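
To make the logical-clock connection concrete, here is a minimal Lamport clock sketch in Python; it is illustrative only, not code from this post. Each process keeps a counter, and receiving a message advances the counter past the sender's timestamp, which is exactly the kind of happened-before reasoning the sauna trick relies on.

# Minimal Lamport logical clock: events are ordered by causality, not wall time.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Timestamp attached to an outgoing message (an observable event).
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Advance past both our own clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

me, newcomer = LamportClock(), LamportClock()
t = me.send()                  # my arrival is an observable event
newcomer.receive(t)            # the newcomer's arrival happens after mine
print(me.time, newcomer.time)  # the newcomer's clock is now ahead of my arrival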

Before anybody gets too concerned, no, I don't actually suffer from dyschronometria. However, back when I was a professor, my PhD students often joked that I operated on logical clocks rather than physical ones.

Also, in reality, the gym's sauna does have a clock visible from the door. I check it, and aim to stay for about 15 minutes. But I still insist on using my sauna algorithm, well at least the first three steps, because I am a competitive SOB. I try to outcompete the average person and, knock on sauna-bench wood, I haven't had to break my rule yet. It would be absolutely catastrophic if everyone started using my algorithm, though. We'd all be waiting for the next person to leave, resulting in a deadlock. Maybe the memory-costly snapshot algorithm is better after all, as it avoids this problem.

Related post: Consistent snapshot analogies

January 07, 2026

SSDs, power loss protection and fsync latency

This post has results measuring the impact of calling fsync (or fdatasync) per write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option.

The primary point of this post is to document the claim:

For an SSD without power loss protection, writes are fast but fsync is slow.

The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise. 

tl;dr

  • for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
  • with a consumer SSD writes are fast but fsync is often slow
  • use an enterprise SSD if possible; if not, run tests to understand fsync and fdatasync latency

InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC

When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source code as well as I used to, I did browse it for this blog post and then looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following with it set to O_DIRECT:

  1. Do one large write to the doublewrite buffer, call fsync on that file
  2. Do the batch of in-place (16kb) page writes
  3. Call fsync once per database file that was written by step 2

When set to O_DIRECT_NO_FSYNC the frequency of calls to fsync is greatly reduced; fsync is only called in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading WRT the following sentence. I don't think that InnoDB ever does an fsync after each write. It can do an fsync after each batch of writes:

O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.

Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less than stellar hardware.

The following table uses results while running the Insert Benchmark for InnoDB to compute the ratio of fsyncs per write using the SHOW GLOBAL STATUS counters:
Innodb_data_fsyncs / Innodb_data_writes

And from this table a few things are clear. First, there isn't an fsync per write with O_DIRECT but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC. 

5.7.44  8.0.44
.01046  .00729  O_DIRECT
.00172  .00053  O_DIRECT_NO_FSYNC
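
As a rough sketch of how that ratio can be computed, the snippet below reads the two counters with Python and mysql-connector-python; the connection parameters are placeholders, not the servers used for these tests.

# Compute Innodb_data_fsyncs / Innodb_data_writes from SHOW GLOBAL STATUS.
import mysql.connector

cnx = mysql.connector.connect(host="127.0.0.1", user="root", password="")
cur = cnx.cursor()
cur.execute("SHOW GLOBAL STATUS WHERE Variable_name IN "
            "('Innodb_data_fsyncs', 'Innodb_data_writes')")
status = {name: float(value) for name, value in cur.fetchall()}
print("fsyncs per write:", status["Innodb_data_fsyncs"] / status["Innodb_data_writes"])
cnx.close()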

Power loss protection

I am far from an expert on this topic, but most SSDs have a write-buffer that makes small writes fast. And one way to achieve speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, while some of the consumer SSDs claim to make a best effort to flush writes from the write buffer on power loss.

One way to avoid the risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitor or other HW that provides the power loss protection. Note that power loss protection is often abbreviated as PLP.

For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow then calling fsync more frequently in InnoDB will hurt performance.

Results from fio

I used this fio script to measure performance for writes for files opened with O_DIRECT. The test was run twice per configuration, 5 minutes per run followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. The configurations tested were (a simplified sketch of the measurement appears after this list):

  • O_DIRECT without fsync, 16kb writes
  • O_DIRECT with an fsync per write, 16kb writes
  • O_DIRECT with an fdatasync per write, 16kb writes
  • O_DIRECT without fsync, 2M writes
  • O_DIRECT with an fsync per write, 2M writes
  • O_DIRECT with an fdatasync per write, 2M writes
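
Here is a simplified sketch of the measurement in Python. It is not the fio script used for the tests: it uses buffered I/O rather than O_DIRECT (which needs aligned buffers), it times the combined write+sync cost, and the file path is a placeholder, so treat the numbers only as a rough approximation of sync latency.

import os, time

PATH = "/data/fsync_test.bin"   # placeholder path on the device under test
BUF = b"\0" * 16384             # 16kb writes
N = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
for sync in (os.fsync, os.fdatasync):
    start = time.monotonic()
    for _ in range(N):
        os.write(fd, BUF)
        sync(fd)                # one sync per write
    per_op_us = (time.monotonic() - start) * 1e6 / N
    print(f"{sync.__name__}: {per_op_us:.1f} usecs per write+sync")
os.close(fd)
os.unlink(PATH)
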
Results from all tests are here. I did the test on several servers:
  • dell32
    • a large server I have at home. The SSD is a Crucial T500 2TB using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
  • gcp
    • a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPs and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
  • hetz
    • an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled. 
  • ser7
    • a mini PC I have at home. The SSD is a Samsung 990 Pro, a consumer SSD without power loss protection.
  • socket2
    • a 2-socket server I have at home. The SSD is a Samsung PM-9a3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.
Results: overview

All of the results are here.

This table lists fsync and fdatasync latency per server:
  • for servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7 that uses a Samsung 990 Pro than on the dell that uses a Crucial T500. This is to be expected given that the T500 has PLP while the 990 Pro does not.
  • sync latency is much lower on servers with enterprise SSDs
  • sync latency after 2M writes is sometimes much larger than after 16kb writes
  • for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices, the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.

--- Sync latency in microseconds for sync after 16kb writes

dell    hetz    ser7    socket2
891.1   12.4    2974.2  1.6     fsync
447.4    9.8    2783.2  0.7     fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
56.2    39.5            738.1   635.0   fsync
28.1    29.0             46.8    46.0   fdatasync

--- Sync latency in microseconds for sync after 2M writes

dell    hetz    ser7    socket2
980.1   58.2    5396.8  139.1   fsync
449.7   10.8    3508.2    2.2   fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
1020.4  916.8           821.2   778.9   fsync
 832.4  809.7            63.6    51.2   fdatasync

Results: dell

Summary:
  • Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
  • This server uses a consumer SSD so high sync latency is expected
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes
w/s     MB/s    sync    test
43400   646.6   0.0     no-sync
43500   648.5   0.0     no-sync
-
1083    16.1    891.1   fsync
1085    16.2    889.2   fsync
-
2100    31.3    447.4   fdatasync
2095    31.2    448.6   fdatasync

2 MB writes
w/s     MB/s    sync    test
2617    4992.5  0.0     no-sync
2360    4502.3  0.0     no-sync
-
727     1388.5  980.1   fsync
753     1436.2  942.5   fsync
-
1204    2297.4  449.7   fdatasync
1208    2306.0  446.9   fdatasync

Results: gcp

Summary
  • Local NVMe devices have lower sync latency and more throughput with and without a sync per write at low concurrency (1 fio job).
  • At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPs and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
  • fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
--- ext-4 and local devices

16 KB writes
w/s     MB/s    sync    test
10100   150.7   0.0     no-sync
10300   153.5   0.0     no-sync
-
6555    97.3    56.2    fsync
6607    98.2    55.1    fsync
-
8189    122.1   28.1    fdatasync
8157    121.1   28.2    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
388     741.0   1020.4  fsync
388     741.0   1012.7  fsync
-
390     744.8   832.4   fdatasync
390     744.8   869.6   fdatasync

--- xfs and local devices

16 KB writes
w/s     MB/s    sync    test
9866    146.9   0.0     no-sync
9730    145.0   0.0     no-sync
-
7421    110.6   39.5    fsync
7537    112.5   38.3    fsync
-
8100    121.1   29.0    fdatasync
8117    121.1   28.8    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
389     743.9   916.8   fsync
389     743.9   919.1   fsync
-
390     744.8   809.7   fdatasync
390     744.8   806.5   fdatasync

--- ext-4 and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2093    31.2    0.0     no-sync
2068    30.8    0.0     no-sync
-
804     12.0    738.1   fsync
798     11.9    740.6   fsync
-
1963    29.3    46.8    fdatasync
1922    28.6    49.0    fdatasync

2 MB writes
w/s     MB/s    sync    test
348     663.8   0.0     no-sync
367     701.0   0.0     no-sync
-
278     531.2   821.2   fsync
271     517.8   814.1   fsync
-
358     683.8   63.6    fdatasync
345     659.0   64.5    fdatasync

--- xfs and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2033    30.3    0.0     no-sync
2004    29.9    0.0     no-sync
-
870     13.0    635.0   fsync
858     12.8    645.0   fsync
-
1787    26.6    46.0    fdatasync
1727    25.7    49.6    fdatasync

2 MB writes
w/s     MB/s    sync    test
343     655.2   0.0     no-sync
343     655.2   0.0     no-sync
-
267     511.2   778.9   fsync
268     511.2   774.7   fsync
-
347     661.8   51.2    fdatasync
336     642.8   54.4    fdatasync

Results: hetz

Summary
  • this has an enterprise SSD with excellent (low) sync latency
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
37700   561.7   0.0     no-sync
37500   558.9   0.0     no-sync
-
25200   374.8   12.4    fsync
25100   374.8   12.4    fsync
-
27600   411.0   0.0     fdatasync
27200   404.4   9.8     fdatasync

2 MB writes
w/s     MB/s    sync    test
1833    3497.1  0.0     no-sync
1922    3667.8  0.0     no-sync
-
1393    2656.9  58.2    fsync
1355    2585.4  59.6    fsync
-
1892    3610.6  10.8    fdatasync
1922    3665.9  10.8    fdatasync

Results: ser7

Summary:
  • this has a consumer SSD with high sync latency
  • results had much variance (see the 2MB results below), as did results at higher concurrency. This is a great SSD, but not for my use case.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
34000   506.4   0.0     no-sync
40200   598.9   0.0     no-sync
-
325     5.0     2974.2  fsync
333     5.1     2867.3  fsync
-
331     5.1     2783.2  fdatasync
330     5.0     2796.1  fdatasync

2 MB writes
w/s     MB/s    sync    test
362     691.4   0.0     no-sync
364     695.2   0.0     no-sync
-
67      128.7   10828.3 fsync
114     218.4   5396.8  fsync
-
141     268.9   3864.0  fdatasync
192     368.1   3508.2  fdatasync

Results: socket2

Summary:
  • this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
49500   737.2   0.0     no-sync
49300   734.3   0.0     no-sync
-
44500   662.8   1.6     fsync
45400   676.2   1.5     fsync
-
46700   696.2   0.7     fdatasync
45200   674.2   0.7     fdatasync

2 MB writes
w/s     MB/s    sync    test
707     1350.4  0.0     no-sync
708     1350.4  0.0     no-sync
-
703     1342.8  139.1   fsync
703     1342.8  122.5   fsync
-
707     1350.4  2.2     fdatasync
707     1350.4  2.1     fdatasync


Introducing OpenEverest: An Independent Open Source Project for the Future of Data Platforms

Today, we are sharing an important step forward for the project many of you know as Percona Everest. Percona is transitioning Percona Everest into an independent open source project called OpenEverest (https://openeverest.io/). This change is about one thing: making the project stronger over the long term by building it in the open, with open governance, […]

Supabase Security Retro: 2025

A summary of Supabase platform security changes made in 2025 and the security defaults planned for 2026.

January 06, 2026

Implement multi-Region endpoint routing for Amazon Aurora DSQL

Applications using Aurora DSQL multi-Region clusters should implement a DNS-based routing solution (such as Amazon Route 53) to automatically redirect traffic between AWS Regions. In this post, we show you an automated solution for redirecting database traffic to alternate regional endpoints without requiring manual configuration changes, particularly in mixed data store environments.

Urgent Security Update: Patching “Mongobleed” (CVE-2025-14847) in Percona Server for MongoDB

At Percona, our mission has always been to provide the community with truly open-source, enterprise-class software. A critical part of that mission is ensuring that when security vulnerabilities arise in the upstream ecosystem, we respond with the urgency and transparency our users expect. As many in the MongoDB community are now aware, a security vulnerability—CVE-2025-14847, […]

Blog Post: Good Bye Percona Everest, Hello OpenEverest!

Over the past few years, we’ve been building Percona Everest with a clear goal in mind: to deliver a powerful yet approachable DBaaS experience on Kubernetes. Thanks to strong user and customer adoption, Everest has grown into a platform with thousands of production clusters deployed and overwhelmingly positive feedback from the community. As adoption grew, […]

January 05, 2026

3 Steps to Optimize Your Queries for Speed

When performance matters, how you store your application objects, write queries, and design indexes can be the difference between scanning a collection for minutes and getting results in milliseconds—without touching a single document.

In this article, we’ll walk through an example in MongoDB. We’ll start with a normalized model, move to an embedded model, add an index to avoid a full collection scan, and then refine both the index and query to read only what’s needed for the result. The execution plan and execution statistics will guide us at each step.

Scenario

Our domain is an e‑commerce order‑entry application with customers, products, orders, and order lines. We want to:

Find the most recent order in Switzerland containing the book “MongoDB Data Modeling and Schema Design”.

Initial model: normalized with references

If we build the data model without considering how applications will access it—or simply migrate from a relational database by creating one collection per table—we end up with a model where a single domain aggregate is split across multiple collections.

In this model, we insert a customer, some products (books), an order for this customer, and the corresponding order lines for those books:

// Customer  
db.customers.insertOne({  
  "_id": "C0001",  
  "name": "Franck Pachot",  
  "email": "franck.pachot@example.com",  
  "address": { "city": "Geneva", "country": "Switzerland" }  
})  

// Products  
db.products.insertOne({  
  "_id": "P0001",  
  "title": "MongoDB in Action, Third Edition: Building on the Atlas Data Platform",  
  "author": "Arek Borucki",  
  "price": 59.99, "category": "Database", "format": "Paperback"  
})  
db.products.insertOne({  
  "_id": "P0002",  
  "title": "MongoDB Data Modeling and Schema Design",  
  "author": ["Daniel Coupal", "Pascal Desmarets", "Steve Hoberman"],  
  "price": 54.99, "category": "Database", "format": "Paperback"  
})  
db.products.insertOne({  
  "_id": "P0003",  
  "title": "High Performance with MongoDB: Best Practices for Performance Tuning, Scaling, and Architecture",  
  "author": ["Asya Kamsky", "Ger Hartnett", "Alex Bevilacqua"],  
  "price": 49.99, "category": "Database", "format": "Paperback"  
})  

// Order  
db.orders.insertOne({  
  "_id": "O0001",  
  "customerId": "C0001",  
  "orderDate": ISODate("2024-10-22T09:00:00Z"),  
  "status": "Processing"  
})  

// Order lines  
db.orderlines.insertMany([  
  { "orderId": "O0001", "productId": "P0001", "quantity": 1 },  
  { "orderId": "O0001", "productId": "P0002", "quantity": 1 },  
  { "orderId": "O0001", "productId": "P0003", "quantity": 1 }  
])

This is a classic relational database pattern, but querying and optimizing it is complex because joins must occur before filtering.

In MongoDB, to rebuild the domain object aggregate, we use an aggregation pipeline with $lookup and $match:

db.orders.aggregate([  
  // Get customer info to filter by country  
  { $lookup: {  
      from: "customers",  
      localField: "customerId",  
      foreignField: "_id",  
      as: "customer"  
    }},  
  { $unwind: "$customer" },  

  // Get product info to filter by title  
  { $lookup: {  
      from: "orderlines",  
      let: { orderIdVar: "$_id" },  
      pipeline: [  
        { $match: { $expr: { $eq: ["$orderId", "$$orderIdVar"] } } },  
        { $lookup: {  
            from: "products",  
            localField: "productId",  
            foreignField: "_id",  
            as: "product"  
          }},  
        { $unwind: "$product" }  
      ],  
      as: "orderLines"  
    }},  

  // Apply filters  
  { $match: {  
      "customer.address.country": "Switzerland",  
      "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
    }},  

  // Sort and limit  
  { $sort: { orderDate: -1 } },  
  { $limit: 1 }  
])  

This query works, but it scans the entire collection. Indexes aren’t effective here because $match runs only after the $lookup and $unwind stages.

With one million orders, it would read all one million, perform three million lookups, filter on country and book title, sort a large intermediate result, and then return just one document.

That’s why joins can be slow: when data is split across many tables or collections, you lose the benefit of efficient index access—compound indexes can only target one collection. It also requires coding every query in the domain repository.

We need to rethink the data model.

Step 1: Embedding to fit the domain model

Knowing the database is for an order‑entry system, we can design a document model that matches the domain. An order should be a self‑contained document with all customer info and complete product details for each order line—because that’s how it’s displayed, printed, or emailed. If it’s a single logical document on paper, it should be a single document in MongoDB.

Here’s the embedded form:

db.orders.insertOne({  
  _id: 'O0002',  
  customerId: 'C0001',  
  orderDate: ISODate('2024-10-22T09:00:00.000Z'),  
  status: 'Processing',  
  customer: {  
       _id: 'C0001',  
       name: 'Franck Pachot',  
       email: 'franck.pachot@example.com',  
       address: { city: 'Geneva', country: 'Switzerland' }  
  },  
  orderLines: [  
       { quantity: 1, product: { _id: 'P0001', title: 'MongoDB in Action...', price: 59.99 } },  
       { quantity: 1, product: { _id: 'P0002', title: 'MongoDB Data Modeling and Schema Design', price: 54.99 } },  
       { quantity: 1, product: { _id: 'P0003', title: 'High Performance with MongoDB...', price: 49.99 } }  
  ]  
})

Duplicated data isn’t a concern here—documents are compressed on disk, and changes to product descriptions don’t affect completed orders.

Querying becomes far simpler:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1)

No lookups needed. But it still does a collection scan, filters, sorts, and finally returns one document.

I add one million documents and run the query with execution statistics:

db.orders.insertMany(Array.from({length: 333333},()=>({customer:{address:{country:"Switzerland"}}})))
db.orders.insertMany(Array.from({length: 666666},()=>({customer:{address:{country:"Germany"}}})))

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 559,
  totalKeysExamined: 0,
  totalDocsExamined: 1000001,
  executionStages: {
    isCached: false,
    stage: 'SORT',
    nReturned: 1,
    executionTimeMillisEstimate: 542,
    works: 1000004,
    advanced: 1,
    needTime: 1000002,
    needYield: 0,
    saveState: 29,
    restoreState: 29,
    isEOF: 1,
    sortPattern: { orderDate: -1 },
    memLimit: 104857600,
    limitAmount: 1,
    type: 'simple',
    totalDataSizeSorted: 0,
    usedDisk: false,
    spills: 0,
    spilledRecords: 0,
    spilledBytes: 0,
    spilledDataStorageSize: 0,
    inputStage: {
      stage: 'COLLSCAN',
      filter: {
        '$and': [
          { 'customer.address.country': { '$eq': 'Switzerland' } },
          {
            'orderLines.product.title': { '$eq': 'MongoDB Data Modeling and Schema Design' }
          }
        ]
      },
      nReturned: 1,
      executionTimeMillisEstimate: 516,
      works: 1000002,
      advanced: 1,
      needTime: 1000000,
      needYield: 0,
      saveState: 29,
      restoreState: 29,
      isEOF: 1,
      direction: 'forward',
      docsExamined: 1000001
    }
  }
}

In order to return a single document (nReturned: 1), one million documents were read (totalDocsExamined: 1000001), and sorted (stage: 'SORT', works: 1000004).

Now that we have everything in one collection, we can optimize with an index. Our objective is to reduce totalDocsExamined to nReturned.

Step 2: Indexing for filtering, sorting, and pagination

Our query filters by:

  • Equality on customer country
  • Equality on product title
  • A range on order date (due to sort().limit()).

An ideal compound index:

db.orders.createIndex({  
  "customer.address.country": 1,   // equality  
  "orderLines.product.title": 1,   // equality  
  "orderDate": 1                   // sort/range  
})  

Here is the execution plan:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 1,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 1,
    inputStage: {
      stage: 'FETCH',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 1,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      docsExamined: 1,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 1,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: {
          'customer.address.country': 1,
          'orderLines.product.title': 1,
          orderDate: 1
        },
        indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1',
        isMultiKey: true,
        multiKeyPaths: {
          'customer.address.country': [],
          'orderLines.product.title': [ 'orderLines' ],
          orderDate: []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'backward',
        indexBounds: {
          'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
          'orderLines.product.title': [
            '["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
          ],
          orderDate: [ '[MaxKey, MinKey]' ]
        },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 1,
        dupsDropped: 0
      }
    }
  }
}

With this index, the query jumps straight to the target document:

  • totalKeysExamined: 1 from one range (seeks: 1 in stage: 'IXSCAN')
  • totalDocsExamined: 1 from stage: 'FETCH' without the need to sort
  • nReturned: 1

The result is retrieved in sorted order directly from the index, minimizing the number of documents to read. We can do better. Our next goal is to eliminate totalDocsExamined.

Step 3: Projection and covering index

If the application only needs the customer’s name and country, we can project:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  },  
  { "customer.name": 1, "customer.address.country": 1, _id: 0 }  
)  

[
  {
    customer: { name: 'Franck Pachot', address: { country: 'Switzerland' } }
  }
]

Now, we can create a covering index:

db.orders.createIndex({  
  "customer.address.country": 1,  
  "orderLines.product.title": 1,  
  "orderDate": 1,  
  "customer.name": 1  
})  

Here is the execution plan:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  },  
  { "customer.name": 1, "customer.address.country": 1, _id: 0 }  
).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 0,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 1,
    inputStage: {
      stage: 'PROJECTION_DEFAULT',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 1,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      transformBy: { 'customer.name': 1, 'customer.address.country': 1, _id: 0 },
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 1,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: {
          'customer.address.country': 1,
          'orderLines.product.title': 1,
          orderDate: 1,
          'customer.name': 1
        },
        indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1_customer.name_1',
        isMultiKey: true,
        multiKeyPaths: {
          'customer.address.country': [],
          'orderLines.product.title': [ 'orderLines' ],
          orderDate: [],
          'customer.name': []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
          'orderLines.product.title': [
            '["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
          ],
          orderDate: [ '[MinKey, MaxKey]' ],
          'customer.name': [ '[MinKey, MaxKey]' ]
        },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 1,
        dupsDropped: 0
      }
    }
  }
}

This is the optimal plan where all documents returned (nReturned: 1) come from the index entries that were scanned (totalKeysExamined: 1) without any FETCH (totalDocsExamined: 0).

Conclusion

MongoDB query performance isn’t just about adding indexes—it’s about aligning your data model and queries with how your application really uses data. Turning a four-collection join with millions of lookups into a single index scan that runs in milliseconds comes down to three rules:

  1. Model for Access: Embed data that’s accessed together. If it belongs on one piece of paper—like an order with its customer and product details—it belongs in one document.
  2. Index for Your Query: Use compound indexes that follow your filter, sort, and projection. Equality fields first, range fields last. This turns full scans into targeted index scans, cutting totalDocsExamined from millions to one.
  3. Project Only What You Need: Return only required fields. When a query is fully covered by an index, MongoDB can achieve totalDocsExamined: 0 and return the expected result fast.

You don’t need to apply this to all queries—use these steps only when necessary. Excessive lookups and limited indexing options drove us to optimize the data model. After that, the remaining need for selective filters and the absence of a suitable index led to creating ... (truncated)

Are Database System Researchers Making Correct Assumptions about Transaction Workloads?

In this blog, we have reviewed quite a number of deterministic database papers, including Calvin, SLOG, and Detock, which aim to achieve higher throughput and lower latency. The downside of these systems is sacrificing transaction expressivity. They rely on two critical assumptions: first, that transactions are "non-interactive", meaning they are sent as a single request (one-shot) rather than engaging in a multi-round-trip conversation with the application, and second, that the database can know a transaction's read/write set before execution begins (to lock data deterministically).

So when these deterministic database researchers write a paper to validate how well these assumptions hold in the real world, we should read it with some skepticism. Don't get me wrong, this is a great and valuable paper. But we still need to be critical in our reading.


Summary

The study employed a semi-automated annotation tool to analyze 111 popular open-source web applications and over 30,000 transactions. The authors target applications built with Django (Python) and TypeORM (TypeScript). These Object-Relational Mappers (ORMs) abstract away the SQL layer, allowing developers to interact with a wide variety of underlying relational databases (such as PostgreSQL, MySQL, and SQLite) using standard application code. The authors categorized transactions based on two primary dimensions: Transaction Interactivity and  Read/Write Set Inferability. 


Transaction Interactivity

The study found that interactive transactions, where the application holds a transaction open while performing logic or reads, are surprisingly rare. 39% of applications contain no interactive transactions. For the remaining 61% of applications, only roughly 9.6% of the workload is interactive.


The authors then distinguish between "Strictly Interactive" transactions, which inherently require external input or side effects mid-flight (e.g., waiting for an API response or user prompt), and "Non-Strictly Interactive" ones, where the back-and-forth is merely deterministic application logic processing reads between queries without outside dependencies. They find strictly interactive transactions account for only 0.5% of the total workload. They argue that the vast majority of remaining interactive transactions are convertible to one-shot transactions with minimal code changes.


Read/Write Set Inferability

The authors tested whether a database could statically predict which rows a transaction would touch. They found that for 90% of applications, at least 58% of transactions have read/write sets that are fully inferable in advance.

The primary obstacle to perfect knowledge is the CDA Mismatch (Conflict Detection Attribute Mismatch), which occurs in roughly 27% of transactions. This happens when a transaction queries data using a secondary attribute (e.g., 'email') rather than the primary locking key (e.g., 'ID'). See Figure 1 above. When this happens, the database cannot infer the lock without first inspecting the data. However, the study remains upbeat about this. Since CDA mismatches typically occur in simple single-statement transactions, they say that the "fallback" mechanism (running a lightweight reconnaissance query to find the ID) is inexpensive.
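
As an illustration of that fallback (my sketch, not code from the paper; the table and column names are hypothetical), the application first resolves the secondary attribute to the primary key with a cheap read, then runs the one-shot transaction keyed on the id:

# Reconnaissance/fallback sketch for a CDA mismatch, using psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=app")   # placeholder connection string

# Reconnaissance query: a cheap read to learn the read/write set (the id).
with conn, conn.cursor() as cur:
    cur.execute("SELECT id FROM users WHERE email = %s", ("a@example.com",))
    user_id = cur.fetchone()[0]

# One-shot transaction: the lock set (the id) is now known before execution.
with conn, conn.cursor() as cur:
    cur.execute("UPDATE users SET credits = credits - 1 WHERE id = %s", (user_id,))

conn.close()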



Critical Review

This paper provides very valuable empirical data on transaction workloads by surveying a large number of applications. It echoes the scale of the SIGMOD 2015 "Feral Concurrency Control" study, which examined the integrity risks of the mismatch between databases and modern ORM frameworks like Rails. However, we must examine this study critically to identify potential limitations in scope and bias. My two big criticisms are the following. 

Selection Bias. The choice of the corpus, open-source web applications using ORMs, heavily skews the results and excludes almost all enterprise systems. The study also ignores ad-hoc transactions executed by human operators (DBAs, analysts) via terminals. These transactions are inherently interactive ("user think time") and often performance-critical.

Convertibility Optimism. The paper argues that "Non-Strictly Interactive" transactions are easily convertible to deterministic models without resorting to brittle Stored Procedures. They advocate for a reconnaissance/scout query pattern, where the application performs a dry-run read phase to optimistically build a write-set before sending it as a single atomic batch. While this is promising, the claim of minimal changes seems very optimistic. However, I think this conversion challenge is a great opportunity for research and for collecting empirical evidence of the actual engineering costs.

JavaScript Stored Routines in Percona Server for MySQL: A New Era for Database Programmability

For decades, we’ve accepted a painful compromise: if you wanted logic inside the database, you had to write SQL/PSM (Persistent Stored Modules). It’s clunky, hard to debug, and declarative by nature, making it terrible for algorithmic tasks. That ends with Percona Server 8.4.7-7. We are introducing JS Stored Programs as a Tech Preview. Unlike Oracle’s […]

January 04, 2026

Too Close to Our Own Image?

Recent work suggests we may be projecting ourselves onto LLMs more than we admit.

A paper in Nature reports that GPT-4 exhibits "state anxiety". When exposed to traumatic narratives (such as descriptions of accidents or violence), the model's responses score much higher on a standard psychological anxiety inventory. The jump is large, from "low anxiety" to levels comparable to highly anxious humans. The same study finds that therapy works: mindfulness-style relaxation prompts reduce these scores by about a third, though not back to baseline. The authors argue that managing an LLM's emotional state may be important for safe deployment, especially in mental health settings and perhaps in other mission-critical domains.

Another recent paper argues that LLMs can develop a form of brain rot. Continual training on what the authors call junk data (short, viral, sensationalist content typical of social media) leads to models developing weaker reasoning, poorer long-context handling, and worse safety compliance. Those models also score higher on dark personality traits like narcissism and psychopathy, as measured by the TRAITS benchmark. The primary mechanism behind this decline was identified as thought-skipping, where the model truncates its own reasoning chains, echoing the fragmented attention patterns associated with internet addiction in humans.

In my previous post, I discussed how agentic system techniques parallel self-improvement advice. I noted that successful agents rely on habits that look suspiciously like human self-help techniques. These models write things down, using scratchpads and memory buffers to offload working memory. They think in loops, iterating through write-reason-repeat cycles to decompose hard problems. They role-play, adopting constrained personas like "The Critic" or "The Engineer" to constrain the search and behavior space, much like the alter ego effect in humans.

This makes me think: would we need psychologists and life coaches for AI in the future? Would this be a vocation/career?

If models can suffer from anxiety that responds to mindfulness prompts, and from cognitive decline that responds to changes in their data diet, we are already treating them as if they are biological minds rather than machines. Perhaps the "Prompt Engineer" of today is just the precursor to the "AI Psychologist" of tomorrow, a professional whose job is to keep it sane, focused, and "healthy".

findOneAndUpdate(): UPDATE ... RETURNING in MongoDB, Idempotent and ACID, with

You can implement atomic read-write operations in a single call instead of an explicit transaction, reducing both round trips and the window for concurrent conflicts. In PostgreSQL, use UPDATE ... RETURNING instead of SELECT FOR UPDATE followed by UPDATE. In MongoDB, use findOneAndUpdate() instead of updateOne() followed by find(). This enables a single ACID read-write operation that is failure-resilient and safely retryable in MongoDB because it is idempotent. Here is an example.

Demo: two withdrawals and a receipt

Bob withdraws using a debit card (no negative balance allowed). The application first does updateOne to subtract the amount only if the balance is enough, then does a separate find() to print the receipt with the balance. Between those two calls, Alice uses a credit card (allowing a negative balance) and withdraws money from the same account, so Bob's printed balance becomes inconsistent with the balance checked for his withdrawal.

To fix this, we'll use findOneAndUpdate() to return the post-update balance atomically with the withdrawal.

Connection and collection

Here’s a Python + PyMongo program simulating this race condition and showing the fix:

from pymongo import MongoClient, ReturnDocument
import threading
import time

# Connect to MongoDB
client = MongoClient("mongodb://127.0.0.1:27017/?directConnection=true")
db = client.bank
accounts = db.accounts

# Prepare test account: Bob & Alice share this account
def reset_account():
    accounts.drop()
    accounts.insert_one({
        "_id": "b0b4l3ce",
        "owner": ["Bob", "Alice"],
        "balance": 100
    })

reset_account()

I have only one document for this example:

bank> db.accounts.find()

[ { _id: 'b0b4l3ce', owner: [ 'Bob', 'Alice' ], balance: 100 } ]

The balance is set at 100.

Scenario with updateOne() followed by find()

Bob's withdrawal process:


def bob_withdraw_debit(amount): 

    print("[Bob] Attempting debit withdrawal", amount)  
    # Application logic checks balance then updates  
    result = accounts.update_one(  
        {"_id": "b0b4l3ce", "balance": {"$gte": amount}},   # must have enough money  
        {"$inc": {"balance": -amount}}  
    )  

    # If no document was updated, the filter didn't match - insufficient funds
    if result.modified_count == 0:  
        print("[Bob] Withdrawal denied - insufficient funds")  
        return  

    # Simulate some processing delay before printing the ticket - to be sure to reproduce the race condition
    time.sleep(1)

    # query the balance for the receipt
    balance = accounts.find_one({"_id": "b0b4l3ce"})["balance"]  
    print(f"[Bob] Debit card ticket: withdrew {amount}, balance after withdrawal: {balance}")  

Alice's withdrawal process:

# Alice withdraws using credit card (negative balance allowed)  
def alice_withdraw_credit(amount, delay=0):  
    time.sleep(delay)  # let Bob start first  
    print("[Alice] Attempting credit withdrawal", amount) 
    # There's no balance check for credit cards
    accounts.update_one(  
        {"_id": "b0b4l3ce"},  
        {"$inc": {"balance": -amount}}  
    )  
    print("[Alice] Credit withdrawal complete")  

Demo script to run in two threads and get the race condition:

def demo():
    reset_account()
    t_bob = threading.Thread(target=bob_withdraw_debit, args=(80,))
    t_alice = threading.Thread(target=alice_withdraw_credit, args=(30, 0.5))  # starts just after Bob update
    t_bob.start()
    t_alice.start()
    t_bob.join()
    t_alice.join()

The output is:

>>> demo()

[Bob] Attempting debit withdrawal 80
[Alice] Attempting credit withdrawal 30
[Alice] Credit withdrawal complete
[Bob] Debit card ticket: withdrew 80, balance after withdrawal: -10

Bob received a ticket showing a withdrawal with a negative balance from his debit card — this is a bug ❌

We must get the balance as of the update itself: either the before image (so we can subtract the amount ourselves) or the after image.

Scenario with findOneAndUpdate() and returnDocument: "after"

Bob's withdrawal process:

def bob_withdraw_debit(amount):  
    print("[Bob] Attempting debit withdrawal", amount)  
    doc = accounts.find_one_and_update(  
        {"_id": "b0b4l3ce", "balance": {"$gte": amount}},  
        {"$inc": {"balance": -amount}},  
        return_document=ReturnDocument.AFTER  # get post-update document atomically  
    )  
    # No need to check the update count, we have the document if it was updated
    if not doc:  
        print("[Bob] Withdrawal denied - insufficient funds")  
        return  
    # Ticket immediately shows consistent balance  
    print(f"[Bob] Ticket: withdrew {amount}, balance after withdrawal: {doc['balance']}")

Running the demo again:

>>> demo()

[Bob] Attempting debit withdrawal 80
[Bob] Ticket: withdrew 80, balance after withdrawal: 20
[Alice] Attempting credit withdrawal 30
[Alice] Credit withdrawal complete

Bob received a ticket showing the balance at the exact time of withdrawal ✅

The update write and post-update read occurred as a single atomic operation on the document, leaving no opportunity for another write between the update and the displayed read result.

After-image is saved for consistency and resilience

In MongoDB, reads and writes do not acquire transaction locks like they can in relational databases, but document updates are still atomic at the document level, even without starting an explicit transaction. MongoDB uses lightweight document-level locks internally to ensure ACID guarantees for a single document, since even one update can involve multiple internal reads and writes, such as enforcing unique constraints and updating indexes.

updateOne() returns only metadata (such as the number of documents updated), whereas findOneAndUpdate() returns the updated document itself, with the read and write occurring in the same atomic operation at the single-document level. This atomicity is preserved even in the event of failures.

If the network is lost or the primary crashes and a secondary is promoted, the MongoDB driver retries the operation as part of retryable writes. Because retries must be idempotent, find_one_and_update() returns the same document image on retry.

To support this, MongoDB stores a document image—either the after image (as in this example, using returnDocument: "after") or the before image—in an internal system collection (config.image_collection) that is replicated independently of the oplog, as part of the same transaction:

bank> use config

config> db.image_collection.find()
[
  {
    _id: {
      id: UUID('d04e10d6-c61d-42ad-9a44-5bb226a898d8'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNum: Long('15'),
    ts: Timestamp({ t: 1767478784, i: 5 }),
    imageKind: 'postImage',
    image: { _id: 'b0b4l3ce', owner: [ 'Bob', 'Alice' ], balance: 20 },
    invalidated: false
  }
]
config>

When you enable write retries, this image collection is used internally to make write operations resilient to failures, which are handled transparently by the database driver and server, not by the application, and with the strongest consistency.
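
For reference, here is a minimal sketch of a connection where retryable writes are in effect. Retryable writes require a replica set; the replica set name below is a placeholder, and retryWrites is on by default in recent drivers, so it is spelled out only to make it visible.

from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017/?replicaSet=rs0&retryWrites=true")
accounts = client.bank.accounts
# A find_one_and_update() issued through this client is retried transparently
# after a transient failure, and the stored post-image keeps the retry idempotent.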

Comparison with PostgreSQL

In PostgreSQL, you would use the following queries to do the equivalent:

CREATE TABLE accounts (  
    id TEXT PRIMARY KEY,  
    balance NUMERIC,  
    owners TEXT[]  
);  
INSERT INTO accounts VALUES ('b0b4l3ce', 100, ARRAY['Bob','Alice']);  

-- Alice transaction  
UPDATE accounts    
SET balance = balance - 30    
WHERE id = 'b0b4l3ce';  

-- Bob transaction  
UPDATE accounts    
SET balance = balance - 80    
WHERE id = 'b0b4l3ce' AND balance >= 80    
RETURNING balance AS new_balance;

PostgreSQL drivers do not automatically retry failures and rely on MVCC and locks to ensure ACID properties.

With Repeatable Read isolation—appropriate because the write (SET balance = balance - 80) depends on the read (WHERE balance >= 80)—Bob's transaction takes a snapshot at the start and, when run concurrently with Alice's, still sees balance = 100. If Alice commits first and reduces the balance to 70, Bob's transaction fails with: ERROR: could not serialize access due to concurrent update. The application must handle this by retrying the entire transaction. The driver does not do this automatically.

With the default Read Committed isolation level, Bob’s transaction waits if Alice’s update has locked the row. After Alice commits, PostgreSQL reevaluates Bob’s WHERE clause. The balance is now 70, which no longer satisfies balance >= 80, so the UPDATE affects 0 rows and the withdrawal is refused, preventing a negative balance. This is fine when only one row is affected, but in a multi-row statement it could be inconsistent, operating on rows from different transactional states.
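
For completeness, here is a rough Python equivalent of Bob's withdrawal against the PostgreSQL schema above, using psycopg2 and UPDATE ... RETURNING. It is only a sketch with a placeholder connection string, and any retry after a serialization failure remains the application's job.

import psycopg2

conn = psycopg2.connect("dbname=bank")   # placeholder connection string
with conn, conn.cursor() as cur:         # commits on success, rolls back on error
    cur.execute(
        """
        UPDATE accounts
        SET balance = balance - %s
        WHERE id = %s AND balance >= %s
        RETURNING balance
        """,
        (80, "b0b4l3ce", 80),
    )
    row = cur.fetchone()
    if row is None:
        print("[Bob] Withdrawal denied - insufficient funds")
    else:
        print(f"[Bob] Ticket: withdrew 80, balance after withdrawal: {row[0]}")
conn.close()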

Conclusion

MongoDB supports both multi-document transactions and single-document atomic operations, but strongly encourages using single-document operations when possible. If you design your schema so that business logic fits in a single document, findOneAndUpdate() can perform conditional checks, apply updates, and return the updated document atomically, avoiding race condition anomalies and enabling idempotent retries.

In some cases—such as the one described in the earlier blog post on FOR UPDATE SKIP LOCKED in MongoDB—an updateOne() followed by a find() with appropriate conditions is enough and avoids storing before- or after-images.

Databases in 2025: A Year in Review

The world tried to kill Andy off but he had to stay alive to talk about what happened with databases in 2025.

January 02, 2026

The Agentic Self: Parallels Between AI and Self-Improvement

2025 was the year of the agent. The goalposts for AGI shifted; we stopped asking AI to merely "talk" and demanded that it "act". As an outsider looking at the architecture of these new agents and agentic systems, I noticed something strange. The engineering tricks used to make AI smarter felt oddly familiar. They read less like computer science and more like … self-help advice.

The secret to agentic intelligence seems to lie in three very human habits: writing things down, talking to yourself, and pretending to be someone else. They are almost too simple.

The Unreasonable Effectiveness of Writing

One of the most profound pieces of advice I ever read as a PhD student came from Prof. Manuel Blum, a Turing Award winner. In his essay "Advice to a Beginning Graduate Student", he wrote: "Without writing, you are reduced to a finite automaton. With writing you have the extraordinary power of a Turing machine."

If you try to hold a complex argument entirely in your head, you will fail. Your working memory is a "finite automaton": it has hard limits. But if you write things down, you offload that memory to the paper. You can then reference it, critique it, and build on it. The paper becomes your external hard drive.

AI Agents are now being built on this exact principle. An LLM has a limited context window; its effective attention span is short. If it tries to solve a 50-step coding problem in one breath, it will crash and burn. To fix this, we give agents scratchpads. We force them to write down a plan before they execute code. We give them memory buffers to store facts for later. We are essentially asking AI to take notes. By externalizing their internal state onto a digital piece of paper, agents evolve from simple pattern-matchers into robust thinkers.

Thinking is Just Talking to Yourself in a Loop

For a long time, we treated LLMs as simple input-output machines: we asked a question, and it produced an answer. Often, the answers were unsatisfying—hallucinated or superficial. DeepSeek changed this by pausing the model before it answered, prompting it to “think” first. But what does thinking mean for a computer? It means generating text that remains hidden from the user: an internal monologue. In that sense, it mirrors human thought.

“Writing is nature’s way of letting you know how sloppy your thinking is.” — Guindon

“If you think without writing, you only think you’re thinking.” — Leslie Lamport

This brings us back to the unreasonable effectiveness of writing, but this time performed in a loop. Thinking isn’t instantaneous; it is a process. Prototyping for the win! The Agent follows a loop: Act/Write → Reason → Repeat. It talks to itself, breaking complex problems into manageable parts. It asks, “Wait, should I check this assumption?” or “That doesn’t look right; let me try again.”

Role Playing: The Alter Ego Effect

A few years ago, I read The Alter Ego Effect: The Power of Secret Identities to Transform Your Life. Its core idea is that adopting a persona can unlock hidden capabilities. By stepping into a specific character, you bypass your own inhibitions and trigger a defined set of behaviors. Beyoncé had “Sasha Fierce” on stage, and most athletes have their “game faces”. The book made a convincing case, but this sounded so cheesy that I didn’t even blog about it at the time.

I guess that stuff actually works! In the world of AI Agents, they call this "Role Prompting" or "Multi-Agent Systems".

If you ask a single AI to "write code," it might do a mediocre job. But if you assign one AI the role of "The Architect" to plan the code, another the role of "The Engineer" to write it, and a third the role of "The Critic" to review it, the result is exponentially better.

These personas are not just theatrics; they are inductive biases. They constrain the search space. Just as adopting the persona of a "strict editor" helps a writer cut fluff, assigning an AI the role of a "debugger" forces it to look for errors rather than just generating text.

Advisor models act as meta-thinkers. They don’t execute tasks directly but monitor other agents, flag risks, and guide decisions. In coding, an Advisor might warn the Architect of a risky design or nudge the Engineer away from error-prone code. By providing this high-level oversight, Advisors let agents focus on immediate tasks while keeping long-term goals in view, making multi-agent systems more strategic.

Isn’t this just the classic advice-column cliche: "get a coach"?

Where do we go from here?

Maybe we are just discovering that the best way to utilize pattern-matching-based raw intelligence is through a simple universal workflow: Write it down, talk it through or try it out, and check your work. These steps may provide the minimal scaffolding required to sustain reasoning, forming the core mechanics of thought.

Lamport offers a corollary to Guindon:
Mathematics is nature's way of showing how sloppy our writing is.

Mathematics closes the loopholes that prose leaves open. What writing does to thought, mathematics does to writing. This is likely where symbolic AI and formal methods become important. Writing a specification already sharpens thinking. Formalizing it sharpens it again and makes reasoning as precise and reliable as possible.

Common prefix skipping, adaptive sort

The patent expired for US7680791B2. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm; it is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.

The patent is for a new in-memory sort algorithm that needs a name. Features include:

  • common prefix skipping
    • skips comparing the prefix of key bytes when possible
  • adaptive
    • switches between quicksort and most-significant digit radix sort
  • key substring caching
    • reduces CPU cache misses by caching the next few bytes of the key
  • produces results before sort is done
    • sorted output can be produced (to the rest of the query, or spilled to disk for an external sort) before the sort is finished. 

How it came to be

From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.

While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.

So I thought about this while falling asleep, and after many nights I realized that with a divide and conquer sort, as the algorithm descends into subpartitions of the data, the common prefixes of the keys in each subpartition were likely to grow (a minimal sketch follows the list below):

  • if the algorithm remembers the length of the common prefix as it descends, then it can skip that prefix during comparisons to save CPU overhead
  • if the algorithm notices when the length of the common prefix grows, then it can switch from quicksort to most-significant digit (MSD) radix sort on the next byte beyond the common prefix, and then switch back to quicksort afterwards
  • the algorithm can cache bytes from the key in an array, like Alphasort. But unlike Alphasort, as it descends it can cache the next few bytes it will need to compare rather than only the first few bytes of the key. This provides much better memory system behavior (fewer cache misses).
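
To make this concrete, below is a minimal, illustrative sketch in Python. It is not the patented Oracle implementation and it simplifies the adaptive part: it always takes an MSD radix step on the byte just past the known common prefix and falls back to a comparison sort (on the suffixes) for small partitions, instead of switching between quicksort and radix sort; pivot sampling, key substring caching, and producing output before the sort finishes are all omitted, and the cutoff value is an arbitrary choice.

SMALL = 16  # below this size, fall back to a comparison sort on the suffixes

def common_prefix_len(keys, start):
    """Length of the prefix shared by all keys, scanning from byte `start`."""
    first = keys[0]
    i = start
    while i < len(first) and all(i < len(k) and k[i] == first[i] for k in keys):
        i += 1
    return i

def adaptive_sort(keys, depth=0):
    """Return `keys` (byte strings) sorted; the first `depth` bytes are already
    known to be identical, so comparisons and partitioning skip them."""
    if len(keys) <= SMALL:
        return sorted(keys, key=lambda k: k[depth:])    # skip the common prefix
    depth = common_prefix_len(keys, depth)              # the prefix grows as we descend
    short, buckets = [], {}
    for k in keys:
        if len(k) <= depth:
            short.append(k)                             # key ends at the prefix; sorts first
        else:
            buckets.setdefault(k[depth], []).append(k)  # MSD radix step on the next byte
    out = short
    for b in sorted(buckets):
        out.extend(adaptive_sort(buckets[b], depth + 1))
    return out

print(adaptive_sort([b"user:1002:name", b"user:1002:email", b"user:1001:name"]))
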
Early implementation

This might have been in 2003 before we were able to access work computers from home. I needed to get results that would convince management this was worth doing. I started my proof-of-concept on an old PowerPC based Mac I had at home that found a second life after I installed Yellow Dog Linux on it.

After some iteration I had good results on the PowerPC. So I brought my source code into work and repeated the test on other CPUs that I could find. On my desk I had a Sun workstation and a Windows PC with a 6-year-old Pentium 3 CPU (600 MHz, 128 KB L2 cache). Elsewhere I had access to a new Sun server with a 900 MHz UltraSPARC IV (or IV+) CPU and an HP server with a PA-RISC CPU.

I also implemented other state-of-the-art algorithms, including Alphasort, along with the old sort algorithm used by Oracle. From testing I learned:
  1. my new sort was much faster than other algorithms when keys were larger than 8 bytes
  2. my new sort was faster on my old Pentium 3 CPU than on the Sun UltraSPARC IV
The first was great news for me, the second was less than great news for Sun shareholders. I never learned why that UltraSPARC IV performance was lousy. It might have been latency to the caches.

Real implementation

Once I had great results, it was time for the functional and design specification reviews. I remember two issues:
  • the old sort was stable, the new sort was not
    • I don't remember how this concern was addressed
  • the new sort has a bad, but unlikely, worst-case
    • The problem here is the worst case, when quicksort picks the worst pivot every time it selects one. The new sort wasn't naive; it used the median from a sample of keys each time it selected a pivot (the sample size might have been 5). So I did the math to estimate the risk. Given that the numbers are big and the probabilities are small, I needed a library or tool that supported arbitrary-precision arithmetic and ended up using a Scheme implementation. The speedup in most cases justified the risk in a few cases.
And once I had this implemented within the Oracle DBMS I was able to compare it with the old sort. The new sort was often about 5 times faster than the old sort. I then compared it with SyncSort. I don't remember whether they had a DeWitt Clause so I won't share the results but I will say that the new sort in Oracle looked great in comparison.

The End

The new sort landed in 10gR2 and was featured in a white paper. I also got a short email from Larry Ellison thanking me for the work. A promotion or bonus would have to wait as you had to play the long game in your career at Oracle. And that was all the motivation I needed to leave Oracle -- first for a startup, and then to Google and Facebook.

After leaving Oracle, much of my time was spent on making MySQL better. Great open-source DBMS, like MySQL and PostgreSQL, were not good for Oracle's new license revenue. Oracle is a better DBMS, but not everyone needs it or can afford it.

No, You Can't (Officially) Reconnect Cut Pieces of Hue Strip Lights

Hopefully this saves someone else an hour of digging. Philips Hue has a comparison page for their strip lights. This table says that for the Ambiance Gradient lightstrips, “Cut pieces can be reconnected”. Their “Can you cut LED strip lights” page also says “Many strip lights are cuttable, and some even allow the cut parts to be reused and added to the strip light using a connector. Lightstrip V4 and many of the latest models will enable this level of customization.”

Lightstrip V4 has been out for six years, but at least as far as Hue’s official product line goes, the statement “Cut pieces can be reconnected” is not true. Hue’s support representative told me that Hue might someday release a connector which allows reconnecting cut pieces. However, that product does not exist yet, and they can’t say when it might be released.

In the meantime, there’s a company called Litcessory which makes third-party adapters for Hue strips. I haven’t tried them, but I think they might do the trick.

Running Databases on Kubernetes: A Practical Guide to Risks, Benefits, and Best Practices

As a database administrator, you are the guardian of the company’s most critical asset: its data. You live by performance, reliability, and security, ensuring every change maintains uptime and data integrity. That level of precision takes time, as every update, patch, and configuration is tested before it goes live. Meanwhile, application teams have fully embraced […]

January 01, 2026

MongoDB equivalent to FOR UPDATE SKIP LOCKED

SELECT ... FOR UPDATE SKIP LOCKED is a vendor-specific SQL feature available in several relational databases (e.g., PostgreSQL, Oracle, MySQL). It helps parallel workers avoid waiting on locked rows. MongoDB’s concurrency model uses optimistic concurrency: reads don’t block writes, and writes don’t block reads. To coordinate parallel processing, you can reserve a document by writing a lock field so other workers skip it.

I'll use an example discussed in the Reddit question "ACID read then write – Python":

Client in python, multi process. Each process picks and reads one document, calls some public APIs, and add data to the document and save it. Then next document. What is written can depend on the read data. Question is: in python, how can I create and configure transactions in the code to make sure no other process can read or write its current document from the moment a process starts reading it until done writing its additional data? This means concurrent reads should not happen…

In this example, I'll process messages based on their originating IP address. Multiple threads will enrich them with location data fetched from the public API at https://ip-api.com/.

Here is an example of an initial document:

{
  _id: ObjectId('6956e772baea71e37a818e73'),
  originatingIp: '1.1.1.1',
  location: null
}

Here is the document while it is being processed:

{
  _id: ObjectId('6956e772baea71e37a818e73'),
  originatingIp: '1.1.1.1',
  location: null,
  lock: {
    by: 'franck',
    until: datetime.datetime(2026, 1, 1, 22, 33, 10, 833000)
  }
}

Here is the same document after processing:

{
  _id: ObjectId('6956e772baea71e37a818e73'),
  originatingIp: '1.1.1.1',
  location: {
    status: 'success',
    country: 'Hong Kong',
    countryCode: 'HK',
    region: 'HCW',
    regionName: 'Central and Western District',
    city: 'Hong Kong',
    zip: '',
    lat: 22.3193,
    lon: 114.1693,
    timezone: 'Asia/Hong_Kong',
    isp: 'Cloudflare, Inc',
    org: 'APNIC and Cloudflare DNS Resolver project',
    as: 'AS13335 Cloudflare, Inc.',
    query: '1.1.1.1'
  }
}

Storing in-process information avoids long transactions that hide the current status and make troubleshooting difficult when the public API is slow.

Design

This script is designed as a complete, runnable demonstration of how to implement SELECT ... FOR UPDATE SKIP LOCKED-style parallel job claiming in MongoDB. The script will generate everything it needs, process it, and show the end state.

  • insert_test_docs() inserts test data with random IP addresses into a new collection "message", and creates a partial index to find the messages that still need processing ({location: null}).
  • claim_document() updates a message to claim it for processing, adding lock information so that another thread will not pick the same one, and then fetches the document. The criteria are that the message still needs processing ({location: null}) and is either not locked or its lock has expired (with a 1s grace period to account for clock skew).
  • fetch_location() is the call to the public API, here getting location information for an IP address.
  • process_document() calls claim_document() to get a message to process, with a lock. It calls fetch_location() and updates the document with the location. It ensures the lock is still in place before the update, then unsets it. Each thread runs in a loop, claiming and processing documents until the timeout.
  • main() calls those in sequence and displays the final documents.

This solution avoids explicit transactions, which is preferable because they would include a call to a public API with unpredictable response time. It also avoids using findOneAndUpdate, whose higher overhead comes from storing full pre- and post-images of documents for retryable operations. For large documents—possible in real workloads, even if not shown in this demo—this would lead to significant write amplification. Finally, setting an expiration timestamp allows automatic re-processing if a message fails.

Code

Below is the complete Python program, which you can test using different numbers of documents and threads:

import os
import random
import socket
import threading
import time
from datetime import datetime, timedelta
import requests
from pymongo import MongoClient


# Mongo connection and collection
client = MongoClient("mongodb://127.0.0.1:27017/?directConnection=true")
db = client.test
messages = db.message

#  Test settings (the test inserts documents, then runs the processing threads for some duration)
DOCUMENTS = 10 # number of documents created initially
THREADS = 5    # number of threads that loop to claim a document
SECONDS = 15   # threads stop trying to claim new documents after this many seconds

# Worker identity (to identify the thread, and set an expiration on the lock)
WORKER_ID = f"{socket.gethostname()}-{os.getpid()}"
LOCK_DURATION = timedelta(seconds=60) # assumes processing completes within this duration; if not, another worker may claim the document and this one will not update it

# Get the time
def utcnow(): return datetime.utcnow()
MAX_CLOCK_SKEW=timedelta(seconds=1) # used as a grace period when lock is expired

# --- Prepare test messages (with random generated IP) ---
def insert_test_docs():
    # Drop the collection completely (removes data + indexes)
    messages.drop()
    # Create the partial index for unprocessed docs  (they have location = null )
    messages.create_index(  [("lock.until", 1)],  partialFilterExpression={"location": None}  )
    # Generate random IPs for the test
    ips = [
        ".".join(str(random.randint(1, 255)) for _ in range(4))
        for _ in range(DOCUMENTS)
    ]
    # Explicitly set location=None to match the partial index filter
    docs = [
        { "originatingIp": ip, "location": None  } # A null location is the marker to process it
        for ip in ips
    ]
    messages.insert_many(docs)
    print(f"[STARTUP] Inserted {DOCUMENTS} test docs into 'message'")
    for doc in messages.find({}, {"_id": 0, "originatingIp": 1, "location": 1}):
        print(doc)


# --- Claim a message ---
def claim_document():
    now = utcnow()
    lock_expiry = now + LOCK_DURATION
    token = random.randint(1000, 9999)  # unique lock token for extra safety
    # Atomic lock claim: match unlocked or steal locks expired
    result = messages.update_one(
        {
          "$and": [
            # the location is not set
            { "location": None },
            # the document is not locked, or locked expired including grace period
            {  "$or": [  { "lock": { "$exists": False } },  { "lock.until": { "$lt": now - MAX_CLOCK_SKEW } }  ]  }
          ]
        },
        { "$set": {  "lock": {  "by": WORKER_ID,  "until": lock_expiry,  "token": token  }  }}
    )
    if result.modified_count == 0:
        return None
    # Fetch exactly the doc we locked — match by worker, expiry, AND token
    doc = messages.find_one({  "lock.by": WORKER_ID,  "lock.until": lock_expiry,  "lock.token": token  })
    if doc:
        print(f"[{WORKER_ID}] {threading.current_thread().name} claimed IP {doc['originatingIp']} with token={token}")
    else:
        print(f"[{WORKER_ID}] {threading.current_thread().name} claim succeeded but fetch failed — possible race?")
    return doc

# --- Call the public API ---
def fetch_location(ip):
    url = f"http://ip-api.com/json/{ip}"
    try:
        resp = requests.get(url, timeout=30)
        if resp.status_code == 200:
            return resp.json()
        print(f"[API] Error: HTTP {resp.status_code} for {ip}")
        return None
    except Exception as e:
        print(f"[API] Exception for {ip}: {e}")
        return None

# --- Process messages in a loop ---
def process_document():
    start_time = time.time()
    timeout = SECONDS  # seconds
    thread_name = threading.current_thread().name
    while True:
        # Try to claim a doc
        doc = claim_document()
        if doc:
            # We successfully claimed a doc — process it
            ip = doc["originatingIp"]
            location_data = fetch_location(ip)
            if not location_data:
                print(f"[{WORKER_ID}] {thread_name} failed to fetch location for {ip}")
                return
            # Final update only if lock is still valid
            now = utcnow()
            result = messages.update_one(
                {
                    "_id": doc["_id"],
                    "lock.by": WORKER_ID,
                    "lock.until": {"$gte": now},
                    "lock.token": doc["lock"]["token"]
                },
                {
                    "$set": {"location": location_data},
                    "$unset": {"lock": ""}
                }
            )
        # Whether or not a doc was claimed, check elapsed time, then wait before retrying
        elapsed = time.time() - start_time
        if elapsed >= timeout:
            print(f"[{WORKER_ID}] {thread_name} exiting after {elapsed:.2f}s")
            return
        time.sleep(5)  # avoid hammering DB and the public API

# --- Initialize and run multiple processing threads ---
def main():
    print(f"\nInserting documents")
    insert_test_docs()
    print(f"\nStarting threads")
    threads = []
    for i in range(THREADS):
        tname = f"T{i}"
        t = threading.Thread(target=process_document, name=tname)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    print(f"\n[{WORKER_ID}] Check final documents:")
    for doc in messages.find({}, {"originatingIp": 1, "location.query": 1, "location.country": 1, "location.message": 1, "lock.by": 1, "lock.until": 1}):
        print(doc)

if __name__ == "__main__":
    main()

Technical Insights

MongoDB’s storage engine guarantees atomicity for each update_one through its WriteUnitOfWork and RecoveryUnit mechanisms. However, maintaining read consistency across multiple operations requires application-level coordination. In this implementation, that coordination is provided by an atomic claim with conditional criteria, ensuring that only one worker can lock an unprocessed or expired document at a time.

Several safeguards mitigate race conditions. The claim step narrows matches using the worker ID, lock expiry, and a random token. The final update then re-verifies all these fields before committing changes, preventing stale or stolen locks from being applied. Lock expiration enables automatic recovery from failures, and a small grace window accounts for clock skew in distributed systems.

Write conflicts during concurrent updates are automatically resolved at the storage layer via optimistic concurrency control. This ensures correctness without blocking other operations. The result is a robust, non-blocking parallel processing workflow that preserves document-level ACID guarantees while scaling effectively in shared or cloud environments.

In this design, each thread processes one message at a time, in index order. Enforcing strict global message ordering would be more complex. The primary goal here is the scalability of the parallel processing.

Final Recommendation

When migrating from PostgreSQL to MongoDB—like between any two databases—avoid a direct feature-by-feature mapping, because the systems are fundamentally different. SKIP LOCKED works around blocking FOR UPDATE reads in PostgreSQL, while reads and writes do not block in MongoDB. Instead of replicating another database behavior, clarify the business requirement and design the most appropriate solution. In this example, rather than relying on generic transaction control like SQL, we modeled object states—such as claim acquisition and expiration—and store that state directly in the documents.