a curated list of database news from authoritative sources

January 10, 2026

Which Document class is best to use in Java to read MongoDB documents?

TL;DR: the answer is in the title, use Document.

BSON is a serialization format, similar to protobuf, designed for efficient document storage on disk and transfer over the network as a byte stream. Instead of scanning and rewriting the entire byte sequence to access its contents (fields, arrays, subdocuments), you work with an in-memory object that exposes methods to read and write fields efficiently. On MongoDB's server side, that's the mutable BSON object. On the client side, the drivers provide a similar API. Here are the five object types that implement the BSON interface in Java:

  • Document is the recommendation for most applications. It provides the best balance of flexibility, ease of use, and functionality.

Only consider the other classes when you have specific requirements:

  • BsonDocument: When you need strict BSON type safety rather than application types.

  • RawBsonDocument: When you need to access the raw bytes of the document rather than field values.

  • JsonObject: When working exclusively with JSON strings, as plain text.

  • BasicDBObject: Only for legacy code migration.

RawBsonDocument is used internally—for example, for client-side encryption and change streams—whereas the other classes can all be used directly in MongoDB driver operations. Your choice mainly impacts how you construct, manipulate, and access document data in your application code. Each of them is documented in the MongoDB Java Driver API reference.

I reviewed the documentation and code to better understand their differences, especially when comparing with other databases that have different designs and APIs. In this post, I provide a detailed description of these classes: what they are, how they are implemented, what they provide, and when to use them.

Detailed description

Here's a detailed comparison of the five document classes available in the MongoDB Java Driver and when to use each:

Document (org.bson)

Document is a flexible representation of a BSON document that implements Map<String, Object>. It uses a LinkedHashMap<String, Object> internally to maintain insertion order.

Key characteristics:

  • Loosely-typed: Values are stored as Object, allowing you to use standard Java types (String, Integer, Date, etc.)
  • Flexible: Easy to work with dynamically structured documents
  • Map interface: Provides all standard Map operations

Document is the recommended default choice for most applications. Use it when you want a flexible and concise data representation that's easy to work with.
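For illustration, here is a minimal sketch of inserting and reading a document with Document; the connection string, database, and field names are assumptions for this example, not something taken from the driver documentation:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.Date;
import java.util.List;

public class DocumentExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");

            // Build a document with plain Java types; insertion order is preserved
            Document order = new Document("_id", "O0001")
                    .append("status", "Processing")
                    .append("orderDate", new Date())
                    .append("customer", new Document("name", "Franck Pachot")
                            .append("address", new Document("country", "Switzerland")));
            orders.insertOne(order);

            // Read it back and access fields with Map-style getters
            Document found = orders.find(
                    Filters.eq("customer.address.country", "Switzerland")).first();
            String name = found.getEmbedded(List.of("customer", "name"), String.class);
            System.out.println(name);
        }
    }
}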

BsonDocument (org.bson)

BsonDocument is a type-safe container for BSON documents that implements Map<String, BsonValue>. It also uses a LinkedHashMap, but stores BsonValue types.

Key characteristics:

  • Type-safe: All values must be wrapped in BSON library types (BsonString, BsonInt32, BsonDocument, etc.)
  • Stricter API: Provides compile-time type safety but requires more verbose code
  • Map interface: Implements Map<String, BsonValue>

Use BsonDocument when you need a type-safe API and want explicit control over BSON types. This is particularly useful when you need to ensure precise type handling or when working with APIs that require BsonDocument.
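For comparison, here is a minimal sketch of the same kind of data expressed with BsonDocument; the field names and values are just illustrative. Every value must be wrapped in a BsonValue subtype, and the getters return BSON wrapper types rather than plain Java types:

import org.bson.BsonDateTime;
import org.bson.BsonDocument;
import org.bson.BsonString;

public class BsonDocumentExample {
    public static void main(String[] args) {
        // Every value is an explicit BSON type, so there is no ambiguity
        // about how it will be serialized
        BsonDocument order = new BsonDocument("_id", new BsonString("O0001"))
                .append("status", new BsonString("Processing"))
                .append("orderDate", new BsonDateTime(System.currentTimeMillis()))
                .append("customer", new BsonDocument("name", new BsonString("Franck Pachot")));

        // Getters return BSON wrapper types, not plain Java types
        BsonString status = order.getString("status");
        System.out.println(status.getValue());

        // Parsing from (extended) JSON is also supported
        BsonDocument parsed = BsonDocument.parse("{\"qty\": {\"$numberInt\": \"1\"}}");
        System.out.println(parsed.getInt32("qty").getValue());
    }
}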

RawBsonDocument (org.bson)

RawBsonDocument is an immutable BSON document represented using only raw bytes. It stores the BSON document as a byte array without parsing it.

Key characteristics:

  • Immutable: All mutation operations throw UnsupportedOperationException
  • Lazy parsing: Data is only parsed when accessed, making it very efficient for pass-through scenarios, though not for accessing each individual field
  • Memory efficient: Stores raw bytes, avoiding object allocation overhead
  • Can decode to other types: Provides decode() method to convert to other document types when needed

Use RawBsonDocument when you need maximum performance and memory efficiency for whole-document operations. It is particularly useful when reading documents from MongoDB and passing them to another system unchanged, when working with large documents that you don’t need to parse, when building high-performance data pipelines where parsing overhead matters, and when you need an immutable document representation.
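A small sketch of the pass-through and lazy-access behavior described above; the connection string, collection, and field names are assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.RawBsonDocument;
import org.bson.codecs.DocumentCodec;

public class RawBsonExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Ask the driver to hand documents back as raw BSON bytes
            MongoCollection<RawBsonDocument> orders =
                    client.getDatabase("shop").getCollection("orders", RawBsonDocument.class);

            RawBsonDocument raw = orders.find().first();

            // Lazy access: the byte stream is scanned and only this value is decoded
            String status = raw.getString("status").getValue();
            System.out.println(status);

            // Mutation is not allowed; decode to a Document if you need to modify it
            Document mutable = raw.decode(new DocumentCodec());
            mutable.put("status", "Shipped");
        }
    }
}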

JsonObject (org.bson.json)

JsonObject is a wrapper class that holds a JSON object string. It simply stores the JSON as a String.

Key characteristics:

  • Does not implement Map: It's just a string wrapper with validation
  • No parsing required: Avoids conversion to Map structure if you're just working with JSON
  • JSON-focused: Designed for applications that primarily work with JSON strings
  • Supports Extended JSON: Works with MongoDB Extended JSON format

Use JsonObject when you want to work directly with JSON strings and avoid the overhead of converting to and from Map objects. This is ideal for REST APIs that consume and produce JSON, for logging or persisting documents as JSON strings, and for applications that primarily handle JSON and do not require programmatic field-level access.
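A minimal sketch, assuming the default codec registry (which handles JsonObject in recent driver versions) and made-up database, collection, and document contents:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.json.JsonObject;

public class JsonObjectExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // The driver can read and write JsonObject directly (driver 4.2+)
            MongoCollection<JsonObject> orders =
                    client.getDatabase("shop").getCollection("orders", JsonObject.class);

            // The JSON string is validated and stored as-is, with no Map conversion
            orders.insertOne(new JsonObject("{\"_id\": \"O0002\", \"status\": \"Processing\"}"));

            // getJson() returns the string, ready to be sent to a REST client or a log
            String json = orders.find().first().getJson();
            System.out.println(json);
        }
    }
}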

BasicDBObject (com.mongodb)

BasicDBObject is a legacy BSON object implementation that extends BasicBSONObject and implements the DBObject interface.

Key characteristics:

  • Legacy class: Exists for backward compatibility with older driver versions
  • Does not implement Map: Only implements the DBObject interface, lacking modern Map convenience methods
  • Binary compatibility concerns: Implements an interface rather than extending a class, which can cause compatibility issues

Only use BasicDBObject when migrating from a legacy driver version (pre-3.0). The documentation explicitly recommends avoiding this class for new development due to its limitations.

Conversion Between Types

All of these classes implement the Bson interface, which allows them to be used interchangeably in MongoDB operations (though not with the same performance). You can also convert between types, as sketched after this list:

  • BsonDocument.parse(json) to create from JSON
  • RawBsonDocument.decode(codec) to convert RawBsonDocument to another type
  • Document can be converted to BsonDocument through codec operations
  • JsonObject.getJson() to extract the JSON string
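Here is a small sketch of those conversions, assuming the default codec registry for the Document-to-BsonDocument case:

import com.mongodb.MongoClientSettings;
import org.bson.BsonDocument;
import org.bson.Document;
import org.bson.RawBsonDocument;
import org.bson.codecs.BsonDocumentCodec;
import org.bson.codecs.DocumentCodec;
import org.bson.json.JsonObject;

public class ConversionExample {
    public static void main(String[] args) {
        // JSON string -> BsonDocument
        BsonDocument fromJson = BsonDocument.parse("{\"status\": \"Processing\"}");

        // Document -> BsonDocument through the default codec registry
        Document doc = new Document("status", "Processing");
        BsonDocument asBson = doc.toBsonDocument(BsonDocument.class,
                MongoClientSettings.getDefaultCodecRegistry());

        // BsonDocument -> RawBsonDocument (serialize to bytes) -> Document (decode)
        RawBsonDocument raw = new RawBsonDocument(asBson, new BsonDocumentCodec());
        Document decoded = raw.decode(new DocumentCodec());

        // JsonObject -> JSON string
        String json = new JsonObject("{\"status\": \"Processing\"}").getJson();

        System.out.println(fromJson + " " + decoded + " " + json);
    }
}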

Comparing RawBsonDocument and Document

RawBsonDocument stores BSON as a raw byte array and does not deserialize it in advance. When you access a field, it creates a BsonBinaryReader and scans the document sequentially, reading each field’s type and name until it finds the requested key. Only the matching field is decoded using RawBsonValueHelper.decode, while all other fields are skipped without parsing. For nested documents and arrays, it reads only their sizes and wraps the corresponding byte ranges in new RawBsonDocument or RawBsonArray instances, keeping their contents as raw bytes. This approach provides fast access for a single field lookup, while being memory-efficient and keeping the document immutable, which is ideal for large documents where only a few fields are needed or for documents that are mostly passed through without inspection.

In contrast, Document uses a fully deserialized LinkedHashMap<String, Object>. When a Document is created, all fields are eagerly parsed into Java objects. Field access and containsKey operations are simple HashMap lookups, and the document is fully mutable, supporting standard map operations such as put, remove, and clear. This design consumes more memory but is better suited to small or medium-sized documents, scenarios where many fields are accessed, or cases where the document needs to be modified frequently.

Finally, Document doesn't use RawBsonDocument for parsing or accessing fields since doing so would be inefficient, and the two serve different purposes.

Comparison with Oracle (OSON) and PostgreSQL (JSONB)

Neither Oracle nor PostgreSQL uses BSON; they use OSON and JSONB respectively, so there's no BsonDocument or RawBsonDocument equivalent.

In Oracle’s JDBC driver, the closest equivalent to a Document is OracleJsonObject, one of the OracleJsonValue types, which can be exposed directly as a javax.json.JsonObject or a mapped domain object. This API works directly on the underlying OSON bytes without fully parsing the entire document into an intermediate data structure. OSON is more than a raw serialization: it carries its own local dictionary of distinct field names, a sorted array of hash codes for those names, and compact field ID arrays and value offsets, enabling the driver to locate a field in place by binary-searching the field ID array and jumping to the right offset.

If JSON text is needed instead of the object model, the equivalent is simply to use ResultSet.getString(), which will convert the OSON image to JSON text on the client.

PostgreSQL’s JDBC driver, by contrast, offers no native Java JSON object API for either json or jsonb columns: values are always returned as text, so the application must parse the string into its own document model using a separate JSON library. Even when using PostgreSQL’s binary JSONB storage, none of the binary efficiency crosses the wire (See JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols), and client code still performs a full parse before accessing individual fields.
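As a minimal sketch of that last point (the table name, column name, and credentials are assumptions), reading a jsonb column through the PostgreSQL JDBC driver gives you text that the application still has to parse itself:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JsonbReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection and schema: an "orders" table with a jsonb "data" column
        try (Connection pg = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/shop", "app", "secret");
             PreparedStatement ps = pg.prepareStatement(
                "select data from orders where id = ?")) {
            ps.setInt(1, 1);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    // jsonb arrives as text over the wire; the driver has no document API
                    String jsonText = rs.getString(1);
                    // Field-level access requires a separate JSON library, e.g.:
                    // JsonNode order = new ObjectMapper().readTree(jsonText);
                    System.out.println(jsonText);
                }
            }
        }
    }
}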

Conclusion

MongoDB’s main advantage for modern applications—whatever the data types or workloads—is the ability to work with data directly through your domain objects, without an intermediate object-relational mapping layer or view. Use the Document class as your document object model (DOM). It offers flexibility, map-style access, and a natural Java interface, while the driver transparently converts BSON from the network into objects your application can use immediately.

January 09, 2026

Cloudspecs: Cloud Hardware Evolution Through the Looking Glass

This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware.

TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly.


CPU Trends

Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance u7in now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS Graviton. Without Graviton, the gain drops to roughly 2x. For in-memory database benchmarks, gains were even lower (2x–2.5x), likely due to memory and cache latency bottlenecks.

On-prem hardware comparison shows that this stagnation is not cloud price gouging. Historically, Moore's Law and Dennard scaling doubled cost-performance every two years (which would have compounded to a 32x gain over a decade). However, an analysis of on-premise AMD server CPUs reveals a similar slump: only a 1.7x gain from 2017 to 2025.


Memory Trends

DRAM capacity per dollar has effectively flatlined. The only significant improvement was the 2016 introduction of memory-optimized x instances, which offered ~3.3x more GiB-hours/$ than compute-optimized peers. While absolute single-socket bandwidth jumped ~5x (93 GiB/s to 492 GiB/s) as servers moved from DDR3 to DDR5, the cost-normalized gain is only 2x.


Historical data suggests commodity DRAM prices dropped 3x over the decade. But in the last three months, due to AI-driven demand, DDR5 prices rose sharply, further limiting effective memory gains.


Network Trends

We have good news here, finally. Network bandwidth per dollar exploded by 10x. And absolute speeds went from 10 Gbit/s to 600 Gbit/s (60x).


These gains were not universal though. Generic instances saw little change. The gains were driven by network-optimized n instances (starting with the c5n in 2018) powered by proprietary Nitro cards.


NVMe Trends

NVMe SSDs are the biggest surprise. Unlike CPUs and memory, where cloud trends mirror on-prem hardware, NVMe performance in AWS has largely stagnated. The first NVMe-backed instance family, i3, appeared in 2016. As of 2025, AWS offers 36 NVMe instance families. Yet the i3 still delivers the best I/O performance per dollar by nearly 2x. 


SSD capacity has stagnated since 2019 and I/O throughput since 2016. This sharply contrasts with on-prem hardware, where SSD performance doubled twice (PCIe 4 and PCIe 5) in the same timeframe. The gap between cloud and on-premise NVMe is widening rapidly.

This price/performance gap likely explains the accelerating push toward disaggregated storage. When local NVMe is expensive and underperforming, remote storage starts to look attractive. The paper speculates that with network speeds exploding and NVMe stagnating, architectures may shift further. For systems like Snowflake, using local NVMe for caching might no longer be worth the complexity compared to reading directly from S3 with fast networks.


Discussion

I think the main takeaway is that uniform hardware scaling in the cloud is over. Moore's Law no longer lifts all boats. Performance gains now come from specialization (e.g., Graviton, Nitro, accelerators), especially in networking.

In my HPTS 2024 review, I noted that contrary to the deafening AI hype, the real excitement in the hallways was about hardware/software codesign. This paper validates that sentiment. With general-purpose CPU and memory cost-performance stagnating, future databases must be tightly integrated with specialized hardware and software capabilities to provide value. I think the findings here will refuel that trend.

A key open question is why massive core counts deliver so little value. Where is the performance lost? Possible explanations include memory bandwidth limits, poor core-to-memory balance, or configuration mismatches. But I think the most likely culprit is software. Parallel programming remains hard, synchronization is expensive, and many systems fail to scale beyond a modest number of cores. We may be leaving significant performance on the table simply because our software cannot effectively utilize the massive parallelism now available.

The paper comes with an interactive tool, Cloudspecs, built on DuckDB-WASM (yay!). This allows you to run SQL queries over the dataset directly in the browser to visualize these trends. The figures in the PDF actually contain clickable link symbols that take you to the specific query used to generate that chart. Awesome reproducibility!

Aleksey and I did a live-reading of the paper. As usual, we had a lot to argue about. I'll add a recording of our discussion on YouTube when it becomes available, and here is a link to my annotated paper.

January 08, 2026

Unlock Amazon Aurora’s Advanced Features with Standard JDBC Driver using AWS Advanced JDBC Wrapper

In this post, we show how you can enhance your Java application with the cloud-based capabilities of Amazon Aurora by using the JDBC Wrapper. Simple code changes shared in this post can transform a standard JDBC application to use fast failover, read/write splitting, IAM authentication, AWS Secrets Manager integration, and federated authentication.

Postgres vs tproc-c on a small server

This is my first post with results from tproc-c using HammerDB. This post has results for Postgres. 

tl;dr - across 8 workloads (low and medium concurrency, cached database to IO-bound)

  • there might be a regression for Postgres 14.20 and 15.15 in one workload
  • there are improvements, some big, for Postgres 17 and 18 in most workloads

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18, the config file is named conf.diff.cx10a_c8r32; the configs are as similar as possible and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method='sync', which matches behavior in earlier Postgres versions.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
  • vu=6, w=100 - 6 virtual users, 100 warehouses
  • vu=1, w=1000 - 1 virtual user, 1000 warehouses
  • vu=6, w=1000 - 6 virtual users, 1000 warehouses
  • vu=1, w=2000 - 1 virtual user, 2000 warehouses
  • vu=6, w=2000 - 6 virtual users, 2000 warehouses
  • vu=1, w=4000 - 1 virtual user, 4000 warehouses
  • vu=6, w=4000 - 6 virtual users, 4000 warehouses
The benchmark is run by this script which depends on scripts here.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes

Results

All artifacts from the tests are here. A spreadsheet with the charts and numbers is here.

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow, the y-axis starts at 0.9 to improve readability. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for a given version / NOPM for Postgres 12.22)

Results: vu=1, w=100

Summary:
  • no regressions, no improvements
Results: vu=6, w=100

Summary:
  • no regressions, no improvements
Results: vu=1, w=1000

Summary:
  • no regressions, improvements in Postgres 17 and 18
Results: vu=6, w=1000

Summary:
  • no regressions, improvements in Postgres 16, 17 and 18
Results: vu=1, w=2000

Summary:
  • possible regressions in Postgres 14 and 15, improvements in 13, 16, 17 and 18
Results: vu=6, w=2000

Summary:
  • no regressions, improvements in Postgres 13 through 18
Results: vu=1, w=4000

Summary:
  • no regressions, improvements in Postgres 13 through 18
Results: vu=6, w=4000

Summary:
  • no regressions, improvements in Postgres 16 through 18

Introducing TRAE SOLO integration with Supabase

TRAE IDE now integrates with Supabase, enabling developers to manage their entire backend infrastructure without leaving the IDE. Browse databases, manage storage, configure auth, and more.

The Sauna Algorithm: Surviving Asynchrony Without a Clock

While sweating it out in my gym's sauna recently, I found a neat way to illustrate the happened-before relationship in distributed systems. Imagine I suffer from a medical condition called dyschronometria, which makes me unable to perceive time reliably, such that 10 seconds and 10 minutes feel exactly the same to me.

In this scenario, the sauna lacks a visible clock. I'm flying blind here, yet I want to leave after a healthy session. If I stay too short, I get no health benefits. If I stay too long, I risk passing out on the floor. The question becomes: How do I, a distributed node with no local clock, ensure operating within a safety window in an asynchronous environment?

Thankfully, the sauna has a uniform arrival of people. Every couple of minutes, a new person walks in. These people don't suffer from dyschronometria and they stay for a healthy session, roughly 10 minutes.

My solution is simple: I identify the first person to enter after me, and I leave when he leaves. I don't measure time at all. Instead, I rely solely on the causal relationship between this person's arrival and departure. The presence of this other person acts as a reference, letting me safely synchronize my action in an entirely asynchronous setting.

Murat's Sauna Algorithm

  1. Enter the sauna.
  2. Identify the very next person to enter (let's call him A).
  3. Wait until A gets up to leave.
  4. Exit immediately after A.

By anchoring my exit to the departure of a person who arrived after me, I guarantee that my stay overlaps entirely with his healthy duration, plus the delta between my arrival and his. I have successfully bounded my sauna/execution time using causal links rather than a physical timer.

Now, for the distributed systems purists who might object to step 4 (arguing that "exiting immediately" implies measuring a time interval of zero): I can mend this. I exit after Person A leaves, but before the next person leaves. That keeps us in the realm of strictly causal reasoning.
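Here is a tiny sketch of that rule as code, purely for illustration: the event model (a list of arrivals and departures observed after my own entry) and the names used are my own assumptions, not something from the post.

import java.util.List;

/** A minimal sketch of the sauna rule: anchor my exit to the first person
 *  who arrives after me, using only observed events (no clock). */
public class SaunaAlgorithm {

    // Hypothetical event model: each event is an arrival or departure of a person
    record Event(String person, boolean arrival) {}

    /** Returns the index of the event after which I should leave,
     *  given events observed after my own entry, or -1 if I never get a signal. */
    static int exitPoint(List<Event> eventsAfterMyEntry) {
        String anchor = null;                      // the first person to enter after me
        for (int i = 0; i < eventsAfterMyEntry.size(); i++) {
            Event e = eventsAfterMyEntry.get(i);
            if (anchor == null && e.arrival()) {
                anchor = e.person();               // step 2: pick the next arrival as my anchor
            } else if (e.person().equals(anchor) && !e.arrival()) {
                return i;                          // steps 3-4: leave right after the anchor leaves
            }
        }
        return -1;                                 // anchor never left while I was watching
    }

    public static void main(String[] args) {
        List<Event> observed = List.of(
                new Event("A", true),   // A enters after me, so A becomes my anchor
                new Event("B", true),
                new Event("B", false),  // B leaving does not matter
                new Event("A", false)); // A leaves, so I leave now
        System.out.println("Leave after event #" + exitPoint(observed));  // prints 3
    }
}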

Consider what would go wrong if I anchored my departure to someone already in the sauna when I entered. That person might be near the end of their session, causing me to leave in just a few seconds. In distributed systems, this is akin to reading the state of an outdated node to make a decision, leading to inconsistent outcomes. Anchoring only to events that occur after my entry ensures I account for all relevant causality.

There is, of course, a safe way to rely on the people already inside. I could take a snapshot of the room upon entry, identify everyone present, and refuse to leave until the entire set of them has departed. This guarantees a sufficient duration (assuming at least one person recently arrived), but it requires more memory. Instead of tracking a single reference point (the newcomer), I would have to maintain state for everyone in the sauna.

Let's connect back to distributed systems. In an asynchronous system, nodes rely on causal relationships between events to order them. Just as I watch the newcomer to determine my exit, a node can observe the messages sent by other nodes to determine safe ordering. For example, in a distributed database, a transaction may want to commit only after all causally prior transactions have completed, ensuring consistency without requiring synchronized clocks. Reasoning about happened-before relationships allows asynchronous systems to coordinate safely without synchronized clocks. See discussions on logical clocks, vector clocks, and hybrid logical clocks in my previous posts.

Before anybody gets too concerned, no, I don't actually suffer from dyschronometria. However, back when I was a professor, my PhD students often joked that I operated on logical clocks rather than physical ones.

Also, in reality, the gym's sauna does have a clock visible from the door. I check it, and aim to stay for about 15 minutes. But I still insist on using my sauna algorithm, well at least the first three steps, because I am a competitive SOB. I try to outcompete the average person and, knock on sauna-bench wood, I haven't had to break my rule yet. It would be absolutely catastrophic if everyone started using my algorithm, though. We'd all be waiting for the next person to leave, resulting in a deadlock. Maybe the memory-costly snapshot algorithm is better after all, as it avoids this problem.

Related post: Consistent snapshot analogies

January 07, 2026

SSDs, power loss protection and fsync latency

This post has results that measure the impact of calling fsync (or fdatasync) per write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option.

The primary point of this post is to document the claim:

For an SSD without power loss protection, writes are fast but fsync is slow.

The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise. 

tl;dr

  • for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
  • with a consumer SSD writes are fast but fsync is often slow
  • use an enterprise SSD if possible, if not run tests to understand fsync and fdatasync latency

InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC

When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source like I used to, I did browse it for this blog post and then I looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following when it is set to O_DIRECT:

  1. Do one large write to the doublewrite buffer, call fsync on that file
  2. Do the batch of in-place (16kb) page writes
  3. Call fsync once per database file that was written by step 2

When set to O_DIRECT_NO_FSYNC, the frequency of calls to fsync is greatly reduced; they are only done in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading WRT the following sentence. I don't think that InnoDB ever does an fsync after each write. It can do an fsync after each batch of writes:

O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.

Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see the comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less-than-stellar hardware.

The following table uses results while running the Insert Benchmark for InnoDB to compute the ratio of fsyncs per write using the SHOW GLOBAL STATUS counters:
Innodb_data_fsyncs / Innodb_data_writes

And from this table a few things are clear. First, there isn't an fsync per write with O_DIRECT but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC. 

5.7.44  8.0.44
.01046  .00729  O_DIRECT
.00172  .00053  O_DIRECT_NO_FSYNC

Power loss protection

I am far from an expert on this topic, but most SSDs have a write buffer that makes small writes fast. And one way to achieve speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, while some of the consumer SSDs claim to make a best effort to flush writes from the write buffer on power loss.

One solution to avoiding risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitor or other HW that provides the power loss protection. Note that power loss protection is often abbreviated as PLP.

For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow then calling fsync more frequently in InnoDB will hurt performance.
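To make the fsync-versus-fdatasync distinction concrete, here is a rough Java sketch of per-sync latency measurement. It is not the fio setup used below (no O_DIRECT, single thread, small sample count, made-up file name); it only illustrates that FileChannel.force(true) roughly corresponds to fsync and force(false) to fdatasync.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncLatency {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("synctest.dat");
        int writes = 100;
        ByteBuffer buf = ByteBuffer.allocate(16 * 1024);   // 16kb writes, like the fio runs

        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (boolean metadata : new boolean[]{true, false}) {
                long totalNanos = 0;
                for (int i = 0; i < writes; i++) {
                    buf.rewind();
                    ch.write(buf);
                    long start = System.nanoTime();
                    ch.force(metadata);                    // true ~ fsync, false ~ fdatasync
                    totalNanos += System.nanoTime() - start;
                }
                System.out.printf("%s: %.1f usec/sync%n",
                        metadata ? "fsync" : "fdatasync", totalNanos / 1000.0 / writes);
            }
        }
    }
}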

Results from fio

I used this fio script to measure performance for writes for files opened with O_DIRECT. The test was run twice per configuration, for 5 minutes per run followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. The configurations tested were:

  • O_DIRECT without fsync, 16kb writes
  • O_DIRECT with an fsync per write, 16kb writes
  • O_DIRECT with an fdatasync per write, 16kb writes
  • O_DIRECT without fsync, 2M writes
  • O_DIRECT with an fsync per write, 2M writes
  • O_DIRECT with an fdatasync per write, 2M writes
Results from all tests are here. I did the test on several servers:
  • dell32
    • a large server I have at home. The SSD is a Crucial T500 2TB using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
  • gcp
    • a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPs and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
  • hetz
    • an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled.
  • ser7
    • a mini PC I have at home. The SSD is a Samsung 990 Pro, which is a consumer SSD.
  • socket2
    • a 2-socket server I have at home. The SSD is a Samsung PM-9a3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.
Results: overview

All of the results are here.

This table lists fsync and fdatasync latency per server:
  • for servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7 that uses a Samsung 990 Pro than on the dell that uses a Crucial T500. This is to be expected given that the T500 has PLP while the 990 Pro does not.
  • sync latency is much lower on servers with enterprise SSDs
  • sync latency after 2M writes is sometimes much larger than after 16kb writes
  • for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices, the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.

--- Sync latency in microseconds for sync after 16kb writes

dell    hetz    ser7    socket2
891.1   12.4    2974.2  1.6     fsync
447.4    9.8    2783.2  0.7     fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
56.2    39.5            738.1   635.0   fsync
28.1    29.0             46.8    46.0   fdatasync

--- Sync latency in microseconds for sync after 2M writes

dell    hetz    ser7    socket2
980.1   58.2    5396.8  139.1   fsync
449.7   10.8    3508.2    2.2   fdatasync

gcp
local devices           hyperdisk
ext-4   xfs             ext-4   xfs
1020.4  916.8           821.2   778.9   fsync
 832.4  809.7            63.6    51.2   fdatasync

Results: dell

Summary:
  • Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
  • This server uses a consumer SSD so high sync latency is expected
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)

16 KB writes
w/s     MB/s    sync    test
43400   646.6   0.0     no-sync
43500   648.5   0.0     no-sync
-
1083    16.1    891.1   fsync
1085    16.2    889.2   fsync
-
2100    31.3    447.4   fdatasync
2095    31.2    448.6   fdatasync

2 MB writes
w/s     MB/s    sync    test
2617    4992.5  0.0     no-sync
2360    4502.3  0.0     no-sync
-
727     1388.5  980.1   fsync
753     1436.2  942.5   fsync
-
1204    2297.4  449.7   fdatasync
1208    2306.0  446.9   fdatasync

Results: gcp

Summary
  • Local NVMe devices have lower sync latency and more throughput with and without a sync per write at low concurrency (1 fio job).
  • At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPs and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
  • fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
--- ext-4 and local devices

16 KB writes
w/s     MB/s    sync    test
10100   150.7   0.0     no-sync
10300   153.5   0.0     no-sync
-
6555    97.3    56.2    fsync
6607    98.2    55.1    fsync
-
8189    122.1   28.1    fdatasync
8157    121.1   28.2    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
388     741.0   1020.4  fsync
388     741.0   1012.7  fsync
-
390     744.8   832.4   fdatasync
390     744.8   869.6   fdatasync

--- xfs and local devices

16 KB writes
w/s     MB/s    sync    test
9866    146.9   0.0     no-sync
9730    145.0   0.0     no-sync
-
7421    110.6   39.5    fsync
7537    112.5   38.3    fsync
-
8100    121.1   29.0    fdatasync
8117    121.1   28.8    fdatasync

2 MB writes
w/s     MB/s    sync    test
390     744.8   0.0     no-sync
390     744.8   0.0     no-sync
-
389     743.9   916.8   fsync
389     743.9   919.1   fsync
-
390     744.8   809.7   fdatasync
390     744.8   806.5   fdatasync

--- ext-4 and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2093    31.2    0.0     no-sync
2068    30.8    0.0     no-sync
-
804     12.0    738.1   fsync
798     11.9    740.6   fsync
-
1963    29.3    46.8    fdatasync
1922    28.6    49.0    fdatasync

2 MB writes
w/s     MB/s    sync    test
348     663.8   0.0     no-sync
367     701.0   0.0     no-sync
-
278     531.2   821.2   fsync
271     517.8   814.1   fsync
-
358     683.8   63.6    fdatasync
345     659.0   64.5    fdatasync

--- xfs and Hyperdisk Balanced

16 KB writes
w/s     MB/s    sync    test
2033    30.3    0.0     no-sync
2004    29.9    0.0     no-sync
-
870     13.0    635.0   fsync
858     12.8    645.0   fsync
-
1787    26.6    46.0    fdatasync
1727    25.7    49.6    fdatasync

2 MB writes
w/s     MB/s    sync    test
343     655.2   0.0     no-sync
343     655.2   0.0     no-sync
-
267     511.2   778.9   fsync
268     511.2   774.7   fsync
-
347     661.8   51.2    fdatasync
336     642.8   54.4    fdatasync

Results: hetz

Summary
  • this has an enterprise SSD with excellent (low) sync latency
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
37700   561.7   0.0     no-sync
37500   558.9   0.0     no-sync
-
25200   374.8   12.4    fsync
25100   374.8   12.4    fsync
-
27600   411.0   0.0     fdatasync
27200   404.4   9.8     fdatasync

2 MB writes
w/s     MB/s    sync    test
1833    3497.1  0.0     no-sync
1922    3667.8  0.0     no-sync
-
1393    2656.9  58.2    fsync
1355    2585.4  59.6    fsync
-
1892    3610.6  10.8    fdatasync
1922    3665.9  10.8    fdatasync

Results: ser7

Summary:
  • this has a consumer SSD with high sync latency
  • results had much variance (see the 2MB results below), as did results at higher concurrency. This is a great SSD, but not for my use case.
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
34000   506.4   0.0     no-sync
40200   598.9   0.0     no-sync
-
325     5.0     2974.2  fsync
333     5.1     2867.3  fsync
-
331     5.1     2783.2  fdatasync
330     5.0     2796.1  fdatasync

2 MB writes
w/s     MB/s    sync    test
362     691.4   0.0     no-sync
364     695.2   0.0     no-sync
-
67      128.7   10828.3 fsync
114     218.4   5396.8  fsync
-
141     268.9   3864.0  fdatasync
192     368.1   3508.2  fdatasync

Results: socket2

Summary:
  • this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger
Legend:
  • w/s - writes/s
  • MB/s - MB written/s
  • sync - latency per sync (fsync or fdatasync)
16 KB writes
w/s     MB/s    sync    test
49500   737.2   0.0     no-sync
49300   734.3   0.0     no-sync
-
44500   662.8   1.6     fsync
45400   676.2   1.5     fsync
-
46700   696.2   0.7     fdatasync
45200   674.2   0.7     fdatasync

2 MB writes
w/s     MB/s    sync    test
707     1350.4  0.0     no-sync
708     1350.4  0.0     no-sync
-
703     1342.8  139.1   fsync
703     1342.8  122.5   fsync
-
707     1350.4  2.2     fdatasync
707     1350.4  2.1     fdatasync


Introducing OpenEverest: An Independent Open Source Project for the Future of Data Platforms

Today, we are sharing an important step forward for the project many of you know as Percona Everest. Percona is transitioning Percona Everest into an independent open source project called OpenEverest (https://openeverest.io/). This change is about one thing: making the project stronger over the long term by building it in the open, with open governance, […]

Supabase Security Retro: 2025

A summary of Supabase platform security changes made in 2025 and the security defaults planned for 2026.

January 06, 2026

Implement multi-Region endpoint routing for Amazon Aurora DSQL

Applications using Aurora DSQL multi-Region clusters should implement a DNS-based routing solution (such as Amazon Route 53) to automatically redirect traffic between AWS Regions. In this post, we show you an automated solution for redirecting database traffic to alternate regional endpoints without requiring manual configuration changes, particularly in mixed data store environments.

Urgent Security Update: Patching “Mongobleed” (CVE-2025-14847) in Percona Server for MongoDB

At Percona, our mission has always been to provide the community with truly open-source, enterprise-class software. A critical part of that mission is ensuring that when security vulnerabilities arise in the upstream ecosystem, we respond with the urgency and transparency our users expect. As many in the MongoDB community are now aware, a security vulnerability—CVE-2025-14847, […]

Blog Post: Good Bye Percona Everest, Hello OpenEverest!

Over the past few years, we’ve been building Percona Everest with a clear goal in mind: to deliver a powerful yet approachable DBaaS experience on Kubernetes. Thanks to strong user and customer adoption, Everest has grown into a platform with thousands of production clusters deployed and overwhelmingly positive feedback from the community. As adoption grew, […]

January 05, 2026

3 Steps to Optimize Your Queries for Speed

When performance matters, how you store your application objects, write queries, and design indexes can be the difference between scanning a collection for minutes and getting results in milliseconds—without touching a single document.

In this article, we’ll walk through an example in MongoDB. We’ll start with a normalized model, move to an embedded model, add an index to avoid a full collection scan, and then refine both the index and query to read only what’s needed for the result. The execution plan and execution statistics will guide us at each step.

Scenario

Our domain is an e‑commerce order‑entry application with customers, products, orders, and order lines. We want to:

Find the most recent order in Switzerland containing the book “MongoDB Data Modeling and Schema Design”.

Initial model: normalized with references

If we build the data model without considering how applications will access it—or simply migrate from a relational database by creating one collection per table—we end up with a model where a single domain aggregate is split across multiple collections.

In this model, we insert a customer, some products (books), an order for this customer, and the corresponding order lines for those books:

// Customer  
db.customers.insertOne({  
  "_id": "C0001",  
  "name": "Franck Pachot",  
  "email": "franck.pachot@example.com",  
  "address": { "city": "Geneva", "country": "Switzerland" }  
})  

// Products  
db.products.insertOne({  
  "_id": "P0001",  
  "title": "MongoDB in Action, Third Edition: Building on the Atlas Data Platform",  
  "author": "Arek Borucki",  
  "price": 59.99, "category": "Database", "format": "Paperback"  
})  
db.products.insertOne({  
  "_id": "P0002",  
  "title": "MongoDB Data Modeling and Schema Design",  
  "author": ["Daniel Coupal", "Pascal Desmarets", "Steve Hoberman"],  
  "price": 54.99, "category": "Database", "format": "Paperback"  
})  
db.products.insertOne({  
  "_id": "P0003",  
  "title": "High Performance with MongoDB: Best Practices for Performance Tuning, Scaling, and Architecture",  
  "author": ["Asya Kamsky", "Ger Hartnett", "Alex Bevilacqua"],  
  "price": 49.99, "category": "Database", "format": "Paperback"  
})  

// Order  
db.orders.insertOne({  
  "_id": "O0001",  
  "customerId": "C0001",  
  "orderDate": ISODate("2024-10-22T09:00:00Z"),  
  "status": "Processing"  
})  

// Order lines  
db.orderlines.insertMany([  
  { "orderId": "O0001", "productId": "P0001", "quantity": 1 },  
  { "orderId": "O0001", "productId": "P0002", "quantity": 1 },  
  { "orderId": "O0001", "productId": "P0003", "quantity": 1 }  
])

This is a classic relational database pattern, but querying and optimizing it is complex because joins must occur before filtering.

In MongoDB, to rebuild the domain object aggregate, we use an aggregation pipeline with $lookup and $match:

db.orders.aggregate([  
  // Get customer info to filter by country  
  { $lookup: {  
      from: "customers",  
      localField: "customerId",  
      foreignField: "_id",  
      as: "customer"  
    }},  
  { $unwind: "$customer" },  

  // Get product info to filter by title  
  { $lookup: {  
      from: "orderlines",  
      let: { orderIdVar: "$_id" },  
      pipeline: [  
        { $match: { $expr: { $eq: ["$orderId", "$$orderIdVar"] } } },  
        { $lookup: {  
            from: "products",  
            localField: "productId",  
            foreignField: "_id",  
            as: "product"  
          }},  
        { $unwind: "$product" }  
      ],  
      as: "orderLines"  
    }},  

  // Apply filters  
  { $match: {  
      "customer.address.country": "Switzerland",  
      "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
    }},  

  // Sort and limit  
  { $sort: { orderDate: -1 } },  
  { $limit: 1 }  
])  

This query works, but it scans the entire collection. Indexes aren’t effective here because $match runs only after the $lookup and $unwind stages.

With one million orders, it would read all one million, perform three million lookups, filter on country and book title, sort a large intermediate result, and then return just one document.

That’s why joins can be slow: when data is split across many tables or collections, you lose the benefit of efficient index access—compound indexes can only target one collection. It also requires coding every query in the domain repository.

We need to rethink the data model.

Step 1: Embedding to fit the domain model

Knowing the database is for an order‑entry system, we can design a document model that matches the domain. An order should be a self‑contained document with all customer info and complete product details for each order line—because that’s how it’s displayed, printed, or emailed. If it’s a single logical document on paper, it should be a single document in MongoDB.

Here’s the embedded form:

db.orders.insertOne({  
  _id: 'O0002',  
  customerId: 'C0001',  
  orderDate: ISODate('2024-10-22T09:00:00.000Z'),  
  status: 'Processing',  
  customer: {  
       _id: 'C0001',  
       name: 'Franck Pachot',  
       email: 'franck.pachot@example.com',  
       address: { city: 'Geneva', country: 'Switzerland' }  
  },  
  orderLines: [  
       { quantity: 1, product: { _id: 'P0001', title: 'MongoDB in Action...', price: 59.99 } },  
       { quantity: 1, product: { _id: 'P0002', title: 'MongoDB Data Modeling and Schema Design', price: 54.99 } },  
       { quantity: 1, product: { _id: 'P0003', title: 'High Performance with MongoDB...', price: 49.99 } }  
  ]  
})

Duplicated data isn’t a concern here—documents are compressed on disk, and changes to product descriptions don’t affect completed orders.

Querying becomes far simpler:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1)

No lookups needed. But it still does a collection scan, filters, sorts, and finally returns one document.

I add one million documents and run the query with execution statistics:

db.orders.insertMany(Array.from({length: 333333},()=>({customer:{address:{country:"Switzerland"}}})))
db.orders.insertMany(Array.from({length: 666666},()=>({customer:{address:{country:"Germany"}}})))

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 559,
  totalKeysExamined: 0,
  totalDocsExamined: 1000001,
  executionStages: {
    isCached: false,
    stage: 'SORT',
    nReturned: 1,
    executionTimeMillisEstimate: 542,
    works: 1000004,
    advanced: 1,
    needTime: 1000002,
    needYield: 0,
    saveState: 29,
    restoreState: 29,
    isEOF: 1,
    sortPattern: { orderDate: -1 },
    memLimit: 104857600,
    limitAmount: 1,
    type: 'simple',
    totalDataSizeSorted: 0,
    usedDisk: false,
    spills: 0,
    spilledRecords: 0,
    spilledBytes: 0,
    spilledDataStorageSize: 0,
    inputStage: {
      stage: 'COLLSCAN',
      filter: {
        '$and': [
          { 'customer.address.country': { '$eq': 'Switzerland' } },
          {
            'orderLines.product.title': { '$eq': 'MongoDB Data Modeling and Schema Design' }
          }
        ]
      },
      nReturned: 1,
      executionTimeMillisEstimate: 516,
      works: 1000002,
      advanced: 1,
      needTime: 1000000,
      needYield: 0,
      saveState: 29,
      restoreState: 29,
      isEOF: 1,
      direction: 'forward',
      docsExamined: 1000001
    }
  }
}

In order to return a single document (nReturned: 1), one million documents were read (totalDocsExamined: 1000001), and sorted (stage: 'SORT', works: 1000004).

Now that we have everything in one collection, we can optimize with an index. Our objective is to reduce totalDocsExamined to nReturned.

Step 2: Indexing for filtering, sorting, and pagination

Our query filters by:

  • Equality on customer country
  • Equality on product title
  • A range on order date (due to sort().limit()).

An ideal compound index:

db.orders.createIndex({  
  "customer.address.country": 1,   // equality  
  "orderLines.product.title": 1,   // equality  
  "orderDate": 1                   // sort/range  
})  

Here is the execution plan:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 1,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 1,
    inputStage: {
      stage: 'FETCH',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 1,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      docsExamined: 1,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 1,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: {
          'customer.address.country': 1,
          'orderLines.product.title': 1,
          orderDate: 1
        },
        indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1',
        isMultiKey: true,
        multiKeyPaths: {
          'customer.address.country': [],
          'orderLines.product.title': [ 'orderLines' ],
          orderDate: []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'backward',
        indexBounds: {
          'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
          'orderLines.product.title': [
            '["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
          ],
          orderDate: [ '[MaxKey, MinKey]' ]
        },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 1,
        dupsDropped: 0
      }
    }
  }
}

With this index, the query jumps straight to the target document:

  • totalKeysExamined: 1 from one range (seeks: 1 in stage: 'IXSCAN')
  • totalDocsExamined: 1 from stage: 'FETCH' without the need to sort
  • nReturned: 1

The result is retrieved in sorted order directly from the index, minimizing the number of documents to read. We can do better. Our next goal is to eliminate totalDocsExamined.

Step 3: Projection and covering index

If the application only needs the customer’s name and country, we can project:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  },  
  { "customer.name": 1, "customer.address.country": 1, _id: 0 }  
)  

[
  {
    customer: { name: 'Franck Pachot', address: { country: 'Switzerland' } }
  }
]

Now, we can create a covering index:

db.orders.createIndex({  
  "customer.address.country": 1,  
  "orderLines.product.title": 1,  
  "orderDate": 1,  
  "customer.name": 1  
})  

Here is the execution plan:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  },  
  { "customer.name": 1, "customer.address.country": 1, _id: 0 }  
).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 0,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 1,
    inputStage: {
      stage: 'PROJECTION_DEFAULT',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 1,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      transformBy: { 'customer.name': 1, 'customer.address.country': 1, _id: 0 },
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 1,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: {
          'customer.address.country': 1,
          'orderLines.product.title': 1,
          orderDate: 1,
          'customer.name': 1
        },
        indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1_customer.name_1',
        isMultiKey: true,
        multiKeyPaths: {
          'customer.address.country': [],
          'orderLines.product.title': [ 'orderLines' ],
          orderDate: [],
          'customer.name': []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
          'orderLines.product.title': [
            '["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
          ],
          orderDate: [ '[MinKey, MaxKey]' ],
          'customer.name': [ '[MinKey, MaxKey]' ]
        },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 1,
        dupsDropped: 0
      }
    }
  }
}

This is the optimal plan, where all documents returned (nReturned: 1) come from the index entries that were scanned (totalKeysExamined: 1) without any FETCH (totalDocsExamined: 0).

Conclusion

MongoDB query performance isn’t just about adding indexes—it’s about aligning your data model and queries with how your application really uses data. Turning a four-collection join with millions of lookups into a single index scan that runs in milliseconds comes down to three rules:

  1. Model for Access: Embed data that’s accessed together. If it belongs on one piece of paper—like an order with its customer and product details—it belongs in one document.
  2. Index for Your Query: Use compound indexes that follow your filter, sort, and projection. Equality fields first, range fields last. This turns full scans into targeted index scans, cutting totalDocsExamined from millions to one.
  3. Project Only What You Need: Return only required fields. When a query is fully covered by an index, MongoDB can achieve totalDocsExamined: 0 and return the expected result fast.

You don’t need to apply this to all queries—use these steps only when necessary. Excessive lookups and limited indexing options drove us to optimize the data model. After that, the remaining need for selective filters and the absence of a suitable index led to creating ... (truncated)

Are Database System Researchers Making Correct Assumptions about Transaction Workloads?

In this blog, we have reviewed quite a number of deterministic database papers, including Calvin, SLOG, and Detock, which aim to achieve higher throughput and lower latency. The downside of these systems is sacrificing transaction expressivity. They rely on two critical assumptions: first, that transactions are "non-interactive", meaning they are sent as a single request (one-shot) rather than engaging in a multi-round-trip conversation with the application, and second, that the database can know a transaction's read/write set before execution begins (to lock data deterministically).

So when these deterministic database researchers write a paper to validate how well these assumptions hold in the real world, we should read it with some skepticism. Don't get me wrong: this is a great and valuable paper, but we still need to be critical in our reading.


Summary

The study employed a semi-automated annotation tool to analyze 111 popular open-source web applications and over 30,000 transactions. The authors target applications built with Django (Python) and TypeORM (TypeScript). These Object-Relational Mappers (ORMs) abstract away the SQL layer, allowing developers to interact with a wide variety of underlying relational databases (such as PostgreSQL, MySQL, and SQLite) using standard application code. The authors categorized transactions based on two primary dimensions: Transaction Interactivity and Read/Write Set Inferability.


Transaction Interactivity

The study found that interactive transactions, where the application holds a transaction open while performing logic or reads, are surprisingly rare. 39% of applications contain no interactive transactions. For the remaining 61% of applications, only roughly 9.6% of the workload is interactive.


The authors then distinguish between "Strictly Interactive" transactions, which inherently require external input or side effects mid-flight (e.g., waiting for an API response or user prompt), and "Non-Strictly Interactive" ones, where the back-and-forth is merely deterministic application logic processing reads between queries without outside dependencies. They find strictly interactive transactions account for only 0.5% of the total workload. They argue that the vast majority of remaining interactive transactions are convertible to one-shot transactions with minimal code changes.


Read/Write Set Inferability

The authors tested whether a database could statically predict which rows a transaction would touch. They found that for 90% of applications, at least 58% of transactions have read/write sets that are fully inferable in advance.

The primary obstacle to perfect knowledge is the CDA Mismatch (Conflict Detection Attribute Mismatch), which occurs in roughly 27% of transactions. This happens when a transaction queries data using a secondary attribute (e.g., 'email') rather than the primary locking key (e.g., 'ID'). See Figure 1 above. When this happens, the database cannot infer the lock without first inspecting the data. However, the study remains upbeat about this. Since CDA mismatches typically occur in simple single-statement transactions, they say that the "fallback" mechanism (running a lightweight reconnaissance query to find the ID) is inexpensive.



Critical Review

This paper provides very valuable empirical data on transaction workloads by surveying a large number of applications. It echoes the scale of the SIGMOD 2015 "Feral Concurrency Control" study, which examined the integrity risks of the mismatch between databases and modern ORM frameworks like Rails. However, we must examine this study critically to identify potential limitations in scope and bias. My two big criticisms are the following.

Selection Bias. The choice of the corpus, open-source web applications using ORMs, heavily skews the results and excludes almost all enterprise systems. The study also ignores ad-hoc transactions executed by human operators (DBAs, analysts) via terminals. These transactions are inherently interactive ("user think time") and often performance-critical.

Convertibility Optimism. The paper argues that "Non-Strictly Interactive" transactions are easily convertible to deterministic models without resorting to brittle Stored Procedures. They advocate a reconnaissance/scout query pattern, where the application performs a dry-run read phase to optimistically build a write-set before sending it as a single atomic batch (see the sketch below). While the pattern is promising, the claim of minimal code changes seems very optimistic. That said, I think this conversion challenge is a great opportunity for research and for collecting empirical evidence of the actual engineering costs.
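
To make the conversion concrete, here is a minimal sketch of the reconnaissance/scout query pattern. This is my illustration, not code from the paper, and it assumes a hypothetical users table with id, email, and credits columns, accessed with psycopg2. The scout read resolves the secondary attribute (email) to the primary key, and the one-shot write re-validates it, which also illustrates the fallback for the CDA mismatch discussed above.

import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

def deduct_credits(email, amount):
    # Step 1: reconnaissance/scout read in its own short transaction,
    # resolving the secondary attribute (email) to the primary key (id)
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT id FROM users WHERE email = %s", (email,))
            row = cur.fetchone()
    if row is None:
        return False
    user_id = row[0]

    # Step 2: one-shot write keyed on the primary key. The WHERE clause
    # re-validates the scouted attribute and the balance, so a stale scout
    # read cannot silently update the wrong row or overdraw the account.
    with conn:  # single transaction, committed on exit
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE users
                      SET credits = credits - %s
                    WHERE id = %s AND email = %s AND credits >= %s""",
                (amount, user_id, email, amount),
            )
            return cur.rowcount == 1

A deterministic engine could accept the second step as a one-shot transaction whose write set (the row keyed by id) is known up front; the open question the paper underplays is the engineering cost of restructuring existing application code into this shape.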

JavaScript Stored Routines in Percona Server for MySQL: A New Era for Database Programmability

For decades, we’ve accepted a painful compromise: if you wanted logic inside the database, you had to write SQL/PSM (Persistent Stored Modules). It’s clunky, hard to debug, and declarative by nature, making it terrible for algorithmic tasks. That ends with Percona Server 8.4.7-7. We are introducing JS Stored Programs as a Tech Preview. Unlike Oracle’s […]

January 04, 2026

Too Close to Our Own Image?

Recent work suggests we may be projecting ourselves onto LLMs more than we admit.

A paper in Nature reports that GPT-4 exhibits "state anxiety". When exposed to traumatic narratives (such as descriptions of accidents or violence), the model's responses score much higher on a standard psychological anxiety inventory. The jump is large, from "low anxiety" to levels comparable to highly anxious humans. The same study finds that therapy works: mindfulness-style relaxation prompts reduce these scores by about a third, though not back to baseline. The authors argue that managing an LLM's emotional state may be important for safe deployment, especially in mental health settings and perhaps in other mission-critical domains.

Another recent paper argues that LLMs can develop a form of brain rot. Continual training on what the authors call junk data (short, viral, sensationalist content typical of social media) leads to models developing weaker reasoning, poorer long-context handling, and worse safety compliance. Those models also score higher on dark personality traits like narcissism and psychopathy, as measured by the TRAITS benchmark. The primary mechanism behind this decline was identified as thought-skipping, where the model truncates its own reasoning chains, echoing the fragmented attention patterns associated with internet addiction in humans.

In my previous post, I discussed how agentic system techniques parallel self-improvement advice. I noted that successful agents rely on habits that look suspiciously like human self-help techniques. These models write things down, using scratchpads and memory buffers to offload working memory. They think in loops, iterating through write-reason-repeat cycles to decompose hard problems. They role-play, adopting constrained personas like "The Critic" or "The Engineer" to constrain the search and behavior space, much like the alter ego effect in humans.

This makes me wonder: will we need psychologists and life coaches for AI in the future? Could this become a vocation or career?

If models can suffer from anxiety that responds to mindfulness prompts, and from cognitive decline that responds to changes in their data diet, we are already treating them as if they were biological minds rather than machines. Perhaps the "Prompt Engineer" of today is just the precursor to the "AI Psychologist" of tomorrow, a professional whose job is to keep the model sane, focused, and "healthy".

findOneAndUpdate(): UPDATE ... RETURNING in MongoDB, Idempotent and ACID, with

You can implement atomic read-write operations in a single call instead of an explicit transaction, reducing both round trips and the window for concurrent conflicts. In PostgreSQL, use UPDATE ... RETURNING instead of SELECT FOR UPDATE followed by UPDATE. In MongoDB, use findOneAndUpdate() instead of updateOne() followed by find(). This enables a single ACID read-write operation that is failure-resilient and safely retryable in MongoDB because it is idempotent. Here is an example.

Demo: two withdrawals and a receipt

Bob withdraws using a debit card (no negative balance allowed). The application first does updateOne to subtract the amount only if the balance is enough, then does a separate find() to print the receipt with the balance. Between those two calls, Alice uses a credit card (allowing a negative balance) and withdraws money from the same account, so Bob's printed balance becomes inconsistent with the balance checked for his withdrawal.

To fix this, we'll use findOneAndUpdate() to return the post-update balance atomically with the withdrawal.

Connection and collection

Here’s a Python + PyMongo program simulating this race condition and showing the fix:

from pymongo import MongoClient, ReturnDocument
import threading
import time

# Connect to MongoDB
client = MongoClient("mongodb://127.0.0.1:27017/?directConnection=true")
db = client.bank
accounts = db.accounts

# Prepare test account: Bob & Alice share this account
def reset_account():
    accounts.drop()
    accounts.insert_one({
        "_id": "b0b4l3ce",
        "owner": ["Bob", "Alice"],
        "balance": 100
    })

reset_account()

I have only one document for this example:

bank> db.accounts.find()

[ { _id: 'b0b4l3ce', owner: [ 'Bob', 'Alice' ], balance: 100 } ]

The balance is set at 100.

Scenario with updateOne() followed by find()

Bob's withdrawal process:


def bob_withdraw_debit(amount):
    print("[Bob] Attempting debit withdrawal", amount)
    # Application logic checks balance then updates  
    result = accounts.update_one(  
        {"_id": "b0b4l3ce", "balance": {"$gte": amount}},   # must have enough money  
        {"$inc": {"balance": -amount}}  
    )  

    # If no document was updated, the filter didn't match: insufficient funds
    if result.modified_count == 0:  
        print("[Bob] Withdrawal denied - insufficient funds")  
        return  

    # Simulate some processing delay before printing the ticket, to reliably reproduce the race condition
    time.sleep(1)

    # query the balance for the receipt
    balance = accounts.find_one({"_id": "b0b4l3ce"})["balance"]  
    print(f"[Bob] Debit card ticket: withdrew {amount}, balance after withdrawal: {balance}")  

Alice's withdrawal process:

# Alice withdraws using credit card (negative balance allowed)  
def alice_withdraw_credit(amount, delay=0):  
    time.sleep(delay)  # let Bob start first  
    print("[Alice] Attempting credit withdrawal", amount) 
    # There's no balance check for credit cards
    accounts.update_one(  
        {"_id": "b0b4l3ce"},  
        {"$inc": {"balance": -amount}}  
    )  
    print("[Alice] Credit withdrawal complete")  

Demo script that runs both withdrawals in separate threads to trigger the race condition:

def demo():
    reset_account()
    t_bob = threading.Thread(target=bob_withdraw_debit, args=(80,))
    t_alice = threading.Thread(target=alice_withdraw_credit, args=(30, 0.5))  # starts just after Bob's update
    t_bob.start()
    t_alice.start()
    t_bob.join()
    t_alice.join()

The output is:

>>> demo()

[Bob] Attempting debit withdrawal 80
[Alice] Attempting credit withdrawal 30
[Alice] Credit withdrawal complete
[Bob] Ticket: withdrew 80, balance after withdrawal: -10

Bob received a ticket showing a withdrawal with a negative balance from his debit card — this is a bug ❌

We must get the balance as of the update itself: either the before-image (and subtract the amount ourselves) or the after-image.

Scenario with findOneAndUpdate() and returnDocument: "after"

Bob's withdrawal process:

def bob_withdraw_debit(amount):  
    print("[Bob] Attempting debit withdrawal", amount)  
    doc = accounts.find_one_and_update(  
        {"_id": "b0b4l3ce", "balance": {"$gte": amount}},  
        {"$inc": {"balance": -amount}},  
        return_document=ReturnDocument.AFTER  # get post-update document atomically  
    )  
    # No need to check an update count: doc is None if no document matched the filter
    if not doc:  
        print("[Bob] Withdrawal denied - insufficient funds")  
        return  
    # Ticket immediately shows consistent balance  
    print(f"[Bob] Ticket: withdrew {amount}, balance after withdrawal: {doc['balance']}")

Running the demo again:

>>> demo()

[Bob] Attempting debit withdrawal 80
[Bob] Ticket: withdrew 80, balance after withdrawal: 20
[Alice] Attempting credit withdrawal 30
[Alice] Credit withdrawal complete

Bob received a ticket showing the balance at the exact time of withdrawal ✅

The update write and post-update read occurred as a single atomic operation on the document, leaving no opportunity for another write between the update and the displayed read result.

After-image is saved for consistency and resilience

In MongoDB, reads and writes do not acquire transaction locks the way they can in relational databases, but updates are still atomic at the document level, even without starting an explicit transaction. MongoDB uses lightweight document-level locks internally to ensure ACID guarantees for a single document, since even one update can involve multiple internal reads and writes, such as enforcing unique constraints and updating indexes.

updateOne() returns only metadata (such as the number of documents updated), whereas findOneAndUpdate() returns the updated document itself, with the read and write occurring in the same atomic operation at the single-document level. This atomicity is preserved even in the event of failures.

If the network is lost or the primary crashes and a secondary is promoted, the MongoDB driver retries the operation as part of retryable writes. Because retries must be idempotent, find_one_and_update() returns the same document image on retry.

To support this, MongoDB stores a document image—either the after image (as in this example, using returnDocument: "after") or the before image—in an internal system collection (config.image_collection) that is replicated independently of the oplog, as part of the same transaction:

bank> use config

config> db.image_collection.find()
[
  {
    _id: {
      id: UUID('d04e10d6-c61d-42ad-9a44-5bb226a898d8'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNum: Long('15'),
    ts: Timestamp({ t: 1767478784, i: 5 }),
    imageKind: 'postImage',
    image: { _id: 'b0b4l3ce', owner: [ 'Bob', 'Alice' ], balance: 20 },
    invalidated: false
  }
]
config>

When retryable writes are enabled, this image collection is used internally to make such operations resilient to failures. The retries are handled transparently by the driver and server, not by the application, and preserve the strongest consistency.
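
For illustration, here is how the demo's connection could make the option explicit. Retryable writes are enabled by default in recent drivers and require a replica set, which I assume the demo's single-node deployment provides (the config.image_collection output above implies one):

from pymongo import MongoClient

# Retryable writes are on by default; shown explicitly for illustration.
# If find_one_and_update() is interrupted by a failover or network error,
# the driver retries it once, and the stored post-image lets the server
# return the same document to the retried call.
client = MongoClient(
    "mongodb://127.0.0.1:27017/?directConnection=true",
    retryWrites=True,
)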

Comparison with PostgreSQL

In PostgreSQL, you would use the following queries to do the equivalent:

CREATE TABLE accounts (  
    id TEXT PRIMARY KEY,  
    balance NUMERIC,  
    owners TEXT[]  
);  
INSERT INTO accounts (id, balance, owners) VALUES ('b0b4l3ce', 100, ARRAY['Bob','Alice']);

-- Alice transaction  
UPDATE accounts    
SET balance = balance - 30    
WHERE id = 'b0b4l3ce';

-- Bob transaction  
UPDATE accounts    
SET balance = balance - 80    
WHERE id = 'b0b4l3ce' AND balance >= 80    
RETURNING balance AS new_balance;

PostgreSQL drivers do not automatically retry failed transactions; the database relies on MVCC and locks to ensure ACID properties.

With Repeatable Read isolation—appropriate because the write (SET balance = balance - 80) depends on the read (WHERE balance >= 80)—Bob's transaction takes a snapshot at the start and, when run concurrently with Alice's, still sees balance = 100. If Alice commits first and reduces the balance to 70, Bob's transaction fails with: ERROR: could not serialize access due to concurrent update. The application must handle this by retrying the entire transaction. The driver does not do this automatically.
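
For comparison with MongoDB's driver-level retries, here is a minimal psycopg2 sketch of that application-side retry loop under Repeatable Read, reusing the accounts table defined above. This is my illustration, not code from the original post, and the connection string is hypothetical:

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

conn = psycopg2.connect("dbname=bank")  # hypothetical connection string
conn.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

def bob_withdraw_debit_pg(amount, max_retries=3):
    for attempt in range(max_retries):
        try:
            with conn:  # one transaction per attempt
                with conn.cursor() as cur:
                    cur.execute(
                        """UPDATE accounts
                              SET balance = balance - %s
                            WHERE id = %s AND balance >= %s
                        RETURNING balance""",
                        (amount, 'b0b4l3ce', amount),
                    )
                    row = cur.fetchone()
                    # None means no row matched: insufficient funds
                    return row[0] if row else None
        except errors.SerializationFailure:
            # Concurrent update detected: retry the whole transaction
            continue
    raise RuntimeError("withdrawal failed after retries")

Each attempt runs against a fresh snapshot; MongoDB's findOneAndUpdate() reaches the same outcome without an application-side loop because the check, update, and read happen in one atomic document operation.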

With the default Read Committed isolation level, Bob’s transaction waits if Alice’s update has locked the row. After Alice commits, PostgreSQL reevaluates Bob’s WHERE clause. The balance is now 70, which no longer satisfies balance >= 80, so the UPDATE affects 0 rows and the withdrawal is refused, preventing a negative balance. This is fine when only one row is affected, but in a multi-row statement it could be inconsistent, operating on rows from different transactional states.

Conclusion

MongoDB supports both multi-document transactions and single-document atomic operations, but strongly encourages using single-document operations when possible. If you design your schema so that business logic fits in a single document, findOneAndUpdate() can perform conditional checks, apply updates, and return the updated document atomically, avoiding race condition anomalies and enabling idempotent retries.

In some cases—such as the one described in the earlier blog post on FOR UPDATE SKIP LOCKED in MongoDB—an updateOne() followed by a find() with appropriate conditions is enough and avoids storing before- or after-images.

Databases in 2025: A Year in Review

The world tried to kill Andy off but he had to stay alive to talk about what happened with databases in 2025.