January 12, 2026
January 10, 2026
Which Document class is best to use in Java to read MongoDB documents?
TL;DR: the answer is in the title, use Document.
BSON is a serialization format, similar to protobuf, designed for efficient document storage on disk and transfer over the network as a byte stream. Instead of scanning and rewriting the entire byte sequence to access its contents (fields, arrays, subdocuments), you work with an in-memory object that exposes methods to read and write fields efficiently. On MongoDB's server side, that's the mutable BSON object. On the client side, the drivers provide a similar API. Here are the five object types that implement the BSON interface in Java:
- Document: the recommendation for most applications. It provides the best balance of flexibility, ease of use, and functionality.
Only consider the other classes when you have specific requirements:
- BsonDocument: when you need strict BSON type safety rather than application types.
- RawBsonDocument: when you need to access the raw bytes of the document rather than field values.
- JsonObject: when working exclusively with JSON strings, as plain text.
- BasicDBObject: only for legacy code migration.
RawBsonDocument is used internally—for example, for client-side encryption and change streams—whereas the other classes can all be used directly in MongoDB driver operations. Your choice mainly impacts how you construct, manipulate, and access document data in your application code. All of them are documented in the Java driver API reference.
I reviewed the documentation and code to better understand their differences, especially when comparing with other databases that have different designs and APIs. In this post, I provide a detailed description of these classes: what they are, how they are implemented, what they provide, and when to use them.
Detailed description
Here's a detailed comparison of the five document classes available in the MongoDB Java Driver and when to use each:
Document (org.bson)
Document is a flexible representation of a BSON document that implements Map<String, Object>. It uses a LinkedHashMap<String, Object> internally to maintain insertion order.
Key characteristics:
- Loosely-typed: Values are stored as Object, allowing you to use standard Java types (String, Integer, Date, etc.)
- Flexible: Easy to work with dynamically structured documents
- Map interface: Provides all standard Map operations
Document is the recommended default choice for most applications. Use it when you want a flexible and concise data representation that's easy to work with.
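For illustration, here is a minimal sketch of working with Document; the field names and values are made up for the example:

import org.bson.Document;

import java.util.Date;
import java.util.List;

public class DocumentExample {
    public static void main(String[] args) {
        // Build a document with plain Java types; insertion order is preserved.
        Document order = new Document("_id", "O0001")
                .append("status", "Processing")
                .append("orderDate", new Date())
                .append("tags", List.of("book", "database"));

        // Typed getters return the underlying Java objects (or null if the key is absent).
        String status = order.getString("status");
        Date orderDate = order.getDate("orderDate");
        System.out.println(status + " " + orderDate + " " + order.toJson());
    }
}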
BsonDocument (org.bson)
BsonDocument is a type-safe container for BSON documents that implements Map<String, BsonValue>. It also uses a LinkedHashMap, but stores BsonValue types.
Key characteristics:
- Type-safe: All values must be wrapped in BSON library types (BsonString, BsonInt32, BsonDocument, etc.)
- Stricter API: Provides compile-time type safety but requires more verbose code
- Map interface: Implements Map<String, BsonValue>
Use BsonDocument when you need a type-safe API and want explicit control over BSON types. This is particularly useful when you need to ensure precise type handling or when working with APIs that require BsonDocument.
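As a sketch of the more verbose BsonDocument API (again, field names and values are illustrative):

import org.bson.BsonDocument;
import org.bson.BsonInt32;
import org.bson.BsonString;

public class BsonDocumentExample {
    public static void main(String[] args) {
        // Every value must be a BsonValue subtype, which is verbose but explicit about BSON types.
        BsonDocument filter = new BsonDocument("status", new BsonString("Processing"))
                .append("quantity", new BsonInt32(1));

        // Accessors return BsonValue wrappers rather than plain Java objects.
        String status = filter.getString("status").getValue();
        int quantity = filter.getInt32("quantity").getValue();
        System.out.println(status + " " + quantity + " " + filter.toJson());
    }
}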
RawBsonDocument (org.bson)
RawBsonDocument is an immutable BSON document represented using only raw bytes. It stores the BSON document as a byte array without parsing it.
Key characteristics:
- Immutable: All mutation operations throw UnsupportedOperationException
- Lazy parsing: Data is only parsed when accessed, making it very efficient for pass-through scenarios, though not when accessing each individual field
- Memory efficient: Stores raw bytes, avoiding object allocation overhead
- Can decode to other types: Provides a decode() method to convert to other document types when needed
Use RawBsonDocument when you need maximum performance and memory efficiency for whole-document operations. It is particularly useful when reading documents from MongoDB and passing them to another system unchanged, when working with large documents that you don’t need to parse, when building high-performance data pipelines where parsing overhead matters, and when you need an immutable document representation.
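A minimal sketch of that pass-through style, assuming the standard DocumentCodec; the field names are made up:

import org.bson.Document;
import org.bson.RawBsonDocument;
import org.bson.codecs.DocumentCodec;

public class RawBsonDocumentExample {
    public static void main(String[] args) {
        // Encode a Document into raw BSON bytes once.
        Document source = new Document("name", "Franck").append("country", "Switzerland");
        RawBsonDocument raw = new RawBsonDocument(source, new DocumentCodec());

        // Field access scans the bytes lazily; nothing else is deserialized.
        String name = raw.getString("name").getValue();

        // Decode the whole document only when full, mutable access is needed.
        Document decoded = raw.decode(new DocumentCodec());
        System.out.println(name + " " + decoded.getString("country"));
    }
}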
JsonObject (org.bson.json)
JsonObject is a wrapper class that holds a JSON object string. It simply stores the JSON as a String.
Key characteristics:
- Does not implement Map: It's just a string wrapper with validation
- No parsing required: Avoids conversion to a Map structure if you're just working with JSON
- JSON-focused: Designed for applications that primarily work with JSON strings
- Supports Extended JSON: Works with MongoDB Extended JSON format
Use JsonObject when you want to work directly with JSON strings and avoid the overhead of converting to and from Map objects. This is ideal for REST APIs that consume and produce JSON, for logging or persisting documents as JSON strings, and for applications that primarily handle JSON and do not require programmatic field-level access.
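A small illustrative sketch (the JSON payload is made up):

import org.bson.json.JsonObject;

public class JsonObjectExample {
    public static void main(String[] args) {
        // Wrap a JSON string as-is; the constructor validates that it is a JSON object
        // but does not parse it into fields.
        JsonObject json = new JsonObject("{\"name\": \"Franck\", \"country\": \"Switzerland\"}");

        // The only accessor is the original string, ready to hand to a REST response or a log line.
        String payload = json.getJson();
        System.out.println(payload);
    }
}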
BasicDBObject (com.mongodb)
BasicDBObject is a legacy BSON object implementation that extends BasicBSONObject and implements the DBObject interface.
Key characteristics:
- Legacy class: Exists for backward compatibility with older driver versions
- Does not implement Map: Only implements the DBObject interface, lacking modern Map convenience methods
- Binary compatibility concerns: Implements an interface rather than extending a class, which can cause compatibility issues
Only use BasicDBObject when migrating from a legacy driver version (pre-3.0). The documentation explicitly recommends avoiding this class for new development due to its limitations.
Conversion Between Types
All of these classes implement the Bson interface, which allows them to be used interchangeably in MongoDB operations (though not with the same performance). You can also convert between types, as sketched below:
- BsonDocument.parse(json) to create a BsonDocument from JSON
- RawBsonDocument.decode(codec) to convert a RawBsonDocument to another type
- Document can be converted to BsonDocument through codec operations
- JsonObject.getJson() to extract the JSON string
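Here is a minimal sketch of those conversions, using the driver's default codec registry; the names and JSON payloads are illustrative:

import com.mongodb.MongoClientSettings;
import org.bson.BsonDocument;
import org.bson.Document;

public class ConversionExample {
    public static void main(String[] args) {
        // JSON string -> BsonDocument
        BsonDocument bson = BsonDocument.parse("{\"name\": \"Franck\"}");

        // Document -> BsonDocument through the driver's default codec registry
        Document doc = Document.parse("{\"name\": \"Franck\"}");
        BsonDocument converted =
                doc.toBsonDocument(BsonDocument.class, MongoClientSettings.getDefaultCodecRegistry());

        // BsonDocument -> JSON string
        String json = converted.toJson();
        System.out.println(bson.equals(converted) + " " + json);
    }
}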
Comparing RawBsonDocument and Document
RawBsonDocument stores BSON as a raw byte array and does not deserialize it in advance. When you access a field, it creates a BsonBinaryReader and scans the document sequentially, reading each field’s type and name until it finds the requested key. Only the matching field is decoded using RawBsonValueHelper.decode, while all other fields are skipped without parsing. For nested documents and arrays, it reads only their sizes and wraps the corresponding byte ranges in new RawBsonDocument or RawBsonArray instances, keeping their contents as raw bytes. This approach provides fast access for a single field lookup, while being memory-efficient and keeping the document immutable, which is ideal for large documents where only a few fields are needed or for documents that are mostly passed through without inspection.
In contrast, Document uses a fully deserialized LinkedHashMap<String, Object>. When a Document is created, all fields are eagerly parsed into Java objects. Field access and containsKey operations are simple HashMap lookups, and the document is fully mutable, supporting standard map operations such as put, remove, and clear. This design consumes more memory but is better suited to small or medium-sized documents, scenarios where many fields are accessed, or cases where the document needs to be modified frequently.
Finally, Document doesn't use RawBsonDocument for parsing or accessing fields since doing so would be inefficient, and the two serve different purposes.
Comparison with Oracle (OSON) and PostgreSQL (JSONB)
Neither Oracle nor PostgreSQL provides BSON as they use OSON and JSONB, so there's no BsonDocument or RawBsonDocument equivalent.
In Oracle’s JDBC driver, the closest equivalent to a Document is OracleJsonObject, one of the OracleJsonValue types, which can be exposed directly as a javax.json.JsonObject or mapped domain object. This API works directly on the underlying OSON bytes without fully parsing the entire document into an intermediate data structure. OSON is more than a raw serialization: it carries its own local dictionary of distinct field names, a sorted array of hash codes for those names, and compact field ID arrays and value offsets, enabling the driver to locate a field in place by binary‑searching the field ID array, and jumping to the right offset.
If JSON text is needed instead of the object model, the equivalent is simply to use ResultSet.getString(), which will convert the OSON image to JSON text on the client.
PostgreSQL’s JDBC driver, by contrast, offers no native Java JSON object API for either json or jsonb columns: values are always returned as text, so the application must parse the string into its own document model using a separate JSON library. Even when using PostgreSQL’s binary JSONB storage, none of the binary efficiency crosses the wire (See JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols), and client code still performs a full parse before accessing individual fields.
Conclusion
MongoDB’s main advantage for modern applications—whatever the data types or workloads—is the ability to work with data directly through your domain objects, without an intermediate object-relational mapping layer or view. Use the Document class as your document object model (DOM). It offers flexibility, map-style access, and a natural Java interface, while the driver transparently converts BSON from the network into objects your application can use immediately.
January 09, 2026
Cloudspecs: Cloud Hardware Evolution Through the Looking Glass
This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware.
TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly.
CPU Trends
Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance u7in now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS Graviton. Without Graviton, the gain drops to roughly 2x. For in-memory database benchmarks, gains were even lower (2x–2.5x), likely due to memory and cache latency bottlenecks.
On-prem hardware comparison shows that this stagnation is not cloud price gouging. Historically, Moore's Law and Dennard scaling doubled cost-performance every two years (which would have compounded to a 32x gain over a decade). However, an analysis of on-premise AMD server CPUs reveals a similar slump: only a 1.7x gain from 2017 to 2025.
Memory Trends
DRAM capacity per dollar has effectively flatlined. The only significant improvement was the 2016 introduction of memory-optimized x instances, which offered ~3.3x more GiB-hours/$ than compute-optimized peers. While absolute single-socket bandwidth jumped ~5x (93 GiB/s to 492 GiB/s) as servers moved from DDR3 to DDR5, the cost-normalized gain is only 2x.
Historical data suggests commodity DRAM prices dropped 3x over the decade. But in the last three months, due to AI-driven demand, DDR5 prices rose sharply, further limiting effective memory gains.
Network Trends
We have good news here, finally. Network bandwidth per dollar exploded by 10x. And absolute speeds went from 10 Gbit/s to 600 Gbit/s (60x).
These gains were not universal though. Generic instances saw little change. The gains were driven by network-optimized n instances (starting with the c5n in 2018) powered by proprietary Nitro cards.
NVMe Trends
NVMe SSDs are the biggest surprise. Unlike CPUs and memory, where cloud trends mirror on-prem hardware, NVMe performance in AWS has largely stagnated. The first NVMe-backed instance family, i3, appeared in 2016. As of 2025, AWS offers 36 NVMe instance families. Yet the i3 still delivers the best I/O performance per dollar by nearly 2x.
SSD capacity has stagnated since 2019 and I/O throughput since 2016. This sharply contrasts with on-prem hardware, where SSD performance doubled twice (PCIe 4 and PCIe 5) in the same timeframe. The gap between cloud and on-premise NVMe is widening rapidly.
This price/performance gap likely explains the accelerating push toward disaggregated storage. When local NVMe is expensive and underperforming, remote storage starts to look attractive. The paper speculates that with network speeds exploding and NVMe stagnating, architectures may shift further. For systems like Snowflake, using local NVMe for caching might no longer be worth the complexity compared to reading directly from S3 with fast networks.
Discussion
I think the main takeaway is that uniform hardware scaling in the cloud is over. Moore's Law no longer lifts all boats. Performance gains now come from specialization, especially networking (e.g., Graviton, Nitro, Accelerators).
In my HPTS 2024 review, I noted that contrary to the deafening AI hype, the real excitement in the hallways was about hardware/software codesign. This paper validates that sentiment. With general-purpose CPU and memory cost-performance stagnating, future databases must be tightly integrated with specialized hardware and software capabilities to provide value. I think the findings here will refuel that trend.
A key open question is why massive core counts deliver so little value. Where is the performance lost? Possible explanations include memory bandwidth limits, poor core-to-memory balance, or configuration mismatches. But I think the most likely culprit is software. Parallel programming remains hard, synchronization is expensive, and many systems fail to scale beyond a modest number of cores. We may be leaving significant performance on the table simply because our software cannot effectively utilize the massive parallelism now available.
The paper comes with an interactive tool, Cloudspecs, built on DuckDB-WASM (yay!). This allows you to run SQL queries over the dataset directly in the browser to visualize these trends. The figures in the PDF actually contain clickable link symbols that take you to the specific query used to generate that chart. Awesome reproducibility!
Aleksey and I did a live-reading of the paper. As usual, we had a lot to argue about. I'll add a recording of our discussion on YouTube when it becomes available, and here is a link to my annotated paper.
Best Practices to Backfill Materialized Views in ClickHouse® Safely
January 08, 2026
Unlock Amazon Aurora’s Advanced Features with Standard JDBC Driver using AWS Advanced JDBC Wrapper
Postgres vs tproc-c on a small server
This is my first post with results from tproc-c using HammerDB. This post has results for Postgres.
tl;dr - across 8 workloads (low and medium concurrency, cached database to IO-bound)
- there might be a regression for Postgres 14.20 and 15.15 in one workload
- there are improvements, some big, for Postgres 17 and 18 in most workloads
The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.
Benchmark
The benchmark was run for several workloads:
- vu=1, w=100 - 1 virtual user, 100 warehouses
- vu=6, w=100 - 6 virtual users, 100 warehouses
- vu=1, w=1000 - 1 virtual user, 1000 warehouses
- vu=6, w=1000 - 6 virtual users, 1000 warehouses
- vu=1, w=2000 - 1 virtual user, 2000 warehouses
- vu=6, w=2000 - 6 virtual users, 2000 warehouses
- vu=1, w=4000 - 1 virtual user, 4000 warehouses
- vu=6, w=4000 - 6 virtual users, 4000 warehouses
- stored procedures are enabled
- partitioning is used when the warehouse count is >= 1000
- a 5 minute rampup is used
- then performance is measured for 120 minutes
Results are reported as relative NOPM (NOPM for a given version / NOPM for Postgres 12.22).
Results: vu=1, w=100
- no regressions, no improvements
- no regressions, no improvements
- no regressions, improvements in Postgres 17 and 18
- no regressions, improvements in Postgres 16, 17 and 18
- possible regressions in Postgres 14 and 15, improvements in 13, 16, 17 and 18
- no regressions, improvements in Postgres 13 through 18
- no regressions, improvements in Postgres 13 through 18
- no regressions, improvements in Postgres 16 through 18
Introducing TRAE SOLO integration with Supabase
The Sauna Algorithm: Surviving Asynchrony Without a Clock
While sweating it out in my gym's sauna recently, I found a neat way to illustrate the happened-before relationship in distributed systems. Imagine I suffer from a medical condition called dyschronometria, which makes me unable to perceive time reliably, such that 10 seconds and 10 minutes feel exactly the same to me.
In this scenario, the sauna lacks a visible clock. I'm flying blind here, yet I want to leave after a healthy session. If I stay too short, I get no health benefits. If I stay too long, I risk passing out on the floor. The question becomes: How do I, a distributed node with no local clock, ensure operating within a safety window in an asynchronous environment?
Thankfully, the sauna has a uniform arrival of people. Every couple of minutes, a new person walks in. These people don't suffer from dyschronometria and they stay for a healthy session, roughly 10 minutes.
My solution is simple: I identify the first person to enter after me, and I leave when he leaves. I don't measure time at all. Instead, I rely solely on the causal relationship between this person's arrival and departure. The presence of this other person acts as a reference, letting me safely synchronize my action in an entirely asynchronous setting.
Murat's Sauna Algorithm
- Enter the sauna.
- Identify the very next person to enter (let's call him A).
- Wait until A gets up to leave.
- Exit immediately after A.
By anchoring my exit to the departure of a person who arrived after me, I guarantee that my stay overlaps entirely with his healthy duration, plus the delta between my arrival and his. I have successfully bounded my sauna/execution time using causal links rather than a physical timer.
Now, for the distributed systems purists who might object to step 4 (arguing that "exiting immediately" implies measuring a time interval of zero): I can mend this. I exit after Person A leaves, but before the next person leaves. That keeps us in the realm of strictly causal reasoning.
Consider what would go wrong if I anchored my departure to someone already in the sauna when I entered. That person might be near the end of their session, causing me to leave in just a few seconds. In distributed systems, this is akin to reading the state of an outdated node to make a decision, leading to inconsistent outcomes. Anchoring only to events that occur after my entry ensures I account for all relevant causality.
There is, of course, a safe way to rely on the people already inside. I could take a snapshot of the room upon entry, identify everyone present, and refuse to leave until the entire set of them has departed. This guarantees a sufficient duration (assuming at least one person recently arrived), but it requires more memory. Instead of tracking a single reference point (the newcomer), I would have to maintain state for everyone in the sauna.
Let's connect back to distributed systems. In an asynchronous system, nodes rely on causal relationships between events to order them. Just as I watch the newcomer to determine my exit, a node can observe the messages sent by other nodes to determine safe ordering. For example, in a distributed database, a transaction may want to commit only after all causally prior transactions have completed, ensuring consistency without requiring synchronized clocks. Reasoning about happened-before relationships allows asynchronous systems to coordinate safely without synchronized clocks. See discussions on logical clocks, vector clocks, and hybrid logical clocks in my previous posts.
Before anybody gets too concerned, no, I don't actually suffer from dyschronometria. However, back when I was a professor, my PhD students often joked that I operated on logical clocks rather than physical ones.
Also, in reality, the gym's sauna does have a clock visible from the door. I check it, and aim to stay for about 15 minutes. But I still insist on using my sauna algorithm, well at least the first three steps, because I am a competitive SOB. I try to outcompete the average person and, knock on sauna-bench wood, I haven't had to break my rule yet. It would be absolutely catastrophic if everyone started using my algorithm, though. We'd all be waiting for the next person to leave, resulting in a deadlock. Maybe the memory-costly snapshot algorithm is better after all, as it avoids this problem.
Related post: Consistent snapshot analogies
January 07, 2026
SSDs, power loss protection and fsync latency
This post has results that measure the impact of calling fsync (or fdatasync) per write for files opened with O_DIRECT. My goal is to document the impact of the innodb_flush_method option.
The primary point of this post is to document the claim:
For an SSD without power loss protection, writes are fast but fsync is slow.
The secondary point of this post is to provide yet another example where context matters when reporting performance problems. This post is motivated by results that look bad when run on a server with slow fsync but look OK otherwise.
tl;dr
- for my mini PCs I will switch from the Samsung 990 Pro to the Crucial T500 to get lower fsync latency. Both are nice devices but the T500 is better for my use case.
- with a consumer SSD writes are fast but fsync is often slow
- use an enterprise SSD if possible; if not, run tests to understand fsync and fdatasync latency (a rough sketch of such a test follows this list)
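This is not the fio setup used below, but for a quick sanity check from Java, a minimal sketch with FileChannel.force() can approximate per-write sync latency. force(false) roughly corresponds to fdatasync and force(true) to fsync; it does not use O_DIRECT, so treat the numbers as indicative only:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncLatency {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("synctest.dat");              // scratch file in the current directory
        ByteBuffer buf = ByteBuffer.allocate(16 * 1024);  // 16kb writes, as in the post
        int iterations = 1000;
        long totalNanos = 0;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (int i = 0; i < iterations; i++) {
                buf.clear();
                ch.write(buf);                            // 16kb sequential write
                long start = System.nanoTime();
                ch.force(false);                          // change to force(true) to compare fsync
                totalNanos += System.nanoTime() - start;
            }
        }
        System.out.printf("avg sync latency: %.3f ms%n", totalNanos / (double) iterations / 1e6);
    }
}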
InnoDB, O_DIRECT and O_DIRECT_NO_FSYNC
When innodb_flush_method is set to O_DIRECT there are calls to fsync after each batch of writes. While I don't know the source like I used to, I did browse it for this blog post and then I looked at SHOW GLOBAL STATUS counters. I think that InnoDB does the following with it set to O_DIRECT:
- Do one large write to the doublewrite buffer, call fsync on that file
- Do the batch of in-place (16kb) page writes
- Call fsync once per database file that was written by step 2
When set to O_DIRECT_NO_FSYNC, the frequency of calls to fsync is greatly reduced, and they are only done in cases where important filesystem metadata needs to be updated, such as after extending a file. The reference manual is misleading WRT the following sentence. I don't think that InnoDB ever does an fsync after each write. It can do an fsync after each batch of writes:
O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.
Many years ago it was risky to use O_DIRECT_NO_FSYNC on some filesystems because the feature as implemented (either upstream or in forks) didn't do fsync for cases where it was needed (see the comment about metadata above). I experienced problems from this and I only have myself to blame. But the feature has been enhanced to do the right thing. And if the #whynotpostgres crowd wants to snark about MySQL not caring about data, let's not forget that InnoDB had per-page checksums long before Postgres -- those checksums made web-scale life much easier when using less than stellar hardware.
Innodb_data_fsyncs / Innodb_data_writes
From this table, which reports the ratio of Innodb_data_fsyncs to Innodb_data_writes, a few things are clear. First, there isn't an fsync per write with O_DIRECT, but there might be an fsync per batch of writes as explained above. Second, the rate of fsyncs is greatly reduced by using O_DIRECT_NO_FSYNC.
.00172 .00053 O_DIRECT_NO_FSYNC
Power loss protection
I am far from an expert on this topic, but most SSDs have a write buffer that makes small writes fast. One way to achieve that speed is to buffer those writes in RAM on the SSD while waiting for enough data to be written to an extent. But that speed means there is a risk of data loss if a server loses power. Some SSDs, especially those marketed as enterprise SSDs, have a feature called power loss protection that makes data loss unlikely. Other SSDs, let's call them consumer SSDs, don't have that feature, while some of them claim to make a best effort to flush writes from the write buffer on power loss.
One way to avoid the risk is to only buy enterprise SSDs. But they are more expensive, less common, and many are larger (22110 rather than 2280) because more room is needed for the capacitors or other HW that provides the power loss protection. Note that power loss protection is often abbreviated as PLP.
For devices without power loss protection it is often true that writes are fast but fsync is slow. When fsync is slow then calling fsync more frequently in InnoDB will hurt performance.
Results from fio
I used this fio script to measure performance for writes to files opened with O_DIRECT. The test was run twice per configuration, for 5 minutes per run, followed by a 5 minute sleep. This was repeated for 1, 2, 4, 8, 16 and 32 fio jobs but I only share results here for 1 job. The configurations tested were:
- O_DIRECT without fsync, 16kb writes
- O_DIRECT with an fsync per write, 16kb writes
- O_DIRECT with an fdatasync per write, 16kb writes
- O_DIRECT without fsync, 2M writes
- O_DIRECT with an fsync per write, 2M writes
- O_DIRECT with an fdatasync per write, 2M writes
- dell32: a large server I have at home. The SSD is a Crucial T500 2TB using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD. While the web claims it has PLP via capacitors, the fsync latency for it was almost 1 millisecond.
- gcp: a c3d-standard-30-lssd from the Google cloud with 2 local NVMe devices using SW RAID 0 and 1TB of Hyperdisk Balanced storage configured for 50,000 IOPs and 800MB/s of throughput. The OS is Ubuntu 24.04 and I repeated tests for both ext-4 and xfs, both with discard enabled. I was not able to determine the brand of the local NVMe devices.
- hetz: an ax162-s from Hetzner with 2 local NVMe devices using SW RAID 1. Via udisksctl status I learned the devices are Intel D7-P5520 (now Solidigm). These are datacenter SSDs and the web claims they have power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 without discard enabled.
- ser7: a mini-PC I have at home. The SSD is a Samsung 990 Pro using ext-4 with discard enabled and Ubuntu 24.04. This is a consumer SSD, the web claims it does not have PLP, and fsync latency is several milliseconds.
- socket2: a 2-socket server I have at home. The SSD is a Samsung PM-9a3. This is an enterprise SSD with power loss protection. The OS is Ubuntu 24.04 and the drives use ext-4 with discard enabled.
- for servers with consumer SSDs (dell, ser7) the latency is much larger on the ser7 that uses a Samsung 990 Pro than on the dell that uses a Crucial T500. This is to be expected given that the T500 has PLP while the 990 Pro does not.
- sync latency is much lower on servers with enterprise SSDs
- sync latency after 2M writes is sometimes much larger than after 16kb writes
- for the Google server with Hyperdisk Balanced storage the fdatasync latency was good but fsync latency was high. With the local NVMe devices, the latencies were larger than for enterprise SSDs but much smaller than for consumer SSDs.
- Write throughput drops dramatically when there is an fsync or fdatasync per write because sync latency is large.
- This server uses a consumer SSD so high sync latency is expected
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
- Local NVMe devices have lower sync latency and more throughput with and without a sync per write at low concurrency (1 fio job).
- At higher concurrency (32 fio jobs), the Hyperdisk Balanced setup provides similar throughput to local NVMe and would do even better had I paid more to get more IOPs and throughput. Results don't have nice formatting but are here for xfs on the local and Hyperdisk Balanced devices.
- fsync latency is ~2X larger than fdatasync on the local devices and closer to 15X larger on the Hyperdisk Balanced setup. That difference is interesting. I wonder what the results are for Hyperdisk Extreme.
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
- this has an enterprise SSD with excellent (low) sync latency
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
- this has a consumer SSD with high sync latency
- results had much variance (see the 2MB results below), as did results at higher concurrency. This is a great SSD, but not for my use case.
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
- this has an enterprise SSD with excellent (low) sync latency after small writes, but fsync latency after 2MB writes is much larger
- w/s - writes/s
- MB/s - MB written/s
- sync - latency per sync (fsync or fdatasync)
Introducing OpenEverest: An Independent Open Source Project for the Future of Data Platforms
Supabase Security Retro: 2025
Automating our changelog with Cursor commands
January 06, 2026
Implement multi-Region endpoint routing for Amazon Aurora DSQL
Urgent Security Update: Patching “Mongobleed” (CVE-2025-14847) in Percona Server for MongoDB
Blog Post: Good Bye Percona Everest, Hello OpenEverest!
January 05, 2026
3 Steps to Optimize Your Queries for Speed
When performance matters, how you store your application objects, write queries, and design indexes can be the difference between scanning a collection for minutes and getting results in milliseconds—without touching a single document.
In this article, we’ll walk through an example in MongoDB. We’ll start with a normalized model, move to an embedded model, add an index to avoid a full collection scan, and then refine both the index and query to read only what’s needed for the result. The execution plan and execution statistics will guide us at each step.
Scenario
Our domain is an e‑commerce order‑entry application with customers, products, orders, and order lines. We want to:
Find the most recent order in Switzerland containing the book “MongoDB Data Modeling and Schema Design”.
Initial model: normalized with references
If we build the data model without considering how applications will access it—or simply migrate from a relational database by creating one collection per table—we end up with a model where a single domain aggregate is split across multiple collections.
In this model, we insert a customer, some products (books), an order for this customer, and the corresponding order lines for those books:
// Customer
db.customers.insertOne({
"_id": "C0001",
"name": "Franck Pachot",
"email": "franck.pachot@example.com",
"address": { "city": "Geneva", "country": "Switzerland" }
})
// Products
db.products.insertOne({
"_id": "P0001",
"title": "MongoDB in Action, Third Edition: Building on the Atlas Data Platform",
"author": "Arek Borucki",
"price": 59.99, "category": "Database", "format": "Paperback"
})
db.products.insertOne({
"_id": "P0002",
"title": "MongoDB Data Modeling and Schema Design",
"author": ["Daniel Coupal", "Pascal Desmarets", "Steve Hoberman"],
"price": 54.99, "category": "Database", "format": "Paperback"
})
db.products.insertOne({
"_id": "P0003",
"title": "High Performance with MongoDB: Best Practices for Performance Tuning, Scaling, and Architecture",
"author": ["Asya Kamsky", "Ger Hartnett", "Alex Bevilacqua"],
"price": 49.99, "category": "Database", "format": "Paperback"
})
// Order
db.orders.insertOne({
"_id": "O0001",
"customerId": "C0001",
"orderDate": ISODate("2024-10-22T09:00:00Z"),
"status": "Processing"
})
// Order lines
db.orderlines.insertMany([
{ "orderId": "O0001", "productId": "P0001", "quantity": 1 },
{ "orderId": "O0001", "productId": "P0002", "quantity": 1 },
{ "orderId": "O0001", "productId": "P0003", "quantity": 1 }
])
This is a classic relational database pattern, but querying and optimizing it is complex because joins must occur before filtering.
In MongoDB, to rebuild the domain object aggregate, we use an aggregation pipeline with $lookup and $match:
db.orders.aggregate([
// Get customer info to filter by country
{ $lookup: {
from: "customers",
localField: "customerId",
foreignField: "_id",
as: "customer"
}},
{ $unwind: "$customer" },
// Get product info to filter by title
{ $lookup: {
from: "orderlines",
let: { orderIdVar: "$_id" },
pipeline: [
{ $match: { $expr: { $eq: ["$orderId", "$$orderIdVar"] } } },
{ $lookup: {
from: "products",
localField: "productId",
foreignField: "_id",
as: "product"
}},
{ $unwind: "$product" }
],
as: "orderLines"
}},
// Apply filters
{ $match: {
"customer.address.country": "Switzerland",
"orderLines.product.title": "MongoDB Data Modeling and Schema Design"
}},
// Sort and limit
{ $sort: { orderDate: -1 } },
{ $limit: 1 }
])
This query works, but it scans the entire collection. Indexes aren’t effective here because $match runs only after the $lookup and $unwind stages.
With one million orders, it would read all one million, perform three million lookups, filter on country and book title, sort a large intermediate result, and then return just one document.
That’s why joins can be slow: when data is split across many tables or collections, you lose the benefit of efficient index access—compound indexes can only target one collection. It also requires coding every query in the domain repository.
We need to rethink the data model.
Step 1: Embedding to fit the domain model
Knowing the database is for an order‑entry system, we can design a document model that matches the domain. An order should be a self‑contained document with all customer info and complete product details for each order line—because that’s how it’s displayed, printed, or emailed. If it’s a single logical document on paper, it should be a single document in MongoDB.
Here’s the embedded form:
db.orders.insertOne({
_id: 'O0002',
customerId: 'C0001',
orderDate: ISODate('2024-10-22T09:00:00.000Z'),
status: 'Processing',
customer: {
_id: 'C0001',
name: 'Franck Pachot',
email: 'franck.pachot@example.com',
address: { city: 'Geneva', country: 'Switzerland' }
},
orderLines: [
{ quantity: 1, product: { _id: 'P0001', title: 'MongoDB in Action...', price: 59.99 } },
{ quantity: 1, product: { _id: 'P0002', title: 'MongoDB Data Modeling and Schema Design', price: 54.99 } },
{ quantity: 1, product: { _id: 'P0003', title: 'High Performance with MongoDB...', price: 49.99 } }
]
})
Duplicated data isn’t a concern here—documents are compressed on disk, and changes to product descriptions don’t affect completed orders.
Querying becomes far simpler:
db.orders.find(
{
"customer.address.country": "Switzerland",
"orderLines.product.title": "MongoDB Data Modeling and Schema Design"
}
).sort({ orderDate: -1 }).limit(1)
No lookups needed. But it still does a collection scan, filters, sorts, and finally returns one document.
I add one million documents and run the query with execution statistics:
db.orders.insertMany(Array.from({length: 333333},()=>({customer:{address:{country:"Switzerland"}}})))
db.orders.insertMany(Array.from({length: 666666},()=>({customer:{address:{country:"Germany"}}})))
db.orders.find(
{
"customer.address.country": "Switzerland",
"orderLines.product.title": "MongoDB Data Modeling and Schema Design"
}
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 559,
totalKeysExamined: 0,
totalDocsExamined: 1000001,
executionStages: {
isCached: false,
stage: 'SORT',
nReturned: 1,
executionTimeMillisEstimate: 542,
works: 1000004,
advanced: 1,
needTime: 1000002,
needYield: 0,
saveState: 29,
restoreState: 29,
isEOF: 1,
sortPattern: { orderDate: -1 },
memLimit: 104857600,
limitAmount: 1,
type: 'simple',
totalDataSizeSorted: 0,
usedDisk: false,
spills: 0,
spilledRecords: 0,
spilledBytes: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'COLLSCAN',
filter: {
'$and': [
{ 'customer.address.country': { '$eq': 'Switzerland' } },
{
'orderLines.product.title': { '$eq': 'MongoDB Data Modeling and Schema Design' }
}
]
},
nReturned: 1,
executionTimeMillisEstimate: 516,
works: 1000002,
advanced: 1,
needTime: 1000000,
needYield: 0,
saveState: 29,
restoreState: 29,
isEOF: 1,
direction: 'forward',
docsExamined: 1000001
}
}
}
In order to return a single document (nReturned: 1), one million documents were read (totalDocsExamined: 1000001), and sorted (stage: 'SORT', works: 1000004).
Now that we have everything in one collection, we can optimize with an index. Our objective is to reduce totalDocsExamined to nReturned.
Step 2: Indexing for filtering, sorting, and pagination
Our query filters by:
- Equality on customer country
- Equality on product title
- A range on order date (due to sort().limit())
An ideal compound index:
db.orders.createIndex({
"customer.address.country": 1, // equality
"orderLines.product.title": 1, // equality
"orderDate": 1 // sort/range
})
Here is the execution plan:
db.orders.find(
{
"customer.address.country": "Switzerland",
"orderLines.product.title": "MongoDB Data Modeling and Schema Design"
}
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 0,
totalKeysExamined: 1,
totalDocsExamined: 1,
executionStages: {
isCached: false,
stage: 'LIMIT',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 2,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
limitAmount: 1,
inputStage: {
stage: 'FETCH',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
docsExamined: 1,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
keyPattern: {
'customer.address.country': 1,
'orderLines.product.title': 1,
orderDate: 1
},
indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1',
isMultiKey: true,
multiKeyPaths: {
'customer.address.country': [],
'orderLines.product.title': [ 'orderLines' ],
orderDate: []
},
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'backward',
indexBounds: {
'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
'orderLines.product.title': [
'["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
],
orderDate: [ '[MaxKey, MinKey]' ]
},
keysExamined: 1,
seeks: 1,
dupsTested: 1,
dupsDropped: 0
}
}
}
}
With this index, the query jumps straight to the target document:
- totalKeysExamined: 1 from one range (seeks: 1 in stage: 'IXSCAN')
- totalDocsExamined: 1 from stage: 'FETCH', without the need to sort
- nReturned: 1
The result is retrieved in sorted order directly from the index, minimizing the number of documents to read. We can do better: our next goal is to eliminate totalDocsExamined.
Step 3: Projection and covering index
If the application only needs the customer’s name and country, we can project:
db.orders.find(
{
"customer.address.country": "Switzerland",
"orderLines.product.title": "MongoDB Data Modeling and Schema Design"
},
{ "customer.name": 1, "customer.address.country": 1, _id: 0 }
)
[
{
customer: { name: 'Franck Pachot', address: { country: 'Switzerland' } }
}
]
Now, we can create a covering index:
db.orders.createIndex({
"customer.address.country": 1,
"orderLines.product.title": 1,
"orderDate": 1,
"customer.name": 1
})
Here is the execution plan:
db.orders.find(
{
"customer.address.country": "Switzerland",
"orderLines.product.title": "MongoDB Data Modeling and Schema Design"
},
{ "customer.name": 1, "customer.address.country": 1, _id: 0 }
).limit(1).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 0,
totalKeysExamined: 1,
totalDocsExamined: 0,
executionStages: {
isCached: false,
stage: 'LIMIT',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 2,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
limitAmount: 1,
inputStage: {
stage: 'PROJECTION_DEFAULT',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
transformBy: { 'customer.name': 1, 'customer.address.country': 1, _id: 0 },
inputStage: {
stage: 'IXSCAN',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
keyPattern: {
'customer.address.country': 1,
'orderLines.product.title': 1,
orderDate: 1,
'customer.name': 1
},
indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1_customer.name_1',
isMultiKey: true,
multiKeyPaths: {
'customer.address.country': [],
'orderLines.product.title': [ 'orderLines' ],
orderDate: [],
'customer.name': []
},
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
'orderLines.product.title': [
'["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
],
orderDate: [ '[MinKey, MaxKey]' ],
'customer.name': [ '[MinKey, MaxKey]' ]
},
keysExamined: 1,
seeks: 1,
dupsTested: 1,
dupsDropped: 0
}
}
}
}
This is the optimal plan: all documents returned (nReturned: 1) come from the index entries that were scanned (totalKeysExamined: 1) without any FETCH (totalDocsExamined: 0).
Conclusion
MongoDB query performance isn’t just about adding indexes—it’s about aligning your data model and queries with how your application really uses data. Turning a four-collection join with millions of lookups into a single index scan that runs in milliseconds comes down to three rules:
- Model for Access: Embed data that’s accessed together. If it belongs on one piece of paper—like an order with its customer and product details—it belongs in one document.
- Index for Your Query: Use compound indexes that follow your filter, sort, and projection. Equality fields first, range fields last. This turns full scans into targeted index scans, cutting totalDocsExamined from millions to one.
- Project Only What You Need: Return only required fields. When a query is fully covered by an index, MongoDB can achieve totalDocsExamined: 0 and return the expected result fast.
You don’t need to apply this to all queries—use these steps only when necessary. Excessive lookups and limited indexing options drove us to optimize the data model. After that, the remaining need for selective filters and the absence of a suitable index led to creating ... (truncated)
Are Database System Researchers Making Correct Assumptions about Transaction Workloads?
In this blog, we have reviewed quite a number of deterministic database papers, including Calvin, SLOG, and Detock, which aim to achieve higher throughput and lower latency. The downside of these systems is sacrificing transaction expressivity. They rely on two critical assumptions: first, that transactions are "non-interactive", meaning they are sent as a single request (one-shot) rather than engaging in a multi-round-trip conversation with the application; and second, that the database can know a transaction's read/write set before execution begins (to lock data deterministically).
So when these deterministic database researchers write a paper to validate how these assumptions hold in the real world, we should read with some skepticism and caution. Don't get me wrong, this is a great and valuable paper; we still need to be critical in our reading.
Summary
The study employed a semi-automated annotation tool to analyze 111 popular open-source web applications and over 30,000 transactions. The authors target applications built with Django (Python) and TypeORM (TypeScript). These Object-Relational Mappers (ORMs) abstract away the SQL layer, allowing developers to interact with a wide variety of underlying relational databases (such as PostgreSQL, MySQL, and SQLite) using standard application code. The authors categorized transactions based on two primary dimensions: Transaction Interactivity and Read/Write Set Inferability.
Transaction Interactivity
The study found that interactive transactions, where the application holds a transaction open while performing logic or reads, are surprisingly rare. 39% of applications contain no interactive transactions. For the remaining 61% of applications, only roughly 9.6% of the workload is interactive.
The authors then distinguish between "Strictly Interactive" transactions, which inherently require external input or side effects mid-flight (e.g., waiting for an API response or user prompt), and "Non-Strictly Interactive" ones, where the back-and-forth is merely deterministic application logic processing reads between queries without outside dependencies. They find strictly interactive transactions account for only 0.5% of the total workload. They argue that the vast majority of remaining interactive transactions are convertible to one-shot transactions with minimal code changes.
Read/Write Set Inferability
The authors tested whether a database could statically predict which rows a transaction would touch. They found that for 90% of applications, at least 58% of transactions have read/write sets that are fully inferable in advance.
The primary obstacle to perfect knowledge is the CDA Mismatch (Conflict Detection Attribute Mismatch), which occurs in roughly 27% of transactions. This happens when a transaction queries data using a secondary attribute (e.g., 'email') rather than the primary locking key (e.g., 'ID'). See Figure 1 above. When this happens, the database cannot infer the lock without first inspecting the data. However, the study remains upbeat about this. Since CDA mismatches typically occur in simple single-statement transactions, they say that the "fallback" mechanism (running a lightweight reconnaissance query to find the ID) is inexpensive.
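To make that fallback concrete, here is an illustrative JDBC sketch of the reconnaissance/scout query pattern; the table and column names are invented and this is not code from the paper:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ScoutQueryExample {
    // The reconnaissance read outside the transaction translates the secondary attribute (email)
    // into the primary locking key (id), so the transaction's write set is known up front.
    static void deactivateUserByEmail(Connection conn, String email) throws SQLException {
        Long userId = null;
        try (PreparedStatement ps = conn.prepareStatement("SELECT id FROM users WHERE email = ?")) {
            ps.setString(1, email);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    userId = rs.getLong(1);
                }
            }
        }
        if (userId == null) {
            return; // nothing to do
        }
        // One-shot transaction keyed on the primary key; a deterministic database could now
        // infer the read/write set before execution.
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement("UPDATE users SET active = false WHERE id = ?")) {
            ps.setLong(1, userId);
            ps.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}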
Critical Review
This paper provides very valuable empirical data on transaction workloads by surveying a large number of applications. It echoes the scale of the SIGMOD 2015 "Feral Concurrency Control" study, which examined the integrity risks of the mismatch between databases and modern ORM frameworks like Rails. However, we must examine this study critically to identify potential limitations in scope and bias. My two big criticisms are the following.
Selection Bias. The choice of the corpus, open-source web applications using ORMs, heavily skews the results and excludes almost all enterprise systems. The study also ignores ad-hoc transactions executed by human operators (DBAs, analysts) via terminals. These transactions are inherently interactive ("user think time") and often performance-critical.
Convertibility Optimism. The paper argues that "Non-Strictly Interactive" transactions are easily convertible to deterministic models without resorting to brittle stored procedures. They advocate for a reconnaissance/scout query pattern, where the application performs a dry-run read phase to optimistically build a write set before sending it as a single atomic batch. While this is promising, the claim of minimal changes seems very optimistic. However, I think this conversion challenge is a great opportunity for research and for collecting empirical evidence of the actual engineering costs.