Stack traces for Postgres errors with backtrace_functions
This is an external post of mine. Click here if you are not redirected.
Aleksey and I sat down to read this paper on Monday night. This was an experiment that aimed to show how experts read papers in real time. We hadn't read the paper beforehand, to keep things raw. As it goes with research, we ended up arguing with the paper (and with each other) back and forth. It was messy, and it was also awesome. We had a lot of fun. Check out our discussion video below (please listen at 1.5x, I sound less horrible at that speed; ah, also this thing is 2 hours long). The paper I annotated during our discussion is also available here.
This paper appeared in HotOS 2025, so it is very recent. It's a position paper arguing that the traditional F-threshold fault model in consensus protocols is outdated and even misleading.
Yes, the F-threshold fault model does feel like training wheels we never took off. In his essay "The Joy of Sects", Pat Helland brings up this topic to tease distributed systems folks: "Distributed systems folks. These people vacillate between philosophers and lawyers. No one else can talk so fluently about total order without discussing the scope of totality. Availability is always couched by assuming a loss of, at most, F servers, while never veering into what happens when an operator logs into the system. Integral to their psyche is the belief in deterministic outcomes in the face of nondeterministic environments. Sometimes these discussions get seriously F'd up!" I am proud to report that, during a meeting with Pat, I came up with the pun that this is "F'ed up, for F=1 and N=3". Although I must concede that Pat is the real A-tell-a The Pun!
So, the premise is agreeable: the clean abstraction of "up to F faults" is too simplistic for how real distributed systems fail. The authors argue that treating all faults equally (as the F-model does) hides real-world complexities. Machines don't fail uniformly. Failures evolve over time (the bathtub curve), cluster around software updates, and depend on hardware, and even on location in the datacenter (temperature). This 6-page paper has 77 references, so the authors seem to have done an extensive literature search on this topic.
Building on this observation, the paper advocates a paradigm shift: replacing the fixed F-model with a probabilistic approach based on per-node failure probabilities, derived from telemetry and predictive modeling. While the paper doesn't propose a new algorithm, it suggests that such a model enables cost-reliability tradeoffs. For instance, it mentions that running Raft on nine unreliable nodes could match the reliability of three high-end ones at a third of the cost. (It is unclear whether this accounts/prorates for throughput differences.)
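For intuition on such claims, here is a back-of-envelope sketch (mine, not the paper's model): assuming independent failures and a fixed per-node up-probability p, majority-quorum availability is a simple binomial sum. The function names and probabilities below are made up for illustration.

// Probability that at least q of n nodes are up, with each node
// independently up with probability p (a simplification the paper
// itself argues against).
function binom(n, k) {
  let r = 1;
  for (let i = 1; i <= k; i++) r = (r * (n - i + 1)) / i;
  return r;
}
function quorumAvailability(n, q, p) {
  let total = 0;
  for (let k = q; k <= n; k++) total += binom(n, k) * p ** k * (1 - p) ** (n - k);
  return total;
}
console.log(quorumAvailability(3, 2, 0.999)); // 3 pricey nodes: ~0.999997
console.log(quorumAvailability(9, 5, 0.99));  // 9 cheap nodes:  ~0.99999999

Under these (admittedly toy) numbers, the nine cheap nodes do come out ahead, which is the shape of the paper's argument.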
But the deeper we read into the paper, the more we found ourselves asking: what exactly is a fault curve (p_u), and what is this new model? Is p_u = 1% per year, per month, per quorum formation? The paper never quite defines fault curves, even though they are central to the argument.
We got even more confused by the paper's conflation of safety and liveness for Raft. FLP (1985) taught us to keep safety and liveness separate. Raft and Paxos are known for prioritizing safety above all. Even when there are more crash failures than F, safety is not violated. So when the paper reports "Raft is only 99.97% safe and live," the precision is misleading. What does that number even mean? How was it calculated? Also, there is a big difference between "safe OR live" and "safe AND live". Why were the two bunched together in Table 2 for Raft? What is meant here?
The paper says: "As faults are probabilistic, it is always possible for the number of faults to exceed F. Thus no consensus protocol can offer a guarantee stronger than probabilistic safety or liveness." Again, I suspect that "or" (in this case) between safety and liveness is carrying a lot of load. The safety of the Paxos family of protocols relies on the quorum intersection property, so even when F is exceeded, safety is not violated, although liveness could be lost. The paper says "Violating quorum intersection invariants triggers safety violations." But quorum intersection is calculated a priori: the sum of two quorum sizes has to be bigger than N (e.g., with N=5, any two quorums of size 3 must share at least one node), so this is guaranteed by arithmetic, and it is not a probabilistic guarantee. We had to hypothesize a lot about why the paper seems to claim some safety violation: Is it maybe some durability loss? Is this assuming Byzantine failure? We still don't have an answer.
The paper does better with PBFT, separating safety and liveness in their reliability table. But even there, the model feels underspecified. There's a leap from "fault curves exist" to "this quorum configuration gives X nines of safety" without laying out a mathematical foundation.
Another counterintuitive point in the paper was the idea that more nodes can make things worse, probabilistically. For instance, the paper claims that a 5-node PBFT deployment could be significantly more reliable than a 7-node one, because quorum intersection becomes more fragile as the system scales with unreliable nodes. Again, we couldn't really make sense of this claim either, as there was not much explanation for it.
This is a position paper, and it plays that role well. It surfaces important observations, challenges sacred abstractions, and provokes discussion. It aims to bring consensus modeling into a more probabilistic/complex (real?) world where failure rates vary, telemetry exists, and tradeoffs matter. It advocates for getting rid of the rigid F upper bound for fault-tolerance. But complexity cuts both ways. A richer/more complex model may capture more nuance, but it can also make reasoning and safety proofs much harder. And clarity, simplicity, and guaranteed fault-tolerance are essential for consensus.
Actually, the paper made me appreciate the F abstraction for faults even more. It is simple, but it makes reasoning simpler in return. It is still possible to be probabilistic and do all that analysis when selecting the F number. These days, due to constant software rollovers, many systems go with F=2 and N=5, or even higher numbers. And the nice thing about the Paxos family of protocols is that, thanks to quorum intersection, safety is always guaranteed, non-probabilistically, even when the F limit is exceeded by extra crash faults (in reality, network faults and partitions also bunch in here). And there has been good progress in decoupling F from N (thanks to the flexible quorums result), which addresses some of the complaints in the paper (e.g., "Linear size quorums can be overkill"). Moreover, heterogeneous deployments are already considered, and leader selection heuristics exist.
If the goal is to replace the F abstraction, more thought should be put into what new abstraction is proposed to take over. Abstractions are at the core of Computer Science and Distributed Systems. As one of my favorite Dijkstra quotes says: "The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."
This has performance results for Postgres 17.4, 17.5, 18 beta1 and 18 beta2 on a large server with sysbench microbenchmarks. Results like this from me are usually boring because Postgres has done a great job at avoiding performance regressions over time. This work was done by Small Datum LLC and not sponsored. Previous work from me for Postgres 17.4 and 18 beta1 is here.
The workload here is cached by Postgres and my focus is on regressions from new CPU overhead or mutex contention.
tl;dr
Throughput is reported as relative QPS: (QPS for some version) / (QPS for PG 17.4)
This EuroSys '23 paper reads like an SOSP best paper. Maybe it helped that EuroSys 2023 was in Rome. Academic conferences are more enjoyable when the venue doubles as a vacation.
Morty tackles a fundamental question: how can we improve concurrency under serializable isolation (SER), especially without giving up on interactive transactions? Unlike deterministic databases (e.g., Calvin) that require transactions to declare read and write sets upfront, Morty supports transactions that issue dynamic reads and writes based on earlier results.
Transactional systems, particularly in geo-replicated settings, struggle under contention. High WAN latency stretches transaction durations, increasing the window for conflicts. The traditional answer is blind exponential backoff, but that leads to low CPU utilization. TAPIR and Spanner replicas often idle below 17% CPU utilization under contention, as Morty's evaluation experiments show.
Morty's approach to the problem is to start from first principles and investigate what limits concurrency under serializability. For this, Morty formalizes serialization windows, defined per transaction and per object, stretching from the commit of the value read to the commit of the value written. Serializability requires these windows not to overlap. (Note that avoiding overlaps is necessary but not sufficient for SER: you also need to prevent cycles through read-only transactions or indirect dependencies, which Morty addresses with commit-time validation checks.)
Figures 1 and 2 illustrate these serialization windows. With re-execution, Morty commits T2 with minimal delay, instead of aborting and retrying. Figure 3 shows another case where re-execution avoids wasted work.
Morty's re-execution mechanism hinges on two ideas: read unrolling and a priori ordering.
Read unrolling allows Morty to selectively rewind and recompute parts of a transaction that depended on outdated reads. Rather than aborting the entire transaction, Morty re-executes just the stale portion. This is possible because transactions are written in continuation-passing style (CPS), which makes control flow and data dependencies explicit. CPS is common in asynchronous programming (JavaScript, Go, Java, Python, and libraries like NodeJS, LibEvent, and Tokio) and it maps well to networked databases like Morty.
Morty’s CPS API supports re-execution directly. Get(ctx, key, cont) reads a key and resumes at cont. Put(ctx, key, val) tentatively writes. Commit(ctx, cont) initiates prepare and finalize. Abort drops the execution. Abandon means the transaction is re-executed (often partially) using a continuation from an earlier point. Re-execution reuses the context, shifts the stale read forward, and resumes execution.
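To make this concrete, here is a hypothetical sketch of a transaction written in this style, with minimal in-memory stubs so it runs. The stubs and the transfer example are my own; only the Get/Put/Commit/Abort names come from the paper's API.

// In-memory stand-ins for Morty's API so the sketch runs (argument
// shapes are my assumption; real Morty is distributed and validates
// speculative reads at prepare time).
const store = { a: 100, b: 0 };
const Get = (ctx, key, cont) => cont(ctx, store[key]);
const Put = (ctx, key, val) => { ctx.writes[key] = val; };
const Commit = (ctx, cont) => { Object.assign(store, ctx.writes); cont(ctx, true); };
const Abort = (ctx) => { ctx.writes = {}; };

// A transfer transaction in continuation-passing style: every read
// names the continuation that consumes its result, so a stale read
// can be re-entered (Abandon) without restarting the whole transaction.
function transfer(ctx, from, to, amount) {
  Get(ctx, from, (ctx, fromBal) => {
    if (fromBal < amount) return Abort(ctx);
    Get(ctx, to, (ctx, toBal) => {
      Put(ctx, from, fromBal - amount); // tentative writes
      Put(ctx, to, toBal + amount);
      Commit(ctx, (ctx, ok) => console.log(ok ? "committed" : "re-execute or abort"));
    });
  });
}

transfer({ writes: {} }, "a", "b", 40);
console.log(store); // { a: 60, b: 40 }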
A priori ordering assigns each transaction a speculative timestamp at arrival, defining a total order a la MVTSO. This order is not revised, even if clocks are skewed or messages are delayed. Instead, if execution violates the speculative order (e.g., a read misses a write that should've come earlier), Morty detects the conflict and re-executes the transaction to realign with the original order. The system adapts execution to fit the speculative schedule, not vice versa. The paper claims aborts are rare since re-execution usually succeeds.
I think a key idea in Morty is that, contrary to most approaches, Morty ignores read validity (the guarantee that committed transactions only observe committed data) during execution, to expose more concurrency to transactions. It exposes both committed and uncommitted writes to transactions by leveraging MVTSO, and allows reads from uncommitted versions. These speculative reads are later validated at prepare time, prior to commit. If a read depended on a write that never committed, or missed a newer write, Morty re-executes the transaction (through the Abandon call) or aborts it as a last resort.
In addition to serialization windows, Morty defines validity windows to measure how long a transaction waits for its inputs to commit. A transaction Ti's validity window on object x starts when its dependency commits and ends when Ti commits. Like serialization windows, overlapping validity windows are disallowed. But unlike serialization windows, Morty doesn't try to align validity windows, and instead focuses on minimizing their span. Long validity windows mean low throughput. Morty shortens validity windows by avoiding unnecessary delays between reads and commits, preventing cascading speculative reads, and favoring re-execution over abort-and-retry.
Re-execution typically occurs during the commit protocol, when replicas must check commit status across a quorum. If they detect a stale read or a violated serialization window, they trigger re-execution before finalizing. Validation checks include: every read observed a write that actually committed, and no read missed a newer write within its serialization window.
But why do replicas vote at all? Because Morty doesn't use Raft-style replica groups, with a leader calling the shots. In contrast to the Raft-groups approach, Morty doesn't have a central log or a leader for serializing/ordering all commands. It is closer to TAPIR, and it uses timestamps to assign a speculative order. By integrating concurrency control with replication, Morty aims to improve throughput under contention and achieve low-latency geo-replication. So, quorum-based voting ensures consistency and fault-tolerance, as in TAPIR.
Voting ensures that a commit is durable across failures, visible to a majority, and recoverable even if the coordinator crashes. Without this, there's no way to guarantee correctness in a crash or partition.
Recovery is still tricky. Morty replicates across 2f+1 nodes and tolerates f failures. Coordinators may stall, so Morty uses a Paxos-style recovery protocol with view changes: any replica can step up and finalize the commit decision for a failed coordinator. This isn't trivial as it requires care to avoid split-brain and maintain consistency.
Morty's re-execution resembles CockroachDB’s read-refresh a bit. CRDB refreshes read timestamps if read spans haven't been overwritten, but it doesn't re-execute application logic. If one key's value changes, Morty rewinds only the dependent continuation. In contrast to CRDB, which must restart the whole transaction if refresh fails, Morty semantically rewinds and reruns logic with new values.
The results are impressive. On TPC-C, Morty achieves 7.4x the throughput of Spanner-style-TAPIR, 4.4x of TAPIR, and 1.7x of MVTSO. On the high-contention Retwis benchmark, Morty delivers 96x throughput over TAPIR.
Morty scales with CPU. On Zipfian Retwis, it grows from 7.8k to 35.3k txn/s with 20 cores. Spanner and TAPIR plateau early (at 17% CPU utilization) due to frequent aborts and exponential backoff.
Morty is one of the most technically rich papers on serializability in recent years. It's dense and demanding. It assumes deep familiarity with concurrency control, replication, and async programming. But for those in the distributed systems and databases intersection, Morty is a very rewarding read.
One gripe: the code link is broken. https://github.com/matthelb/morty/
This has results for the Insert Benchmark with Postgres on a large server.
There might be small regressions, but I have more work in progress to explain that.
Builds, configuration and hardware
This paper (SIGMOD '08) proposes a lightweight runtime technique to make Snapshot Isolation (SI) serializable without falling back to locking. The key idea behind Serializable SI (SSI) is to detect potentially dangerous (write-skew) executions at runtime and abort one of the transactions to guarantee serializability (SER).
The goal is to offer the strong guarantees of SER without sacrificing SI's high performance and non-blocking reads. But would it make sense to implement SER by layering on MVCC SI instead of implementing it directly? Do you think an SI-based implementation would be more performant than native 2PL-based SER implementations? What about compared to OCC-based SER? The evaluation section gives some answers.
Let's back up.
Write-skew is the canonical anomaly under SI. And the canonical example for write-skew is the "doctors on-call" scenario. Consider two doctors removing themselves from a duty roster. Each transaction checks that at least one doctor remains on duty and proceeds to update its own status. Under SI, each transaction sees the old snapshot where the other doctor is still on duty. Both commit. The resulting state violates the application's invariant, even though each transaction was locally consistent. Compare this to SER, which ensures that integrity constraints are maintained even if those constraints are not explicitly declared to the DBMS. (Another famous write-skew example is reading both the savings and checking accounts, and withdrawing from one. Yet another example is graph coloring: reading a neighbor's color to switch your own in response.)
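To see the mechanics, here is the doctors example as a tiny runnable sketch (my own illustration, not from the paper): both transactions read the same start-of-transaction snapshot, write disjoint items, and SI's first-committer-wins check lets both commit.

// Write-skew illustration (my own sketch). Under SI, each transaction
// reads from the snapshot taken at its start, and only write-write
// conflicts cause aborts, so both of these commit.
const snapshot = { alice: "on-call", bob: "on-call" }; // start-of-txn snapshot

function goOffCall(me, other) {
  // Invariant check against the (stale) snapshot: someone stays on call.
  if (snapshot[other] === "on-call") return { [me]: "off-call" };
  return {}; // would refuse to leave
}

const writesA = goOffCall("alice", "bob");
const writesB = goOffCall("bob", "alice");
// Disjoint write sets, so SI's first-committer-wins check passes for both.
const committed = { ...snapshot, ...writesA, ...writesB };
console.log(committed); // { alice: 'off-call', bob: 'off-call' } -- invariant broken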
A naive fix for preventing write-skew would be to track the read set of a transaction T1 and, at commit time, check whether any item read by T1 was overwritten by a concurrent writer. If so, we abort T1. This eliminates outgoing rw-dependencies and thus prevents write-skew cycles. It works, but it's overkill: you end up aborting transactions even when no cycle would form. For example, in the execution "b1 r1(x) w2(x) w1(y) c1 c2", T1 would be aborted because its read r1(x) is overwritten by T2's w2(x), but it is possible to serialize this execution as T1 followed by T2, as the write sets do not conflict under SI.
The paper proposes a smarter fix: only abort if a transaction has both an inConflict and an outConflict edge with other transactions. That is, it sits in the middle of two concurrent rw-dependencies (the paper calls this a pivot in a dangerous structure). Prior work by the authors (Fekete et al. 2005) proves that all SI anomalies contain such a pivot. So this check suffices.
The approach still over-approximates somewhat: having both in and out conflicts doesn't always imply a cycle. See Fig 9 for an example. Although T0 has both inConflict and outConflict set, it is possible to serialize these transactions in the order TN, T0, T1. The paper doesn't quantify the false positive rate or explore pathological workloads.
The paper says full serialization-graph cycle detection would be too expensive, and this is the compromise. It is still a tighter and more efficient condition than the naive approach. The Ferro-Yabandeh EuroSys '12 paper uses something close to the naive approach. Same with naive OCC-based SER implementations: they abort based on a single outConflict edge.
The implementation tracks two flags per transaction: inConflict and outConflict. It also adds a new lock mode, SIREAD, to record reads. Reads acquire SIREAD locks; writes acquire WRITE locks. When a read at T1 observes a version written by T2, we set T1.outConflict and T2.inConflict. This is called an rw-dependency (a.k.a. antidependency) from T1 to T2. When a write at T1 detects an existing SIREAD by T2, we set T1.inConflict and T2.outConflict. See the pseudocode in Figures 5-8 for how this is implemented.
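Here is a minimal sketch of that two-flag bookkeeping (my own illustration in JavaScript; the authoritative version is the paper's pseudocode, hooked into the lock manager):

// Illustration of SSI's two-flag bookkeeping (not the paper's code).
function newTxn(id) {
  return { id, inConflict: false, outConflict: false, committed: false, aborted: false };
}

// Record an rw-dependency (antidependency) from reader to writer:
// the reader observed a version that the writer overwrites.
function markRwDependency(reader, writer, abortTxn) {
  reader.outConflict = true;
  writer.inConflict = true;
  for (const t of [reader, writer]) {
    // A transaction with both flags set is a pivot in a dangerous structure.
    if (t.inConflict && t.outConflict && !t.committed && !t.aborted) {
      abortTxn(t); // prefer aborting the pivot, unless it already committed
    }
  }
}

const t1 = newTxn(1), t2 = newTxn(2), t3 = newTxn(3);
markRwDependency(t1, t2, (t) => (t.aborted = true)); // T1 -> T2
markRwDependency(t2, t3, (t) => (t.aborted = true)); // T2 -> T3: T2 becomes a pivot
console.log(t2.aborted); // true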
When both flags are set for any transaction, one transaction in the conflict chain is aborted. The implementation prefers to abort the pivot (the transaction with both in and out conflicts), unless it has already committed. Else an adjacent transaction is aborted. If there are two pivots in the cycle, whichever is detected first gets aborted. This choice is sound and gives room for policy tuning (e.g., prefer aborting younger transactions).
Note that all of this is layered on top of SI. Readers never block writers. Writers never block readers. The only cost is the extra bookkeeping to track conflicts.
The authors implemented SSI over Berkeley DB (which provides SI through MVCC) with minimal changes, around 700 lines of modified code out of a total of over 200,000 lines of code in Berkeley DB. The experiments with the SmallBank benchmark show that the performance is close to plain SI and significantly better than 2PL under contention.
The paper does not contrast SSI with an OCC implementation of SER. But it argues that naive OCC implementations abort on any rw-conflict, while SSI waits until both in and out conflicts are present, which should result in fewer false positives and better throughput.
SI is attractive because it avoids blocking: readers don't block writers and vice versa. Maybe the takeaway is this: Stick with SI. But if you really must go serializable, consider layering SSI on top of SI instead of using 2PL or OCC based SER implementations.
But, still, there are many caveats to SSI as presented in this paper: phantoms are not addressed, false positives remain, B-tree level conflicts inflate aborts, and adapting this to distributed systems is non-trivial.
This paper describes a research prototype, after all. A VLDB '12 paper by Dan Ports followed this algorithm to provide a PostgreSQL implementation that adds integration with real-world features: replication and crash recovery, two-phase commit (tupac), subtransactions and savepoints, and memory bounding via transaction summarization. It also handled phantoms by leveraging SIREAD locks on index ranges to detect predicate read/write conflicts.
The paper punts on phantoms. It says that Berkeley DB uses page-level locking, so predicate-based anomalies don’t occur. But for engines with row-level locking, it suggests that SSI would need to acquire SIREAD locks on index ranges or use multigranularity locking.
The design assumes a single-node system. In a distributed DB, you would need to propagate SIREAD metadata and conflict flags across partitions. Detecting cross-partition dependencies and coordinating pivot aborts would be hard. The paper does not talk about implementation in distributed databases.
A known artifact in the Berkeley DB implementation is page-level locking. Updates to shared pages, like the B-tree root during a split, look like write-write conflicts across unrelated transactions. These inflate false positives and aborts under SSI.
I mean, I have to complain about this (otherwise it won't be a blog post from Murat). Figure 4 was terribly confusing for me. I had a hard time figuring out why these two scenarios are not serializable. It turns out I got confused by the "helpful text" saying: "In Figure 4(a), both reads occur after the writes. In Figure 4(b), TN reads x before it is written by T0." No, that is beside the point: reads in SI happen from the snapshot taken at transaction start, so when the reads happen is not relevant. Confused by this, I was wondering why we can't serialize the Figure 4(a) execution as T1, T0, TN.
OK, let's focus on Fig 4(a). T0's serialization point should be before T1, because of the rw-dependency from T0 to T1: T0's read happens before T1's write. T1 needs to be serialized before TN, because TN reads z from T1's write. But TN is problematic. TN introduces an rw-dependency from TN to T0: it reads x from its snapshot, which is earlier than T0's commit point. So TN should come before T0. This conflicts with the two previous requirements that T0 precedes T1 and T1 precedes TN. We have a cycle that prevents serialization.
Figure 4.b is very similar to Figure 4.a and has the same cycle. The only difference is TN's commit point comes a little later, but it doesn't change anything, TN does its read from the snapshot taken at the start of TN.
It would have been nice to supplement the figure with text explaining the cycles and giving some insight into the thought process.
At CedarDB, we set out to bring the fruits of the highly successful Umbra research project to a wider audience. While Umbra has undoubtedly always had the potential to be the foundation of a highly performant production-grade database system, getting a research project ready for production workloads and building a company at the same time is no trivial task.
When we launched a year ago, we were still figuring out the differences between building a research system at university, and building a system for widespread use. Since then, we have learned a lot.
Concurrency control in databases has two main modes for handling conflicts. Traditionally, these are labeled "pessimistic" and "optimistic," but those names can be misleading. Let's define them by what actually happens from a user point of view: with wait-on-conflict, an operation that hits a conflict waits until the conflicting transaction completes; with fail-on-conflict, it fails immediately and the application decides whether to retry.
In SQL databases, where multi-statement transactions are the norm because of normalization to multiple tables, you'll see this as an explicit WAIT or NOWAIT on locking reads (LOCK, SELECT FOR UPDATE), or as a consequence of the isolation level: READ COMMITTED tries to wait and detect deadlocks; SERIALIZABLE can fail with a serialization error.
MongoDB’s document-oriented approach is straightforward, providing developers the freedom to choose what best fits their needs.
The need to implement retry logic in the application is not a limitation of MongoDB. A database cannot transparently cancel a user-defined transaction and retry it because it lacks knowledge of the application's prior actions, like sending emails or writing to files. Only the application can cancel or compensate for actions taken before retrying the transaction.
In contrast to multi-statement transactions, single-statement transactions or auto-commit commands can be canceled and restarted before control is returned to the application. That's how MongoDB provides a lock-free wait-on-conflict behavior for single-document updates: it is internally a fail-on-conflict optimistic concurrency control (OCC), to avoid locks, but with internal retries to appear as a wait-on-conflict to the user.
Here’s what this looks like in practice. I start with a one-document collection:
AtlasLocalDev atlas [direct: primary] test>
db.demo.drop();
true
AtlasLocalDev atlas [direct: primary] test>
db.demo.insertOne({ _id: 'x' , value: 0 });
{ acknowledged: true, insertedId: 'x' }
I start a first transaction, in session A, that updates the document, but do not commit it yet:
//first transaction A updates one doc
AtlasLocalDev atlas [direct: primary] test>
sessionA = db.getMongo().startSession();
{ id: UUID('ac8f6741-4ebb-4145-acb2-6b861a390c25') }
AtlasLocalDev atlas [direct: primary] test> sessionA.startTransaction();
AtlasLocalDev atlas [direct: primary] test>
dbA = sessionA.getDatabase(db.getName());
test
AtlasLocalDev atlas [direct: primary] test>
dbA.demo.updateOne({ _id: 'x' }, { $inc: { value: 2 } });
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
While this transaction is ongoing, I start another one in session B and try to update the same document:
// second transaction B tries to update the same doc
AtlasLocalDev atlas [direct: primary] test>
sessionB = db.getMongo().startSession();
{ id: UUID('a5302cab-9688-43db-badd-e3691b30a15b') }
AtlasLocalDev atlas [direct: primary] test> sessionB.startTransaction();
AtlasLocalDev atlas [direct: primary] test>
dbB = sessionB.getDatabase(db.getName());
test
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 10:59:37 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test>
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 4 } });
MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 10:59:40 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test> sessionB.abortTransaction();
The conflict was immediately detected: the application gets a WriteConflict error and can retry.
This time, I'll retry without starting an explicit transaction:
// do the same without starting a transaction
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 10:59:49 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test>
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 8 } });
The session blocks, waiting until the document is free. I leave the transaction in session A open to show what happens with long transactions.
After one minute, the update is successful:
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 11:00:48 GMT+0000 (Greenwich Mean Time)
My two transactions are not serializable, as they update the same document at the same time, so if one was successful, the other must abort. This is what happened: the transaction in session A was automatically canceled after a one-minute timeout. I can see that if I try to commit:
AtlasLocalDev atlas [direct: primary] test>
sessionA.commitTransaction();
MongoServerError[NoSuchTransaction]: Transaction with { txnNumber: 2 } has been aborted.
AtlasLocalDev atlas [direct: primary] test>
sessionA.endSession();
Finally, the result is consistent, with the changes committed by session B:
AtlasLocalDev atlas [direct: primary] test>
db.demo.find();
[ { _id: 'x', value: 8 } ]
I looked at the call stack while it was waiting and here is the Flame Graph:
It shows the update attempts. updateWithDamages is not as terrible as it sounds: it is just an incremental update that logs the changes ("damages") rather than re-writing the whole document. logWriteConflictAndBackoff is the internal retry. Each write conflict in PlanExecutorExpress increments an attempt counter and calls logWriteConflictAndBackoff → logAndBackoff → logAndBackoffImpl with the number of attempts. The code shows that the wait depends on this attempt count.
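Transcribed to JavaScript for readability (the thresholds are my reading of the MongoDB source and may differ across versions), the schedule looks roughly like this:

// Rough transcription of MongoDB's logAndBackoffImpl sleep schedule
// (thresholds inferred from the source; treat as an assumption).
function backoffMillis(numAttempts) {
  if (numAttempts < 4) return 0;    // first few retries: no sleep
  if (numAttempts < 10) return 1;
  if (numAttempts < 100) return 5;
  if (numAttempts < 200) return 10;
  return 100;                       // steady state: 100 ms per retry
}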
This mitigates resource contention by increasingly slowing down repeated conflicts with an exponential-to-linear backoff while still getting a chance to get the conflict resolved. In one minute, we can estimate that it has performed approximately 4 + 6 + 90 + 100 + 585 = 785 write conflict retries under this backoff schedule, spending most of that time in the 100 ms sleep interval.
Internally, WiredTiger enforces snapshot isolation at the storage layer. When a MongoDB operation tries to modify a document, WiredTiger checks that the current snapshot timestamp still matches what’s on disk. If another operation has modified the document since that snapshot was taken, WiredTiger detects that the update’s view is stale. Instead of acquiring a long-term lock, WiredTiger uses optimistic concurrency control and returns WT_ROLLBACK to MongoDB’s server layer to signal the conflict.
At this point, the MongoDB server recognizes the error, aborts the in-progress write attempt, and releases any snapshot or locks it has held on behalf of the operation. It then yields to allow other work, starts a new WiredTiger transaction with a fresh snapshot timestamp, and retries the update operation.
After a successful update (with a clean, current snapshot and no conflict), or if a timeout is reached, the process ends, and MongoDB will either commit the change or return an error to the application.
I have run the wait-on-conflict situation again:
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 08:48:58 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test>
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 8 } });
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 08:50:18 GMT+0000 (Greenwich Mean Time)
During these 80 seconds, I displayed the writeConflicts metric with mongostat:
mongostat -o="insert,query,update,delete,metrics.operation.writeConflicts.rate(),time"
insert query update delete writeConflicts.rate() time
*0 *0 *0 *0 0 Jul 23 08:49:01.172
*0 *0 *0 *0 0 Jul 23 08:49:02.173
*0 *0 *0 *0 0 Jul 23 08:49:03.172
*0 *0 *0 *0 0 Jul 23 08:49:04.172
*0 *0 *0 *0 0 Jul 23 08:49:05.172
...
*0 *0 *0 *0 0 Jul 23 08:50:15.172
*0 *0 *0 *0 0 Jul 23 08:50:16.171
*0 *0 *0 *0 0 Jul 23 08:50:17.172
*0 *0 *0 *0 981 Jul 23 08:50:18.172
*0 *0 *0 *0 0 Jul 23 08:50:19.172
*0 *0 *0 *0 0 Jul 23 08:50:20.172
It shows 981 write conflict retries during those 80 seconds. Early retries go fast, adding just a few milliseconds. As time goes on, the loop spends the vast majority of those 80 seconds in the late-stage sleep(100ms) phase, pushing the retry count close to elapsed_time / 100ms (plus the early, faster backoffs).
Here is the calculation, assuming the backoff schedule sketched above:
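// Back-of-envelope estimate (assumes the backoffMillis schedule above,
// which is my reading of the source, not a documented contract).
const elapsedMs = 80 * 1000;                // ~80 s between the two Date() calls
const warmupMs = 6 * 1 + 90 * 5 + 100 * 10; // attempts 4..199 sleep 1456 ms in total
const warmupAttempts = 200;                 // attempts 0..199
const steadyAttempts = Math.floor((elapsedMs - warmupMs) / 100); // 100 ms each
console.log(warmupAttempts + steadyAttempts); // 985, close to the observed 981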
Note that with mongostat you won't see individual retries until the moment the operation finally succeeds and increments the write conflict counter. If you see large spikes, it's time to look at data modeling patterns or long-running operations.
Another way to look at it is with the MongoDB profiler:
db.setProfilingLevel(2);
Date();
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 8 } });
Date();
db.setProfilingLevel(0);
db.system.profile.find().sort({ ts: -1 }).limit(1).pretty();
Here is the output:
[
{
op: 'update',
ns: 'test.demo',
command: {
q: { _id: 'x' },
u: { '$inc': { value: 8 } },
multi: false,
upsert: false
},
keysExamined: 1014,
docsExamined: 1014,
nMatched: 1,
nModified: 1,
nUpserted: 0,
keysInserted: 0,
keysDeleted: 0,
writeConflicts: 1013,
numYield: 1013,
locks: {
ReplicationStateTransition: { acquireCount: { w: Long('1015') } },
Global: { acquireCount: { r: Long('1'), w: Long('1015') } },
Database: { acquireCount: { w: Long('1015') } },
Collection: { acquireCount: { w: Long('1015') } }
},
flowControl: { acquireCount: Long('1014') },
readConcern: { level: 'local', provenance: 'implicitDefault' },
storage: { data: { txnBytesDirty: Long('932') } },
cpuNanos: 145071200,
millis: 82965,
planSummary: 'EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE',
planningTimeMicros: 71,
totalOplogSlotDurationMicros: 135,
execStats: {
isCached: false,
stage: 'EXPRESS_UPDATE',
keyPattern: '{ _id: 1 }',
indexName: '_id_',
keysExamined: 1014,
docsExamined: 1014,
nReturned: 0,
nWouldModify: 1,
nWouldUpsert: 0,
nWouldDelete: 0
},
ts: ISODate('2025-07-23T17:18:48.082Z'),
client: '172.17.0.1',
appName: 'mongosh 2.5.0',
allUsers: [],
user: ''
}
]
The single update of one document (nModified: 1) was attempted 1,014 times (keysExamined: 1014, docsExamined: 1014), as the first 1,013 attempts conflicted with a concurrent transaction (writeConflicts: 1013). The entire process took about 83 seconds (millis: 82965), but resource usage remained minimal: each retry yielded control (numYield: 1013), so CPU time used was less than 0.2% (cpuNanos: 145071200). This conflict resolution uses optimistic concurrency control and is lock-free while waiting. MongoDB only briefly acquires write intent locks for each retry (acquireCount: { w: Long('1015') }), never blocking other work for the whole duration.
In MongoDB, single-document operations do not require explicit transactions and utilize a "wait-on-conflict" method, allowing for seamless, automatic conflict handling. This is achieved through optimistic concurrency control, where MongoDB retries updates on conflict with exponential backoff and operates in a lock-free manner while waiting. With a proper document model, single-document operations are the general case, as there's no need to normalize to multiple tables like in SQL databases.
In contrast, multi-document operations that involve explicit transactions adopt a "fail-on-conflict" strategy. When a conflict arises, MongoDB immediately returns an error, requiring your application to handle canceling and retrying the transaction. This is necessary because the transaction may involve operations that are not visible to the database. Such transactions must be kept short, and are canceled if they take longer than one minute. The reason for this limit is that, as in other MVCC databases, lengthy transactions consume more resources to maintain a consistent snapshot of the database state as of the start of the transaction.
MongoDB sits at the crossroads between SQL, where developers can write queries on a logical model without naming indexes, and the NoSQL world, where developers are accustomed to querying directly by mentioning indexes in their queries.
Except for Atlas Search and Vector Search indexes, where you must explicitly indicate whether to use approximate search, you typically query a collection and the MongoDB query planner determines which index to use.
One exception to this rule occurs when using an optimizer hint to reference an index. This feature is supported in most popular databases, except PostgreSQL, where it requires pg_hint_plan. When the hint isn't fully compatible with the query planner's execution plans, different databases may behave differently.
Here is my test case to show how it behaves in MongoDB:
mdb> db.demo.insertMany([
{ _id: 1, name: "Alice", age: 30 }, // has age
{ _id: 2, name: "Bob" }, // missing age
{ _id: 3, name: "Charlie", age: null }, // age is null
{ _id: 4, name: "Dave", age: 35, status: "active" }, // age + status
{ _id: 5, name: "Eve", age: 28, status: "inactive" } // age + status
]);
{
acknowledged: true,
insertedIds: { '0': 1, '1': 2, '2': 3, '3': 4, '4': 5 }
}
mdb> db.demo.find().sort({ age: 1 });
[
{ _id: 2, name: 'Bob' },
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
]
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
);
[
{ name: 'Alice', age: 30 },
{ name: 'Dave', age: 35 },
{ name: 'Eve', age: 28 }
]
Without any index, the query scans the collection:
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 0,
totalDocsExamined: 5,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'COLLSCAN',
filter: { age: { '$gte': 28 } },
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
direction: 'forward',
docsExamined: 5
}
}
}
With an index, the same query does an index scan:
mdb> db.demo.createIndex({ age: 1 }, { name: "idx_age" })
idx_age
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 1,
totalKeysExamined: 3,
totalDocsExamined: 3,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 3,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
I create a better index, that covers the projection, and it is used:
mdb> db.demo.createIndex({ age: 1, name: 1 }, { name: "idx_age_name" })
idx_age_name
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 1,
totalKeysExamined: 3,
totalDocsExamined: 0,
executionStages: {
isCached: false,
stage: 'PROJECTION_COVERED',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 5,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 5,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1, name: 1 },
indexName: 'idx_age_name',
isMultiKey: false,
multiKeyPaths: { age: [], name: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ], name: [ '[MinKey, MaxKey]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
I can decide to force the other index with a hint:
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 3,
totalDocsExamined: 3,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 3,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
A hint() is more than a hint here. It forces the index scan. I can even force its bounds with min() and max():
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").max({age:31}) // scan from begining to age: 30
[ { name: 'Eve', age: 28 }, { name: 'Alice', age: 30 } ]
Rather than the bounds defined by the filter ([ '[28, inf.0]' ]), I forced an index scan of the range [ '[-inf.0, 30]' ]. Then the filter { '$gte': 28 } is applied, and the result has only two documents.
This is visible in the execution plan.
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").max({age:31}).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 0,
totalKeysExamined: 4,
totalDocsExamined: 4,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 2,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
filter: { age: { '$gte': 28 } },
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 2,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 4,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 4,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 4,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {},
keysExamined: 4,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
The bounds are not correctly displayed, but it shows that 4 keys were examined, the 4 keys before {age:31}.
The most important thing to remember is that hints in MongoDB replace the query planner's decision, and may therefore give different results. This is different from SQL hints, which only prune the query planner's possibilities.
This is also visible with sparse and partial indexes:
// Named sparse index
db.demo.createIndex(
{ age: 1 },
{ sparse: true, name: "age_sparse" }
)
// Named partial index (only for status: 'active')
db.demo.createIndex(
{ age: 1 },
{
partialFilterExpression: { status: "active" },
name: "age_active_partial"
}
);
The result depends not only on the query, but also on the index definition:
mdb> db.demo.find(
).sort({ age: 1 }) // ORDER BY age NULLS FIRST
[
{ _id: 2, name: 'Bob' },
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
]
mdb> db.demo.find(
).hint("age_sparse").sort({ age: 1 }) // sparse index
[
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
<... (truncated)
MongoDB sits at the crossroads between SQL, where developers can write queries on a logical model without naming indexes, and the NoSQL world, where developers are accustomed to querying directly by mentioning indexes in their queries.
Except for Atlas Search and Vector indexes, where you must explicitly indicate whether it uses approximate search, you typically query a collection, and the MongoDB query planner determines which index to use.
One exception to this rule happens when using an optimizer hint to reference an index. This feature is supported in most popular databases, except PostgreSQL, where it requires pg_hint_plan. When the hint isn't fully compatible with the query planner's execution plans, different databases may behave differently.
Here is my test case to show how it behaves in MongoDB:
mdb> db.demo.insertMany([
{ _id: 1, name: "Alice", age: 30 }, // has age
{ _id: 2, name: "Bob" }, // missing age
{ _id: 3, name: "Charlie", age: null }, // age is null
{ _id: 4, name: "Dave", age: 35, status: "active" }, // age + status
{ _id: 5, name: "Eve", age: 28, status: "inactive" } // age + status
]);
{
acknowledged: true,
insertedIds: { '0': 1, '1': 2, '2': 3, '3': 4, '4': 5 }
}
mdb> db.demo.find().sort({ age: 1 });
[
{ _id: 2, name: 'Bob' },
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
]
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
);
[
{ name: 'Alice', age: 30 },
{ name: 'Dave', age: 35 },
{ name: 'Eve', age: 28 }
]
Without any index, the query scans the collection:
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 0,
totalDocsExamined: 5,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'COLLSCAN',
filter: { age: { '$gte': 28 } },
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
direction: 'forward',
docsExamined: 5
}
}
}
With an index, the same query does an index scan:
mdb> db.demo.createIndex({ age: 1 }, { name: "idx_age" })
idx_age
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 1,
totalKeysExamined: 3,
totalDocsExamined: 3,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 3,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
I create a better index that covers the projection, and it is used:
mdb> db.demo.createIndex({ age: 1, name: 1 }, { name: "idx_age_name" })
idx_age_name
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 1,
totalKeysExamined: 3,
totalDocsExamined: 0,
executionStages: {
isCached: false,
stage: 'PROJECTION_COVERED',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 5,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 5,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1, name: 1 },
indexName: 'idx_age_name',
isMultiKey: false,
multiKeyPaths: { age: [], name: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ], name: [ '[MinKey, MaxKey]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
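As an aside, the query is covered only because the projection excludes _id, which is not part of idx_age_name. A quick sanity check (a sketch, not part of the runs above) that I would expect to confirm this: keeping the default _id in the projection forces the documents to be fetched again.
mdb> db.demo.find(
 { age: { $gte: 28 } }, // WHERE age >= 28
 { age: 1, name: 1 } // _id is projected by default
).explain("executionStats").executionStats.totalDocsExamined
// expected: 3 (a FETCH stage is needed to return _id), versus 0 when covered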
I can decide to force the other index with a hint:
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 3,
totalDocsExamined: 3,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 3,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
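Note that hint() also accepts the index key pattern instead of the index name; with this schema, the following should be equivalent:
mdb> db.demo.find(
 { age: { $gte: 28 } }, // WHERE age >= 28
 { _id:0, age:1, name: 1 } // SELECT age, name
).hint({ age: 1 }) // resolves to the same index as .hint("idx_age") here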
A hint() is more than a hint here. It forces the index scan. I can even force its bounds with min() and max():
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").max({age:31}) // scan from begining to age: 30
[ { name: 'Eve', age: 28 }, { name: 'Alice', age: 30 } ]
Rather than the bounds defined by the filter ([ '[28, inf.0]' ]), I forced an index scan of the range [ '[-inf.0, 30]' ]. The filter { '$gte': 28 } was then applied, and the result contains only two documents, because Dave (age: 35) lies beyond the forced upper bound.
This is visible in the execution plan.
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").max({age:31}).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 0,
totalKeysExamined: 4,
totalDocsExamined: 4,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 2,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
filter: { age: { '$gte': 28 } },
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 2,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 4,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 4,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 4,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {},
keysExamined: 4,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
The bounds are not correctly displayed (indexBounds is empty), but the plan shows that 4 keys were examined: the 4 keys before {age:31}, namely the two null entries (Bob's missing age and Charlie's explicit null) plus 28 and 30.
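min() works the other way around, setting an inclusive lower bound for the scan. As a sketch on the same data (not executed above), starting the scan at {age: 30} would skip Eve even though she matches the filter:
mdb> db.demo.find(
 { age: { $gte: 28 } }, // WHERE age >= 28
 { _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").min({age:30}) // scan from age: 30 (inclusive) to the end
// expected: [ { name: 'Alice', age: 30 }, { name: 'Dave', age: 35 } ]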
The most important thing to remember is that hints in MongoDB replace the query planner's decision, and may therefore give different results. This is different from SQL hints, which only prune the query planner's possibilities.
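Another way to see how binding hint() is: hinting { $natural: 1 } bypasses all indexes and forces a full collection scan:
mdb> db.demo.find(
 { age: { $gte: 28 } } // WHERE age >= 28
).hint({ $natural: 1 }) // forces a COLLSCAN regardless of available indexes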
This is also visible with sparse and partial indexes:
// Named sparse index
db.demo.createIndex(
{ age: 1 },
{ sparse: true, name: "age_sparse" }
)
// Named partial index (only for status: 'active')
db.demo.createIndex(
{ age: 1 },
{
partialFilterExpression: { status: "active" },
name: "age_active_partial"
}
);
The result depends not only on the query, but also on the index definition. With this data, the sparse index skips Bob (who has no age field at all) but keeps Charlie's explicit null, while the partial index contains only Dave, the only document with status: "active":
mdb> db.demo.find(
).sort({ age: 1 }) // ORDER BY age NULLS FIRST
[
{ _id: 2, name: 'Bob' },
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
]
mdb> db.demo.find(
).hint("age_sparse").sort({ age: 1 }) // sparse index
[
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
<... (truncated)