July 24, 2025
Serializable Isolation for Snapshot Databases
This paper (SIGMOD '08) proposes a lightweight runtime technique to make Snapshot Isolation (SI) serializable without falling back to locking. The key idea behind Serializable SI (SSI) is to detect potentially dangerous (write-skew) executions at runtime and abort one of the transactions to guarantee serializability (SER).
The goal is to offer the strong guarantees of SER without sacrificing SI's high performance and non-blocking reads. But would it make sense to implement SER by layering on MVCC SI instead of implementing it directly? Do you think an SI-based implementation would be more performant than native 2PL-based SER implementations? What about compared to OCC-based SER? The evaluation section gives some answers.
The problem and insight
Let's back up.
Write-skew is the canonical anomaly under SI. And the canonical example for write-skew is the "doctors on-call" scenario. Consider two doctors removing themselves from a duty roster. Each transaction checks that at least one doctor remains on duty and proceeds to update its own status. Under SI, each transaction sees the old snapshot where the other doctor is still on duty. Both commit. The resulting state violates the application’s invariant, even though each transaction was locally consistent. Compare this to SER, which ensures that integrity constraints are maintained even if those constraints are not explicitly declared to the DBMS. (Another famous write-skew example is reading both the savings and checking account balances and then withdrawing from one. Yet another is graph coloring: reading a neighbor's color and switching your own in response.)
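To make the anomaly concrete, here is a minimal Python sketch of the doctors-on-call scenario under SI. The toy SnapshotStore and its API are my own illustration, not from the paper: each transaction reads from a private snapshot, and both commits succeed because the write sets are disjoint.

# Toy model of write skew under snapshot isolation (illustrative only).
class SnapshotStore:
    def __init__(self, data):
        self.committed = dict(data)

    def begin(self):
        # Each transaction reads from a private snapshot taken at its start.
        return dict(self.committed)

    def commit(self, writes):
        # SI's first-committer-wins rule only rejects write-write conflicts;
        # disjoint write sets always commit.
        self.committed.update(writes)

store = SnapshotStore({"alice_on_call": True, "bob_on_call": True})

snap1 = store.begin()  # T1: Alice wants to go off duty
snap2 = store.begin()  # T2: Bob wants to go off duty

# Each transaction checks the invariant against its own (stale) snapshot.
if sum(snap1.values()) >= 2:
    store.commit({"alice_on_call": False})
if sum(snap2.values()) >= 2:
    store.commit({"bob_on_call": False})

# Both commits succeed under SI, yet "at least one doctor on call" is violated.
print(store.committed)  # {'alice_on_call': False, 'bob_on_call': False}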
A naive fix for preventing write-skew would be to track the read set of a transaction T1 and, at commit time, check whether any item read by T1 was overwritten by a concurrent writer. If so, we abort T1. This eliminates outgoing rw-dependencies and thus prevents write-skew cycles. It works, but it's overkill: you end up aborting transactions even when no cycle would form. For example, in the execution "b1 r1(x) w2(x) w1(y) c1 c2", T1 would be aborted because its read r1(x) is overwritten by T2's w2(x), yet this execution can be serialized as T1 followed by T2, and the write sets do not conflict under SI.
The paper proposes a smarter fix: only abort when a transaction has both an inConflict and an outConflict edge with other concurrent transactions. That is, it sits in the middle of two concurrent rw-dependencies (the paper calls this the pivot of a dangerous structure). Prior work by the authors (Fekete et al. 2005) proves that every SI anomaly contains such a pivot, so this check suffices.
The approach still over-approximates somewhat: having both in- and out-conflicts doesn’t always imply a cycle. See Figure 9 for an example: although T0 has both inConflict and outConflict set, those transactions can be serialized in the order TN, T0, T1. The paper doesn't quantify the false positive rate or explore pathological workloads.
The paper says full serialization-graph cycle detection would be too expensive, and this is the compromise. It is still a tighter and more efficient condition than the naive approach. The Ferro-Yabandeh Eurosys '12 paper uses something close to the naive approach, as do naive OCC-based SER implementations: they abort based on a single outConflict edge.
Implementation and evaluation
The implementation tracks two flags per transaction: inConflict and outConflict. It also adds a new lock mode, SIREAD, to record reads. Reads acquire SIREAD locks; writes acquire WRITE locks. When a read by T1 finds that a concurrent T2 has created a newer version of the item, we set T1.outConflict and T2.inConflict. This is a rw-dependency (a.k.a. antidependency) from T1 to T2. When a write by T1 finds an existing SIREAD lock held by a concurrent T2, we set T1.inConflict and T2.outConflict. See the pseudocode in Figures 5-8 for how this is implemented.
When both flags are set for any transaction, one transaction in the conflict chain is aborted. The implementation prefers to abort the pivot (the transaction with both in and out conflicts), unless it has already committed. Else an adjacent transaction is aborted. If there are two pivots in the cycle, whichever is detected first gets aborted. This choice is sound and gives room for policy tuning (e.g., prefer aborting younger transactions).
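Here is a minimal Python sketch of this bookkeeping under a single-threaded toy model. The names (Txn, mark_rw_dependency) and the simplistic abort policy are mine, not the paper's pseudocode; the point is just the dangerous-structure check.

# Toy sketch of SSI's dangerous-structure check (not the paper's pseudocode).
class Txn:
    def __init__(self, name):
        self.name = name
        self.in_conflict = False   # some concurrent txn has a rw-dependency into me
        self.out_conflict = False  # I have a rw-dependency out to a concurrent txn
        self.aborted = False

def mark_rw_dependency(reader, writer):
    """Record a rw-antidependency reader -> writer; abort a pivot if one appears."""
    reader.out_conflict = True
    writer.in_conflict = True
    for t in (reader, writer):
        if t.in_conflict and t.out_conflict and not t.aborted:
            t.aborted = True       # t is a pivot of a dangerous structure
            return t
    return None

# Figure 4-style example: T0 reads an item that T1 overwrites (T0 -> T1),
# and TN reads an item that T0 overwrites (TN -> T0), making T0 a pivot.
t0, t1, tn = Txn("T0"), Txn("T1"), Txn("TN")
mark_rw_dependency(t0, t1)           # sets T0.outConflict, T1.inConflict
victim = mark_rw_dependency(tn, t0)  # sets TN.outConflict, T0.inConflict
print(victim.name)                   # T0 gets aborted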
Note that all of this is layered on top of SI. Readers never block writers. Writers never block readers. The only cost is the extra bookkeeping to track conflicts.
The authors implemented SSI over Berkeley DB (which provides SI through MVCC) with minimal changes, around 700 lines of modified code out of a total of over 200,000 lines of code in Berkeley DB. The experiments with the SmallBank benchmark show that the performance is close to plain SI and significantly better than 2PL under contention.
The paper does not contrast SSI with an OCC implementation of SER. But it argues that naive OCC implementations abort on any rw-conflict, whereas SSI waits until both in- and out-conflicts are present, which should result in fewer false positives and better throughput.
My take
SI is attractive because it avoids blocking: readers don't block writers and vice versa. Maybe the takeaway is this: Stick with SI. But if you really must go serializable, consider layering SSI on top of SI instead of using 2PL or OCC based SER implementations.
But, still, there are many caveats to SSI as presented in this paper: phantoms are not addressed, false positives remain, B-tree level conflicts inflate aborts, and adapting this to distributed systems is non-trivial.
This paper is a research prototype after all. A VLDB '12 paper by Dan Ports followed this algorithm to provide a PostgreSQL implementation, adding integration with real-world features: replication and crash recovery, two-phase commit (tupac), subtransactions and savepoints, and memory bounding via transaction summarization. It also handled phantoms by taking SIREAD locks on index ranges to detect predicate read/write conflicts.
Caveats and Open Questions
The paper punts on phantoms. It says that Berkeley DB uses page-level locking, so predicate-based anomalies don’t occur. But for engines with row-level locking, it suggests that SSI would need to acquire SIREAD locks on index ranges or use multigranularity locking.
The design assumes a single-node system. In a distributed DB, you would need to propagate SIREAD metadata and conflict flags across partitions. Detecting cross-partition dependencies and coordinating pivot aborts would be hard. The paper does not talk about implementation in distributed databases.
A known artifact in the Berkeley DB implementation is page-level locking. Updates to shared pages, like the B-tree root during a split, look like write-write conflicts across unrelated transactions. These inflate false positives and aborts under SSI.
My troubles with Figure 4
I mean I have to complain about this (otherwise it won't be a blog post from Murat). Figure 4 was terribly confusing for me. I had a hard time figuring out why these two scenarios are not serializable. It turns out I got confused by the "helpful text" saying: "In Figure 4(a), both reads occur after the writes. In Figure 4(b), TN reads x before it is written by T0." No, that is irrelevant: reads in SI happen from the snapshot taken at transaction start, so when the reads physically happen does not matter. Confused by this, I was wondering why we can't serialize the Figure 4(a) execution as T1, T0, TN.
Ok, let's focus on Figure 4(a). T0's serialization point should be before T1, because of the rw-dependency from T0 to T1: T0 reads an item that T1 overwrites. T1 needs to be serialized before TN, because TN reads z from T1's write. But TN is problematic. TN introduces a rw-dependency from TN to T0: it reads x from its snapshot, which is earlier than T0's commit point, so TN should come before T0. This conflicts with the two previous requirements that T0 precedes T1 and T1 precedes TN. We have a cycle that prevents serialization.
Figure 4(b) is very similar to Figure 4(a) and has the same cycle. The only difference is that TN's commit point comes a little later, but that doesn't change anything: TN still does its read from the snapshot taken at the start of TN.
It would have been nice to supplement the figure with text explaining the cycles and giving some insight into the thought process.
What It Takes to Get a Research Project Ready for Production
At CedarDB, we set out to bring the fruits of the highly successful Umbra research project to a wider audience. While Umbra has undoubtedly always had the potential to be the foundation of a highly performant production-grade database system, getting a research project ready for production workloads and building a company at the same time is no trivial task.
When we launched a year ago, we were still figuring out the differences between building a research system at university, and building a system for widespread use. Since then, we have learned a lot.
July 23, 2025
Lock-Free Wait-on-Conflict and Fail-on-Conflict in MongoDB
Concurrency control in databases has two main modes for handling conflicts. Traditionally, these are labeled "pessimistic" and "optimistic," but those names can be misleading. Let’s define them by what actually happens from a user point of view:
- Wait-on-Conflict: the operation blocks until the conflict is resolved, typically waiting for a concurrent transaction to either commit or abort and release its locks. For most Online Transaction Processing (OLTP) workloads, this process is usually quick, unless a long-running transaction is blocking progress or there is a deadlock in which two transactions are waiting on each other.
- Fail-on-Conflict: Instead of waiting, the operation immediately raises an error to the application. Now it’s up to the application to decide: retry after a short pause (hoping the conflict is gone), or just propagate the error. The application may also have done some non-transactional operations that need to be canceled or compensated.
In SQL databases, where multi-statement transactions are the norm because of normalization to multiple tables, you’ll see this as an explicit WAIT or NOWAIT on locking reads (LOCK, SELECT FOR UPDATE), or a consequence of the isolation level. READ COMMITTED tries to wait and detect deadlocks, SERIALIZABLE can fail with a serialization error.
MongoDB’s document-oriented approach is straightforward, providing developers the freedom to choose what best fits their needs.
- Single-document updates without an explicit transaction will wait on conflict, until the operation timeout. The document model encourages encapsulation of a business transaction within one document, keeping concurrency simple and fast.
- Multi-document operations within an explicit transaction, however, fail on conflict. Here, MongoDB expects your application to do the retry logic and error handling: since you started the transaction, you're responsible for finishing it.
The need to implement retry logic in the application is not a limitation of MongoDB. A database cannot transparently cancel a user-defined transaction and retry it because it lacks knowledge of the application's prior actions, like sending emails or writing to files. Only the application can cancel or compensate for actions taken before retrying the transaction.
In contrast to multi-statement transactions, single-statement transactions or auto-commit commands can be canceled and restarted before control is returned to the application. That's how MongoDB provides a lock-free wait-on-conflict behavior for single-document updates: internally it uses fail-on-conflict optimistic concurrency control (OCC) to avoid locks, but with internal retries so that it appears as wait-on-conflict to the user.
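As a sketch of the retry logic the application owns for explicit multi-document transactions, here is a hedged PyMongo example. The connection string, document ids, and the retry limit are placeholders; PyMongo also offers ClientSession.with_transaction(), which wraps a similar retry on transient errors for you.

# Application-side retry for a multi-document transaction (fail-on-conflict).
# Connection string, ids, and retry limit are placeholders for illustration.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://localhost:27017")
demo = client.test.demo

def transfer(session):
    # Two document updates that must commit atomically.
    demo.update_one({"_id": "x"}, {"$inc": {"value": 2}}, session=session)
    demo.update_one({"_id": "y"}, {"$inc": {"value": -2}}, session=session)

with client.start_session() as session:
    for attempt in range(5):                 # bounded retry; the limit is arbitrary
        try:
            with session.start_transaction():
                transfer(session)            # commits when the with-block exits cleanly
            break
        except PyMongoError as exc:
            # Write conflicts inside a transaction are labeled transient:
            # it is safe to retry the whole transaction from the start.
            if exc.has_error_label("TransientTransactionError"):
                continue
            raise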
Demonstration
Here’s what this looks like in practice. I start with a one-document collection:
AtlasLocalDev atlas [direct: primary] test>
db.demo.drop();
true
AtlasLocalDev atlas [direct: primary] test>
db.demo.insertOne({ _id: 'x' , value: 0 });
{ acknowledged: true, insertedId: 'x' }
I start a first transaction, in session A, that updates the document, but do not commit it yet:
//first transaction A updates one doc
AtlasLocalDev atlas [direct: primary] test>
sessionA = db.getMongo().startSession();
{ id: UUID('ac8f6741-4ebb-4145-acb2-6b861a390c25') }
AtlasLocalDev atlas [direct: primary] test> sessionA.startTransaction();
AtlasLocalDev atlas [direct: primary] test>
dbA = sessionA.getDatabase(db.getName());
test
AtlasLocalDev atlas [direct: primary] test>
dbA.demo.updateOne({ _id: 'x' }, { $inc: { value: 2 } });
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
Explicit transaction
While this transaction is ongoing, I start another one in session B and try to update the same document:
// second transaction B tries to update the same doc
AtlasLocalDev atlas [direct: primary] test>
sessionB = db.getMongo().startSession();
{ id: UUID('a5302cab-9688-43db-badd-e3691b30a15b') }
AtlasLocalDev atlas [direct: primary] test> sessionB.startTransaction();
AtlasLocalDev atlas [direct: primary] test>
dbB = sessionB.getDatabase(db.getName());
test
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 10:59:37 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test>
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 4 } });
MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 10:59:40 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test> sessionB.abortTransaction();
The conflict was immediately detected: the application gets a WriteConflict error and can retry.
Single-document
This time, I'll retry without starting an explicit transaction:
// do the same without starting a transaction
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 10:59:49 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test>
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 8 } });
The session blocks, waiting until the document is free. I leave the transaction in session A open to show what happens with long transactions.
After one minute, the update is successful:
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 11:00:48 GMT+0000 (Greenwich Mean Time)
My two transactions are not serializable, as they update the same document at the same time, so if one succeeds, the other must abort. This is what happened: the transaction in session A was automatically canceled after a one-minute timeout. I can see that if I try to commit:
AtlasLocalDev atlas [direct: primary] test>
sessionA.commitTransaction();
MongoServerError[NoSuchTransaction]: Transaction with { txnNumber: 2 } has been aborted.
AtlasLocalDev atlas [direct: primary] test>
sessionA.endSession();
Finally, the result is consistent, with the changes committed by session B:
AtlasLocalDev atlas [direct: primary] test>
db.demo.find();
[ { _id: 'x', value: 8 } ]
Internals
I looked at the call stack while it was waiting and here is the Flame Graph:
It shows the update attempts. updateWithDamages is not as terrible as it sounds: it is just an incremental update that logs the changes ("damages") rather than rewriting the whole document. logWriteConflictAndBackoff is the internal retry. Each write conflict in PlanExecutorExpress increments an attempt counter and calls logWriteConflictAndBackoff → logAndBackoff → logAndBackoffImpl with the number of attempts. The code shows that the sleep time depends on this count:
- For the first 4 attempts: No sleep, immediate retry.
- For attempts 4–9: Sleep 1 ms each retry.
- For attempts 10–99: Sleep 5 ms each retry.
- For attempts 100–199: Sleep 10 ms each retry.
- For 200 and greater: Sleep 100 ms each retry.
This mitigates resource contention by increasingly slowing down repeated conflicts with a stepped backoff, while still giving the operation a chance to proceed once the conflict is resolved. In one minute, we can estimate that it has performed approximately 4 + 6 + 90 + 100 + 585 = 785 write conflict retries under this backoff schedule, spending most of that time in the 100 ms sleep interval.
Internally, WiredTiger enforces snapshot isolation at the storage layer. When a MongoDB operation tries to modify a document, WiredTiger checks that the current snapshot timestamp still matches what’s on disk. If another operation has modified the document since that snapshot was taken, WiredTiger detects that the update’s view is stale. Instead of acquiring a long-term lock, WiredTiger uses optimistic concurrency control and returns WT_ROLLBACK to MongoDB’s server layer to signal the conflict.
At this point, the MongoDB server recognizes the error, aborts the in-progress write attempt, and releases any snapshot or locks it has held on behalf of the operation. It then yields to allow other work, starts a new WiredTiger transaction with a fresh snapshot timestamp, and retries the update operation.
After a successful update (with a clean, current snapshot and no conflict), or if a timeout is reached, the process ends, and MongoDB will either commit the change or return an error to the application.
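Putting the pieces together, here is a conceptual Python sketch of that loop: back off according to the attempt count, take a fresh snapshot, and retry. The function names and the 60-second deadline are mine; this models the behavior described above and is not MongoDB's source code.

import time

def backoff_ms(attempt):
    # Sleep per retry, following the schedule listed above (milliseconds).
    if attempt < 4:
        return 0
    if attempt < 10:
        return 1
    if attempt < 100:
        return 5
    if attempt < 200:
        return 10
    return 100

def run_with_conflict_retries(try_update, deadline_s=60.0):
    """Conceptual model of the server-side wait-on-conflict loop."""
    start = time.monotonic()
    attempt = 0
    while time.monotonic() - start < deadline_s:
        if try_update():                       # fresh snapshot, attempt the write
            return attempt                     # success; return conflicts seen
        attempt += 1                           # storage layer reported a write conflict
        time.sleep(backoff_ms(attempt) / 1000.0)
    raise TimeoutError("operation exceeded its time limit")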
Mongostat
I have run the wait-on-conflict situation again:
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 08:48:58 GMT+0000 (Greenwich Mean Time)
AtlasLocalDev atlas [direct: primary] test>
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 8 } });
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
AtlasLocalDev atlas [direct: primary] test>
Date();
Wed Jul 23 2025 08:50:18 GMT+0000 (Greenwich Mean Time)
During these 80 seconds, I displayed the writeConflicts metric with mongostat:
mongostat -o="insert,query,update,delete,metrics.operation.writeConflicts.rate(),time"
insert query update delete writeConflicts.rate() time
*0 *0 *0 *0 0 Jul 23 08:49:01.172
*0 *0 *0 *0 0 Jul 23 08:49:02.173
*0 *0 *0 *0 0 Jul 23 08:49:03.172
*0 *0 *0 *0 0 Jul 23 08:49:04.172
*0 *0 *0 *0 0 Jul 23 08:49:05.172
...
*0 *0 *0 *0 0 Jul 23 08:50:15.172
*0 *0 *0 *0 0 Jul 23 08:50:16.171
*0 *0 *0 *0 0 Jul 23 08:50:17.172
*0 *0 *0 *0 981 Jul 23 08:50:18.172
*0 *0 *0 *0 0 Jul 23 08:50:19.172
*0 *0 *0 *0 0 Jul 23 08:50:20.172
It shows 981 write conflict retries during those 80 seconds. Early retries go fast, adding just a few milliseconds. As time goes on, the loop spends the vast majority of those 80 seconds in the late-stage sleep(100ms) phase, pushing the retry count close to elapsed_time / 100ms (plus the early, faster backoffs).
Here is the exact calculation:
- First 4: 0ms
- Attempts 4–9 (6): 6×1=6ms
- Attempts 10–99 (90): 90×5=450ms
- Attempts 100–199 (100): 100×10=1,000ms
- Attempts 200–980 (781): 781×100=78,100ms
The sleep time adds up to about 79.6 seconds for 981 write conflicts, which matches the 80-second elapsed window and the backoff algorithm.
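The same arithmetic in a few lines of Python, with the buckets taken from the list above:

# Back-of-envelope check for 981 write conflicts under the backoff schedule.
buckets = [            # (number of attempts, sleep per attempt in ms)
    (4, 0),            # first 4 attempts: immediate retry
    (6, 1),            # attempts 4-9
    (90, 5),           # attempts 10-99
    (100, 10),         # attempts 100-199
    (981 - 200, 100),  # attempts 200-980
]
total_ms = sum(n * ms for n, ms in buckets)
print(total_ms / 1000)  # ~79.6 seconds of sleep, close to the 80 s elapsed time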
Note that with mongostat you won't see individual retries until the moment the operation finally succeeds and increments the write conflict counter. If you see large spikes, it's time to look at data modeling patterns or long-running operations.
Profiling
Another way to look at it is with the MongoDB profiler:
db.setProfilingLevel(2);
Date();
dbB.demo.updateOne({ _id: 'x' }, { $inc: { value: 8 } });
Date();
db.setProfilingLevel(0);
db.system.profile.find().sort({ ts: -1 }).limit(1).pretty();
Here is the output:
[
{
op: 'update',
ns: 'test.demo',
command: {
q: { _id: 'x' },
u: { '$inc': { value: 8 } },
multi: false,
upsert: false
},
keysExamined: 1014,
docsExamined: 1014,
nMatched: 1,
nModified: 1,
nUpserted: 0,
keysInserted: 0,
keysDeleted: 0,
writeConflicts: 1013,
numYield: 1013,
locks: {
ReplicationStateTransition: { acquireCount: { w: Long('1015') } },
Global: { acquireCount: { r: Long('1'), w: Long('1015') } },
Database: { acquireCount: { w: Long('1015') } },
Collection: { acquireCount: { w: Long('1015') } }
},
flowControl: { acquireCount: Long('1014') },
readConcern: { level: 'local', provenance: 'implicitDefault' },
storage: { data: { txnBytesDirty: Long('932') } },
cpuNanos: 145071200,
millis: 82965,
planSummary: 'EXPRESS_IXSCAN { _id: 1 },EXPRESS_UPDATE',
planningTimeMicros: 71,
totalOplogSlotDurationMicros: 135,
execStats: {
isCached: false,
stage: 'EXPRESS_UPDATE',
keyPattern: '{ _id: 1 }',
indexName: '_id_',
keysExamined: 1014,
docsExamined: 1014,
nReturned: 0,
nWouldModify: 1,
nWouldUpsert: 0,
nWouldDelete: 0
},
ts: ISODate('2025-07-23T17:18:48.082Z'),
client: '172.17.0.1',
appName: 'mongosh 2.5.0',
allUsers: [],
user: ''
}
]
The single update of one document (nModified: 1) was retried 1,014 times (keysExamined: 1014, docsExamined: 1014), as the first 1,013 attempts conflicted with a concurrent transaction (writeConflicts: 1013). The entire process took about 83 seconds (millis: 82965), but resource usage remained minimal: each retry yielded control (numYield: 1013), so CPU time used was less than 0.2% (cpuNanos: 145071200). This conflict resolution uses optimistic concurrency control and is lock-free while waiting. MongoDB only briefly acquires write intent locks for each retry (acquireCount: { w: Long('1015') }), never blocking other work for the whole time.
Summary
In MongoDB, single-document operations do not require explicit transactions and utilize a "wait-on-conflict" method, allowing for seamless, automatic conflict handling. This is achieved through optimistic concurrency control, where MongoDB retries updates on conflict with exponential backoff and operates in a lock-free manner while waiting. With a proper document model, single-document operations are the general case, as there's no need to normalize to multiple tables like in SQL databases.
In contrast, multi-document operations that involve explicit transactions adopt a “fail-on-conflict” strategy. When a conflict arises, MongoDB immediately returns an error, placing the responsibility of canceling and retrying the transaction on your application, since the transaction may involve operations that are not visible to the database. Such transactions must be kept short, and are canceled if they take longer than one minute by default. The reason for this limit is that, as in other MVCC databases, long transactions consume more resources to maintain a consistent snapshot of the database state from the start of the transaction.
The Quirks of Index Maintenance in Open Source Databases
July 22, 2025
.hint() in MongoDB is different from SQL optimizer hints: it forces an index
MongoDB sits at the crossroads between SQL, where developers can write queries on a logical model without naming indexes, and the NoSQL world, where developers are accustomed to querying directly by mentioning indexes in their queries.
Except for Atlas Search and Vector indexes, where you must explicitly indicate whether to use approximate search, you typically query a collection and the MongoDB query planner determines which index to use.
One exception to this rule happens when using an optimizer hint to reference an index. This feature is supported in most popular databases, except PostgreSQL, where it requires pg_hint_plan. When the hint isn't fully compatible with the query planner's execution plans, different databases may behave differently.
Here is my test case to show how it behaves in MongoDB:
mdb> db.demo.insertMany([
{ _id: 1, name: "Alice", age: 30 }, // has age
{ _id: 2, name: "Bob" }, // missing age
{ _id: 3, name: "Charlie", age: null }, // age is null
{ _id: 4, name: "Dave", age: 35, status: "active" }, // age + status
{ _id: 5, name: "Eve", age: 28, status: "inactive" } // age + status
]);
{
acknowledged: true,
insertedIds: { '0': 1, '1': 2, '2': 3, '3': 4, '4': 5 }
}
mdb> db.demo.find().sort({ age: 1 });
[
{ _id: 2, name: 'Bob' },
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
]
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
);
[
{ name: 'Alice', age: 30 },
{ name: 'Dave', age: 35 },
{ name: 'Eve', age: 28 }
]
Without any index, the query scans the collection:
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 0,
totalDocsExamined: 5,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'COLLSCAN',
filter: { age: { '$gte': 28 } },
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
direction: 'forward',
docsExamined: 5
}
}
}
With an index, the same query does an index scan:
mdb> db.demo.createIndex({ age: 1 }, { name: "idx_age" })
idx_age
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 1,
totalKeysExamined: 3,
totalDocsExamined: 3,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 3,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
I create a better index, that covers the projection, and it is used:
mdb> db.demo.createIndex({ age: 1, name: 1 }, { name: "idx_age_name" })
idx_age_name
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 1,
totalKeysExamined: 3,
totalDocsExamined: 0,
executionStages: {
isCached: false,
stage: 'PROJECTION_COVERED',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 5,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 1,
works: 5,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1, name: 1 },
indexName: 'idx_age_name',
isMultiKey: false,
multiKeyPaths: { age: [], name: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ], name: [ '[MinKey, MaxKey]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
I can decide to force the other index with a hint:
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 3,
totalDocsExamined: 3,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 3,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 4,
advanced: 3,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [ '[28, inf.0]' ] },
keysExamined: 3,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
A hint() is more than a hint here. It forces the index scan. I can even force its bounds with min() and max():
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").max({age:31}) // scan from beginning to age: 30
[ { name: 'Eve', age: 28 }, { name: 'Alice', age: 30 } ]
Rather than the bounds defined by the filter ([ '[28, inf.0]' ]), I forced an index scan of the range [ '[-inf.0, 30]' ]. Then the filter { '$gte': 28 } is applied and the result has only two documents.
This is visible in the execution plan.
mdb> db.demo.find(
{ age: { $gte: 28 } }, // WHERE age >= 28
{ _id:0, age:1, name: 1 } // SELECT age, name
).hint("idx_age").max({age:31}).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 0,
totalKeysExamined: 4,
totalDocsExamined: 4,
executionStages: {
isCached: false,
stage: 'PROJECTION_SIMPLE',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 2,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
transformBy: { _id: 0, age: 1, name: 1 },
inputStage: {
stage: 'FETCH',
filter: { age: { '$gte': 28 } },
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 2,
needTime: 2,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 4,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 4,
executionTimeMillisEstimate: 0,
works: 5,
advanced: 4,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { age: 1 },
indexName: 'idx_age',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {},
keysExamined: 4,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
The bounds are not correctly displayed, but it shows that 4 keys were examined, the 4 keys before {age:31}.
The most important thing to remember is that hints in MongoDB replace the query planner's decision, and may therefore give different results. This is different from SQL hints, which only prune the query planner's possibilities.
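The same forced access path is available from the drivers. Here is a hedged PyMongo sketch; the connection string is a placeholder, and in recent driver versions Cursor.max() must be paired with a hint():

# Forcing an index and an upper scan bound from PyMongo (connection is a placeholder).
from pymongo import MongoClient

demo = MongoClient("mongodb://localhost:27017").test.demo

cursor = (
    demo.find({"age": {"$gte": 28}}, {"_id": 0, "age": 1, "name": 1})
        .hint("idx_age")      # bypass the planner: always scan idx_age
        .max([("age", 31)])   # override the scan's upper bound, like max() in mongosh
)
print(list(cursor))           # only the documents matching the filter within the bound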
This is also visible with sparse and partial indexes:
// Named sparse index
db.demo.createIndex(
{ age: 1 },
{ sparse: true, name: "age_sparse" }
)
// Named partial index (only for status: 'active')
db.demo.createIndex(
{ age: 1 },
{
partialFilterExpression: { status: "active" },
name: "age_active_partial"
}
);
The result depends not only on the query, but also the index definition:
mdb> db.demo.find(
).sort({ age: 1 }) // ORDER BY age NULLS FIRST
[
{ _id: 2, name: 'Bob' },
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
]
mdb> db.demo.find(
).hint("age_sparse").sort({ age: 1 }) // sparse index
[
{ _id: 3, name: 'Charlie', age: null },
{ _id: 5, name: 'Eve', age: 28, status: 'inactive' },
{ _id: 1, name: 'Alice', age: 30 },
{ _id: 4, name: 'Dave', age: 35, status: 'active' }
<... (truncated)
Diagnosing MySQL Crashes on RHEL with GDB: How to Identify the Database, Table, and Query Involved
PSA: SQLite WAL checksums fail silently and may lose data
July 21, 2025
Using replicaSetHorizons in MongoDB
Database latency with PostgreSQL and MongoDB: it's the data model that makes it fast
A business transaction ideally involves a single roundtrip to the database. MongoDB allows a single document to hold all transaction data, simplifying sharding and scaling. In contrast, SQL normalization spreads data across multiple tables, necessitating multi-statement transactions and multiple roundtrips, unless using stored procedures, which are often less preferred due to language and deployment differences from application code. This results in higher latency when the application is closer to the user and farther from the database.
PostgreSQL transaction with pgbench
I start a lab with a PostgreSQL database server running in the background and a container for the client application:
docker rm -f app db
docker run --name db -d -e POSTGRES_PASSWORD=x postgres
docker run --link db:db --rm -it --privileged postgres bash
I add a 50 millisecond latency from the application container to simulate a deployment where the application is in an edge location:
apt-get update && apt-get install -y iproute2
tc qdisc add dev eth0 root netem delay 50ms
I initialize and run pgbench with the default TPC-B-like workload, displaying the per-statement latencies:
pgbench -i postgres://postgres:x@db
pgbench -r postgres://postgres:x@db
Here is the output:
root@e18764bd2f77:/# pgbench -i postgres://postgres:x@db
dropping old tables...
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 1.75 s (drop tables 0.05 s, create tables 0.21 s, client-side generate 0.94 s, vacuum 0.24 s, primary keys 0.31 s).
root@e18764bd2f77:/# pgbench -r postgres://postgres:x@db
pgbench (17.5 (Debian 17.5-1.pgdg120+1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
number of failed transactions: 0 (0.000%)
latency average = 353.092 ms
initial connection time = 263.115 ms
tps = 2.832121 (without initial connection time)
statement latencies in milliseconds and failures:
0.004 0 \set aid random(1, 100000 * :scale)
0.001 0 \set bid random(1, 1 * :scale)
0.001 0 \set tid random(1, 10 * :scale)
0.001 0 \set delta random(-5000, 5000)
50.226 0 BEGIN;
50.470 0 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
50.344 0 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
50.350 0 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
50.416 0 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
50.292 0 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
50.981 0 END;
This has run ten transactions from one client. With the artificial 50 ms latency, the connection took 263 ms and each transaction 353 ms on average. That is 7 times the network roundtrip latency, and the reason is visible thanks to the -r option showing the per-statement response times: 7 statements are sent to start the transaction, run the DML, and commit.
The default PgBench script is:
root@e18764bd2f77:/# pgbench --show-script=tpcb-like
-- tpcb-like: <builtin: TPC-B (sort of)>
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
In PostgreSQL, BEGIN; is like START TRANSACTION; and initiates a transaction, disabling auto-commit. END; is equivalent to COMMIT; and commits the current transaction.
This method is inefficient due to multiple network roundtrips between the application and database, which increase latency. Each client-server roundtrip is a system call, which increases CPU usage due to context switching, and leaves the database backend in the worst state for performance and scalability: idle yet holding locks. All operations for a business transaction should ideally be sent in a single request, either as an anonymous block or a stored procedure, to minimize roundtrips, reduce network overhead, and leave the connection stateless.
PostgreSQL single auto-commit call
I can run the same in one block with a DO command in PostgreSQL:
DO '
begin
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
PERFORM abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
end;
';
Note that begin and end have different meanings here, marking the boundaries of a PL/pgSQL block. Without explicitly starting a transaction, the call operates in autocommit mode. One transaction per call is the default in PostgreSQL.
I changed SELECT to PERFORM because such a call cannot return a result. In practice, you would deploy this as a stored procedure, with exception handling and all business logic in it, and that's the reason it is rarely used in modern applications. I use the DO block to show that this requires a single roundtrip to the database:
root@e18764bd2f77:/# pgbench --show-script=tpcb-like 2>&1 \
| sed \
-e "s/^BEGIN;/DO ' begin/" \
-e "s/^SELECT/PERFORM/" \
-e "s/^END;/end; ';/" \
| pgbench -r -f /dev/stdin postgres://postgres:x@db
pgbench (17.5 (Debian 17.5-1.pgdg120+1))
starting vacuum...end.
transaction type: /dev/stdin
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
number of failed transactions: 0 (0.000%)
latency average = 51.763 ms
initial connection time = 263.673 ms
tps = 19.318893 (without initial connection time)
statement latencies in milliseconds and failures:
0.002 0 \set aid random(1, 100000 * :scale)
0.001 0 \set bid random(1, 1 * :scale)
0.000 0 \set tid random(1, 10 * :scale)
0.000 0 \set delta random(-5000, 5000)
51.755 0 DO ' begin
The transaction completed in about 52 ms, essentially a single roundtrip. It was started and committed on the database side, which is why all business logic and error handling must occur there. Using stored procedures adds complexity to development, testing, and deployment. A DO block doesn't have to be deployed to the database first, but it complicates returning information to the caller. Both stored procedures and DO blocks are sent as PL/pgSQL code in a character string, interpreted at runtime, which poses a risk of runtime errors. Developers prefer to keep code in their application, using their language of choice, ensuring it is compiled, tested, packaged, and deployed consistently.
Multi-statement transactions in SQL databases struggle to scale due to increased client-server roundtrips when all logic resides in the applications. Normalization was developed when applications were deployed on database servers, utilizing embedded SQL or stored procedures. This allowed transactions to execute multiple statements and acquire locks using two-phase locking, without waiting in between. However, with the rise of client-server and three-tier architectures, this didn't scale.
Document databases utilize a different data modeling strategy, where a single document can contain all relevant transaction information. This approach allows the business logic to reside in the application code, enabling an entire business transaction to fit into a single atomic call.
MongoDB multi-document transaction
I ran the same workload on MongoDB, using one collection per table and a multi-document transaction. Since the issue lies in the data model rather than the database engine, the response time remains unchanged. Document databases only demonstrate their advantages when the document model aligns with business transactions. Normalization undermines this alignment. This is why benchmarks from PostgreSQL vendors can be misleading: they apply a normalized model to a database built for unnormalized models.
Still, I've done it to prove the point, doing the same as pgbench from mongosh.
I start a lab with a MongoDB database server running in the background and a container for the client application:
docker rm -f app db
docker run --name db --hostname db -d mongo mongod --replSet rs0
docker run --link db:db --rm -it --privileged mongo bash
I add a 50 millisecond latency from the application container:
apt-get update && apt-get install -y iproute2
tc qdisc add dev eth0 root netem delay 50ms
I define it as a single node replica set and start MongoDB shell:
mongosh "mongodb://db" --eval '
rs.initiate( {_id: "rs0", members: [
{_id: 0, priority: 1, host: "db:27017"},
]});
'
mongosh "mongodb://db?replicaSet=rs0"
Here is my equivalent to pgbench -i:
// mongosh equivalent to pgbench -i
db.accounts.drop(); db.branches.drop(); db.tellers.drop(); db.history.drop();
db.branches.insertOne({ _id: 1, bbalance: 0 });
let tellerDocs = []; for (let i = 1; i <= 10; ++i) {
tellerDocs.push({ _id: i, bid: 1, tbalance: 0 });
} ;
db.tellers.insertMany(tellerDocs);
const nAccounts = 100000; const bulk = db.accounts.initializeUnorderedBulkOp();
for (let i = 1; i <= nAccounts; ++i) {
bulk.insert({ _id: i, bid: 1, abalance: 0 });
if (i % 10000 === 0) print(`inserted ${i} accounts`);
} ;
bulk.execute();
This has created three collections and initialized them with data; "history" is empty and will be created when first used:
rs0 [primary] test> show collections
accounts
branches
tellers
Here is my equivalent to pgbench with all default options:
// mongosh equivalent to pgbench with all default options
// Measure connection time
let t0 = Date.now();
let session = db.getMongo().startSession()
let sess_db = session.getDatabase(db.getName());
let connTime = Date.now() - t0;
// Run 10 transactions
let fail = 0;
let totalTime = 0;
let nTx = 10
for (let i = 0; i < nTx; ++i) {
let t1 = Date.now();
let aid = Math.floor(Math.random() * (100000)) + 1;
let bid = Math.floor(Math.random() * (1)) + 1;
let tid = Math.floor(Math.random() * (10)) + 1;
let delta = Math.floor(Math.random() * 10001) - 5000;
// BEGIN;
session.startTransaction();
// UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
sess_db.accounts.updateOne({_id: aid}, {$inc: {abalance: delta}});
// SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
let acc = sess_db.accounts.findOne({_id: aid}, {abalance: 1, _id: 0});
// UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
sess_db.tellers.updateOne({_id: tid}, {$inc: {tbalance: delta}});
// UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
sess_db.branches.updateOne({_id: bid}, {$inc: {bbalance: delta}});
// INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
sess_db.history.insertOne({tid, bid, aid, delta, mtime: new Date()});
// END;
session.commitTransaction();
totalTime += Date.now() - t1;
}
session.endSession();
// display timings
const avgLat = totalTime / nTx;
const tps = nTx / (totalTime / 1000);
print("latency average =", avgLat.toFixed(3), "ms");
print("initial connection time =", connTime.toFixed(3), "ms");
print("tps =", tps.toFixed(6), "(without initial connection time)");
Here is the output:
...
... print("latency average =", avgLat.toFixed(3), "ms");
... print("initial connection time =", connTime.toFixed(3), "ms");
... print("tps =", tps.toFixed(6), "(without initial connection time)");
latency average = 319.100 ms
initial connection time = 1.000 ms
tps = 3.133814 (without initial connection time)
One difference with PostgreSQL is that starting a transaction doesn't need a roundtrip to the database in MongoDB, saving 50 ms of latency in this lab. Each statement takes about 2 ms more, but that's not directly comparable because PostgreSQL defers a lot of work to vacuum, and the consequence is not visible when running only ten transactions. MongoDB throughput is a bit higher than PostgreSQL for multi-table/collection transactions, but not tremendously so, because the real benefit comes from the document model, which is not used here.
I check that the ten transactions have been recorded in "history":
rs0 [primary] test> db.history.find().sort({mtime:1}).forEach(
doc => print(JSON.stringify(doc))
);
{"_id":"687c0863fafd81ddbcbaa8b9","tid":9,"bid":1,"aid":79275,"delta":2113,"mtime":"2025-07-19T21:04:35.438Z"}
{"_id":"687c0863fafd81ddbcbaa8ba","tid":10,"bid":1,"aid":12931,"delta":-5,"mtime":"2025-07-19T21:04:35.767Z"}
{"_id":"687c0864fafd81ddbcbaa8bb","tid":7,"bid":1,"aid":73292,"delta":-2319,"mtime":"2025-07-19T21:04:36.084Z"}
{"_id":"687c0864fafd81ddbcbaa8bc","tid":2,"bid":1,"aid":74453,"delta":-2909,"mtime":"2025-07-19T21:04:36.402Z"}
{"_id":"687c0864fafd81ddbcbaa8bd","tid":8,"bid":1,"aid":25159,"delta":-1522,"mtime":"2025-07-19T21:04:36.721Z"}
{"_id":"687c0865fafd81ddbcbaa8be","tid":5,"bid":1,"aid":21455,"delta":-2985,"mtime":"2025-07-19T21:04:37.036Z"}
{"_id":"687c0865fafd81ddbcbaa8bf","tid":8,"bid":1,"aid":66059,"delta":328,"mtime":"2025-07-19T21:04:37.353Z"}
{"_id":"687c0865fafd81ddbcbaa8c0","tid":8,"bid":1,"aid":58666,"delta":-4803,"mtime":"2025-07-19T21:04:37.668Z"}
{"_id":"687c0865fafd81ddbcbaa8c1","tid":1,"bid":1,"aid":99695,"delta":-4717,"mtime":"2025-07-19T21:04:37.987Z"}
{"_id":"687c0866fafd81ddbcbaa8c2","tid":9,"bid":1,"aid":15122,"delta":-20,"mtime":"2025-07-19T21:04:38.304Z"}
My business transactions, including deposits and withdrawals, are fully recorded in this collection. In contrast, other collections only maintain the current balance to avoid aggregating all historical operations. While this approach is valid, should the client application, which is close to the user and awaits completion, be responsible for such optimization?
MongoDB single-document transaction (auto-commit)
In MongoDB, achieving the same result in a single call is done not through interpreted procedural code or stored procedures, but by employing a proper document design. The TPC-B-like benchmark records a transaction that modifies an account balance and updates summaries per teller and per branch. This workload was designed to stress the database in a non-scalable manner: the teller and branch summaries are hotspots.
In a proper application, the business transaction is recorded in the "history" collection. Summaries can be updated asynchronously by applying the transaction information, and a view can do the same in real-time if there's a need to see the current summary before it is applied to the account, teller, or branch collections. In this case, the workload on which the latency must be measured is a single-document insert into "history", with an additional field to flag what is applied to summaries.
Here is the code which records transactions in one atomic call to the database service:
let t0 = Date.now();
// No explicit session needed
let connTime = Date.now() - t0;
let totalTime = 0;
let nTx = 10;
for (let i = 0; i < nTx; ++i) {
let t1 = Date.now();
let aid = Math.floor(Math.random() * 100000) + 1;
let bid = 1; // for scale 1
let tid = Math.floor(Math.random() * 10) + 1;
let delta = Math.floor(Math.random() * 10001) - 5000;
db.history.insertOne({
tid: tid,
bid: bid,
aid: aid,
delta: delta,
mtime: new Date(),
to_apply: true // pending for background applier
});
t... (truncated)
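The flip side of this design is that the summaries must eventually be brought up to date. Here is a minimal sketch of the asynchronous applier described above; the collection and field names come from this lab, but the loop, its one-document-at-a-time batching, and the absence of retry handling are my own simplifications, not the original benchmark code:
// Sketch of a background applier: drain pending history documents and
// apply their deltas to the summary collections, one transaction per document.
const applierSession = db.getMongo().startSession();
const adb = applierSession.getDatabase(db.getName());
let pending;
while ((pending = adb.history.findOne({ to_apply: true }))) {
  applierSession.startTransaction();
  adb.accounts.updateOne({ _id: pending.aid }, { $inc: { abalance: pending.delta } });
  adb.tellers.updateOne({ _id: pending.tid }, { $inc: { tbalance: pending.delta } });
  adb.branches.updateOne({ _id: pending.bid }, { $inc: { bbalance: pending.delta } });
  // flag the history document as applied so it is not picked up again
  adb.history.updateOne({ _id: pending._id }, { $set: { to_apply: false } });
  applierSession.commitTransaction();
}
applierSession.endSession();
Because this can run close to the database, or as a scheduled trigger, its roundtrips add no latency to the user-facing call, which stays a single insert.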
July 20, 2025
Rickrolling Turso DB (SQLite rewrite in Rust)
$isArray: [💬,💬,💬] ❌ - Arrays are Argument Lists in MongoDB Aggregation Pipeline
If you test $isArray: [42] in MongoDB, it returns false and that's the right answer. If you test $isArray: [42,42,42] you get an error telling you that Expression $isArray takes exactly 1 arguments, but 3 were passed in. Documentation about $isArray explains that the input is interpreted as multiple arguments rather than a single argument that is an array:
Aggregation expressions accept a variable number of arguments. These arguments are normally passed as an array. However, when the argument is a single value, you can simplify your code by passing the argument directly without wrapping it in an array.
The documentation about the Expression Operator clarifies that when passing a single argument that is an array, you must use $literal or wrap the array in an additional array that will be interpreted by the language.
Here are a few reproducible examples (also available here) to explain it.
Expression Operator with literals
I use a collection with one document to run my expressions in an aggregation pipeline:
db.dual.insertOne( { "dummy": "x" } )
{
acknowledged: true,
insertedId: ObjectId('687d09913b111f172ebaa8ba')
}
$isNumber takes one argument and returns true if it is a number:
db.dual.aggregate([
{
$project: {
" is 42 a number?": {
$isNumber: [ 42 ] // one argument -> number
}, "_id": 0
}
}
])
[ { ' is 42 a number?': true } ]
The array that you see in $isNumber: [ 42 ] is interpreted as a list of arguments for the expression operator. Text-based languages would use isNumber(42), but the MongoDB query language is structured as BSON to better integrate with application languages and be easily parsed by the drivers.
Trying to pass two arguments raises an error:
db.dual.aggregate([
{
$project: {
" is 42 a number?": {
$isNumber: [ 42 , 42 ] // two arguments -> error
}, "_id": 0
}
}
])
MongoServerError[Location16020]: Invalid $project :: caused by :: Expression $isNumber takes exactly 1 arguments. 2 were passed in.
For expression operators that take one argument, you can pass it as a value rather than as an array, but this is just a syntactic shortcut and doesn't change the data type of the argument:
db.dual.aggregate([
{
$project: {
" is 42 a number?": {
$isNumber: 42 // one argument -> number
}, "_id": 0
}
}
])
[ { ' is 42 a number?': true } ]
$isArray takes a single argument and returns true if that argument is an array. However, in the examples above, the array syntax is used to structure the list of arguments rather than define a literal array data type:
db.dual.aggregate([
{
$project: {
" is 42 an array?": {
$isArray: [ 42 ] // one argument -> number
}, "_id": 0
}
}
])
[ { ' is 42 an array?': false } ]
db.dual.aggregate([
{
$project: {
" is 42 an array?": {
$isArray: [ 42 , 42 ] // two arguments -> error
}, "_id": 0
}
}
])
MongoServerError[Location16020]: Invalid $project :: caused by :: Expression $isArray takes exactly 1 arguments. 2 were passed in.
db.dual.aggregate([
{
$project: {
" is 42 an array?": {
$isArray: 42 // one argument -> number
}, "_id": 0
}
}
])
[ { ' is 42 an array?': false } ]
If you want to pass an array as an argument, you must nest the data array inside another array, which structures the list of arguments:
db.dual.aggregate([
{
$project: {
" is 42 an array?": {
$isArray: [ [ 42 ] ] // one argument -> array
}, "_id": 0
}
}
])
[ { ' is 42 an array?': true } ]
db.dual.aggregate([
{
$project: {
" is 42 a number?": {
$isNumber: [ [ 42 ] ] // one argument -> array
}, "_id": 0
}
}
])
[ { ' is 42 a number?': false } ]
Another possibility is using $literal, which doesn't take a list of arguments, but rather avoids parsing an array as a list of arguments:
db.dual.aggregate([
{
$project: {
" is 42 an array?": {
$isArray: { $literal: [ 42 ] } // one argument -> array
}, "_id": 0
}
}
])
[ { ' is 42 an array?': true } ]
db.dual.aggregate([
{
$project: {
" is 42 a number?": {
$isNumber: { $literal: [ 42 ] } // one argument -> array
}, "_id": 0
}
}
])
[ { ' is 42 a number?': false } ]
This can be confusing. I wrote this while answering a question on Bluesky:
#MongoDB seriously, why? $isNumber: 5 → true $isArray: [] → false $isArrau: [[5]] → true
— Ivan “CLOVIS” Canet (@ivcanet.bsky.social) 2025-07-08T18:54:24.735Z
All languages exhibit certain idiosyncrasies where specific elements must be escaped to be interpreted literally rather than as language tokens. For instance, in PostgreSQL, a literal array must follow a specific syntax:
postgres=> SELECT pg_typeof(42) AS "is 42 a number?";
is 42 a number?
-----------------
integer
postgres=> SELECT pg_typeof(42, 42);
ERROR: function pg_typeof(integer, integer) does not exist
LINE 1: SELECT pg_typeof(42, 42);
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
postgres=> SELECT pg_typeof(ARRAY[42]);
pg_typeof
-----------
integer[]
postgres=> SELECT pg_typeof('{42}'::int[]);
pg_typeof
-----------
integer[]
main=> SELECT pg_typeof('{''42'',''42''}'::text[]);
pg_typeof
-----------
text[]
Additionally, in SQL, quoting is what tells the parser whether something is data or part of the language syntax. In PostgreSQL the ambiguity is between text and array; in MongoDB it is between an argument list and an array.
You typically shouldn't encounter issues with MongoDB in common situations because expression operators are designed for expressions rather than literals. For instance, you don't need to call the database to know that 42 is a number and [] is an array. If this was generated by a framework, it likely uses $literal for clarity. If it is coded for expressions, there's no ambiguity.
Expression Operator with expressions
I use a collection containing one document with a field that is an array:
db.arrays.insertOne( { "field": [42] } )
{
acknowledged: true,
insertedId: ObjectId('687d1c413e05815622d4b0c2')
}
I can pass the "$field" expression argument to the operator and it remains an array:
db.arrays.aggregate([
{
$project: {
field: 1,
" is $field a number?": {
$isNumber: "$field" // one argument -> expression
},
" is $field an array?": {
$isArray: "$field" // one argument -> expression
},
"_id": 0
}
}
])
[
{
field: [ 42 ],
' is $field a number?': false,
' is $field an array?': true
}
]
In this case, using the one argument shortcut ("$field") or an array (["$field"]) doesn't matter because there's a clear distinction between language elements and data.
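To make that explicit, here is the same projection with the one-element argument list instead of the shortcut; since "$field" is an expression, wrapping it in the argument array changes nothing:
db.arrays.aggregate([
  {
    $project: {
      " is $field an array?": {
        $isArray: [ "$field" ] // one argument -> expression, same result as the shortcut
      }, "_id": 0
    }
  }
])
[ { ' is $field an array?': true } ]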
In short, with expressions that return arrays when executed, there's no problem, and for array literals, use $literal.
July 18, 2025
Sequences in MongoDB
In a previous post about No-gap sequence in PostgreSQL and YugabyteDB, I mentioned that sequences in SQL databases are not transactional, which can lead to gaps. MongoDB does not require a special sequence object: a collection can be used instead, thanks to the incremental update operator ($inc) and to findOneAndUpdate(), which returns the updated value in the same atomic operation.
Here is an example. I create a "sequences" collection to hold sequence numbers and a "demo" collection to insert values with an auto-incremented identifier:
db.createCollection("demo");
db.createCollection("sequences");
An insert can simply fetch the next value while incrementing it:
db.demo.insertOne({
_id: db.sequences.findOneAndUpdate(
{ _id: "demo_id" },
{ $inc: { seq: 1 } },
{ upsert: true, returnDocument: "after" }
),
value: "I'm the first"
});
db.demo.find();
[ { _id: { _id: 'demo_id', seq: 1 }, value: "I'm the first" } ]
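Note that the inserted _id is the whole document returned by findOneAndUpdate(). If you prefer a plain number as the identifier, a variant is to keep only the seq field; I show it here as a sketch and don't run it, so the sequence numbers below are unchanged:
db.demo.insertOne({
  _id: db.sequences.findOneAndUpdate(
    { _id: "demo_id" },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: "after" }
  ).seq, // keep only the number
  value: "plain number as _id"
});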
I start two transactions:
sessionA = db.getMongo().startSession();
sessionA.startTransaction();
dbA = sessionA.getDatabase(db.getName());
sessionB = db.getMongo().startSession();
sessionB.startTransaction();
dbB = sessionB.getDatabase(db.getName());
Scalable sequence (accepting gaps on rollback)
The two transactions insert a document into "demo" with an "_id" fetched from the "sequences":
dbA.demo.insertOne({
_id: db.sequences.findOneAndUpdate( // non-transactional
{ _id: "demo_id" },
{ $inc: { seq: 1 } },
{ upsert: true, returnDocument: "after" }
),
value: "I'll abort"
});
dbB.demo.insertOne({
_id: db.sequences.findOneAndUpdate( // non-transactional
{ _id: "demo_id" },
{ $inc: { seq: 1 } },
{ upsert: true, returnDocument: "after" }
),
value: "I'll commit"
});
It is important to note that I increment ($inc) and fetch the value (returnDocument: "after") with db, outside of the dbA or dbB transactions. The sequence operation is atomic, but non-transactional (not part of a multi-document transaction). This simulates the behavior of sequences in SQL databases.
The first transaction aborts (rollback) and the second one commits:
sessionA.abortTransaction();
sessionA.endSession();
sessionB.commitTransaction();
sessionB.endSession();
I check the result:
db.demo.find()
[
{ _id: { _id: 'demo_id', seq: 1 }, value: "I'm the first" },
{ _id: { _id: 'demo_id', seq: 3 }, value: "I'll commit" }
]
I have a gap in the numbers because seq: 2 was consumed by the transaction that aborted. The transaction has been rolled back, but because I incremented the sequence outside of the transaction (using db instead of dbA), the increment was not rolled back.
Note that all updates to a single document are atomic, but I used findOneAndUpdate() so that it returns the updated document in the same atomic operation that updated it. It can return the before or after value and I used returnDocument: "after" to get the next value. I used upsert: true to initialize the sequence if no value exists, and $inc sets the field to the specified value when it doesn't exist.
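This pattern is easy to wrap in a small helper. Here is a minimal sketch (the nextVal name is mine): pass db for non-transactional, sequence-like behavior, or a session database to make the increment part of the transaction, as in the next section:
// Returns the next value of a named sequence as a number.
function nextVal(database, name) {
  return database.sequences.findOneAndUpdate(
    { _id: name },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: "after" }
  ).seq;
}
// usage: db.demo.insertOne({ _id: nextVal(db, "demo_id"), value: "..." });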
No-gap Sequences (and optimistic locking)
If you want a no-gap sequence, you can fetch the sequence number in the multi-document transaction:
sessionA = db.getMongo().startSession();
sessionA.startTransaction();
dbA = sessionA.getDatabase(db.getName());
sessionB = db.getMongo().startSession();
sessionB.startTransaction();
dbB = sessionB.getDatabase(db.getName());
dbA.demo.insertOne({
_id: dbA.sequences.findOneAndUpdate( // part of the transaction
{ _id: "demo_id" },
{ $inc: { seq: 1 } },
{ upsert: true, returnDocument: "after" }
),
value: "I'll abort"
});
dbB.demo.insertOne({
_id: dbB.sequences.findOneAndUpdate( // part of the transaction
{ _id: "demo_id" },
{ $inc: { seq: 1 } },
{ upsert: true, returnDocument: "after" }
),
value: "I'll commit"
});
When two transactions try to increment the same sequence, an optimistic locking error is raised:
MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.
This is a retriable error, and the application should implement retry logic:
sessionA.abortTransaction();
// retry the insert
sessionB.abortTransaction();
sessionB.startTransaction();
dbB.demo.insertOne({
_id: dbB.sequences.findOneAndUpdate( // part of the transaction
{ _id: "demo_id" },
{ $inc: { seq: 1 } },
{ upsert: true, returnDocument: "after" }
),
value: "I'll commit"
});
sessionB.commitTransaction();
sessionB.endSession();
sessionA.endSession();
I check the result:
db.demo.find()
[
{ _id: { _id: 'demo_id', seq: 1 }, value: "I'm the first" },
{ _id: { _id: 'demo_id', seq: 3 }, value: "I'll commit" },
{ _id: { _id: 'demo_id', seq: 4 }, value: "I'll commit" }
]
Here, the first session rollback didn't introduce another gap because the sequence increment was part of the insert transaction. To achieve the same in a SQL database, you must use a table and UPDATE ... SET seq=seq+1 RETURNING seq;. With pessimistic locking, it acquires a lock and waits for the other transaction to complete. To be scalable, SQL databases provide a non-transactional SEQUENCE that does the same without waiting, but with gaps. It still has some scalability issues, and distributed databases may discourage (CockroachDB raises a warning) or even not support sequences (like Google Spanner or Amazon Aurora DSQL).
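Rather than hand-coding the retry as above, the whole transaction can be wrapped in a retry loop. Here is a minimal sketch (the runWithRetry name and retry policy are mine); a real application would only retry errors carrying the TransientTransactionError label, such as the WriteConflict shown above:
// Runs the work function in a transaction on the given session, retrying on failure.
function runWithRetry(session, work, maxRetries = 5) {
  const sdb = session.getDatabase(db.getName());
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    session.startTransaction();
    try {
      work(sdb);
      session.commitTransaction();
      return; // success
    } catch (e) {
      try { session.abortTransaction(); } catch (ignored) {}
      if (attempt === maxRetries) throw e;
    }
  }
}
// usage:
// runWithRetry(sessionB, (sdb) => {
//   sdb.demo.insertOne({ _id: nextVal(sdb, "demo_id"), value: "I'll commit" });
// });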
Incrementing identifiers for the primary key
MongoDB provides developers with greater control, allowing them to apply the same logic to a collection while deciding whether to include it in a transaction. It also supports optimized atomic operations. You can also use Atlas triggers to deploy the logic into the managed database, as demonstrated in MongoDB Auto-Increment.
It's important to note that needing an incremented sequence is typically rare, primarily occurring during migrations from MySQL's AUTO_INCREMENT or PostgreSQL's BIGSERIAL. The default "_id" field is a globally unique and scalable ObjectId. You can also use a UUID generated by the application and choose the format (UUIDv4 or UUIDv7) to distribute or colocate documents inserted at the same time.
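For example, in mongosh, UUID() with no argument generates a random (version 4) UUID; a UUIDv7 would come from an application-side library, as the shell has, to my knowledge, no built-in generator for it. A minimal sketch:
db.demo.insertOne({
  _id: UUID(), // client-generated, random UUIDv4
  value: "identified by a client-generated UUID"
});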