You can understand how MongoDB stores documents internally with simple queries that rely on the physical storage ordering. Some databases store records (called rows or tuples) in heap tables, using their physical location in the data files, such as ROWID in Oracle or CTID in PostgreSQL, to reference those records from index entries. In contrast, databases like MySQL's InnoDB or YugabyteDB store records in the primary key index ("clustered index", or "index organized table"), storing them by the logical order of their primary key values, so that secondary indexes point to these logical locations with the primary key, or an encoded version of it.
MongoDB default collections are similar to heap tables because their documents are stored independently of the primary key ("_id") exposed to the application. Internally, the WiredTiger storage engine organizes collection documents using a B+Tree structure, with an internal RecordId as the key, assigned by MongoDB. This structure resembles a clustered index, but it is clustered on an internal key rather than the primary key.
MongoDB’s approach improves on traditional heap tables, especially for storing variable-size documents, because WiredTiger uses B+Tree nodes for efficient space management, reusing space and splitting pages as needed, rather than relying on settings like PCTFREE or FILLFACTOR to reserve space for updates, or SHRINK/VACUUM operations to defragment after deletes.
To cover all cases: with clustered collections, MongoDB can generate the RecordId from "_id", the primary key exposed to the application (which can be a generated ObjectId or defined by the application at insert time), making storage similar to how some databases organize tables as clustered indexes. So, when looking at the storage internals, there are two keys:
- "_id" is the application's primary key, a generated surrogate key, or a natural key. It is always indexed, and this unique index, like other secondary indexes, references the document with a RecordId.
- RecordId is the internal key. It can be generated from "_id" (in clustered collections), but is more generally generated as a monotonically increasing 64-bit integer during inserts. It can be considered a physical address by the query layer, but it is not directly mapped to an address in the filesystem because files are B+Tree structures.
This offers physical data independence since the primary key generation pattern does not affect the storage organization. However, it is helpful to understand how it functions when reviewing execution plans.
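To make the clustered-collection case concrete, here is a minimal sketch of how such a collection is declared (the collection name is just an example):
// Sketch: a clustered collection is clustered on "_id" itself, so documents are
// stored in "_id" order and no separate RecordId sequence or extra "_id" index is needed
db.createCollection("events", {
  clusteredIndex: { key: { _id: 1 }, unique: true, name: "events clustered key" }
});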
Another perspective is that, aside from clustered collections, all indexes in MongoDB, including the primary key index on "_id", are essentially secondary indexes, similar to those found in heap table databases (such as Db2, Oracle, and PostgreSQL). However, instead of a heap table with a fixed block size and row identification tied to their physical location, MongoDB documents are stored within an internal B+Tree index using the WiredTiger engine. Both approaches have their rationale:
- SQL databases are primarily optimized for fixed, normalized schemas with small, uniform row lengths, where a PCTFREE or FILLFACTOR can be set according to the expected updates. Storing larger types involves row chaining or slicing (like Oracle LOB chunks or PostgreSQL TOAST).
- MongoDB is designed for flexible schemas, and collections can contain documents of any size up to the BSON limit. Typically, documents are tens to hundreds of kilobytes, with some larger ones reaching a few megabytes. This flexibility requires adaptable storage management and efforts to minimize fragmentation beyond a small, fixed page size. A B+Tree with a flexible leaf block size is a suitable structure for this purpose.
The document size is flexible, thanks to the storage described above, but the ideal document size is a frequent question. Until I write a blog post on this, here's a slide: the green area indicates where access is most efficient; the red side is acceptable for outliers that don't grow further, but may be a sign of embedding too much; and the blue side works for small documents inserted and queried together, but may also be a sign of unnecessary normalization, where you should embed more to avoid scattered lookups at runtime.
An example to understand the internal ordering
To demonstrate how it works, I generate ten documents and insert them asynchronously, so they may be written to the database in a random order:
db.collection.drop();
Array.from({ length: 10 }, (_, i) => {
  db.collection.insertOne({
    _id: `#${String(i).padStart(5, '0')}`,
    val: Math.random()
  });
});
If I query without any filter or sort, the query planner chooses a COLLSCAN, which reads the records in RecordId order, here the order in which they were inserted:
test> db.collection.find();
[
{ _id: '#00002', val: 0.07658988613973294 },
{ _id: '#00008', val: 0.39893981577036675 },
{ _id: '#00009', val: 0.5279631881196858 },
{ _id: '#00007', val: 0.8445363162277748 },
{ _id: '#00006', val: 0.01935050813731909 },
{ _id: '#00004', val: 0.0732484258238264 },
{ _id: '#00005', val: 0.7733464850237388 },
{ _id: '#00003', val: 0.3356001641172073 },
{ _id: '#00000', val: 0.8956753135566624 },
{ _id: '#00001', val: 0.4952318922619017 }
]
Keep in mind that I'm working with just a single node here. Sharding and parallel processing might retrieve documents in a different order than how they're stored, so you should not rely on any "natural" order. Unless you're conducting this type of investigation, inferring the physical ordering from the query layer, always use an explicit sort operation to specify the expected order of results.
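For example, to guarantee ordering by the application key rather than relying on insertion order (output omitted; it returns the same ten documents, ordered by "_id"):
test> db.collection.find().sort({ _id: 1 });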
I can display the RecordId with .showRecordId(), which adds it to the cursor projection:
test> db.collection.find().showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
Documentation: showRecordId()
Forcing an index with a hint
I can force an index with a hint, for example, the index on "_id", which was created automatically:
test> db.collection.find().hint( { _id: 1} ).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This runs an IXSCAN instead of a COLLSCAN and returns the documents in index order. You can verify it with .explain(), but it is also apparent from the order of the documents fetched, which follows the order of "_id" rather than the insertion order as before (also called "natural" order).
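For example, a quick check of the winning plan (a sketch; the exact plan shape may vary across versions):
test> db.collection.find().hint( { _id: 1 } ).explain().queryPlanner.winningPlan;
// expect a FETCH stage whose inputStage is an IXSCAN on the "_id_" index,
// rather than the COLLSCAN of the unhinted query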
Rather than using a hint, I can add a filter, and the query planner chooses the index. A filter like {$gt:MinKey} or {$lt:MaxKey} does not change the result, but changes the execution plan to an IXSCAN:
test> db.collection.find({_id:{$gt:MinKey}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
An inequality filter like {$ne: null} also runs an IXSCAN, and we observe the results fetched in that order:
test> db.collection.find({_id:{$ne:null}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This technique can be used to add an unbounded range predicate on the indexed sort field, so that the index is used for the sort even in the absence of an equality predicate: MongoDB Equality, Sort, Range (ESR) without Equality (SR)
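As a minimal sketch of that idea on this collection, the unbounded predicate can be combined with an explicit sort, so the result order is guaranteed while the "_id" index serves both the filter and the sort without a blocking SORT stage:
test> db.collection.find({ _id: { $gt: MinKey } }).sort({ _id: 1 });
// same documents as above, returned in "_id" order by the IXSCAN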
Forcing a full scan with a hint for natural order
Hints specify the index definition, and you may wonder how to force a full scan instead of the index scan chosen by the query planner. Remember that it's the index on RecordId that stores the documents, so you can hint this internal index using the $natural operator, asking for the natural order of the collection's documents:
test> db.collection.find({_id:{$ne:null}}).hint({$natural:1}).showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
The documents are fetched in order of RecordId from a COLLSCAN. The hint syntax allows an ascending or descending option to start at the beginning or end of the collection. I'm showing this to explain how records are stored internally. However, if you need a specific order, you should use sort() and let the query planner decide whether to use the index to avoid a sort operation.
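For example, a descending hint walks the collection from the highest RecordId (output omitted; it is the reverse of the listing above):
test> db.collection.find().hint({ $natural: -1 }).showRecordId();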
MongoDB is more than a NoSQL database:
- Like many NoSQL databases, it allows you to query the indexes directly with .hint(), forcing the access path.
- Like all SQL databases, it has a query planner offering data independence, allowing you to declare the collection and expected order with .sort() and let the database optimize the access path.
Avoid combining storage-level instructions, such as .hint(), .min(), or .max(), with declarative query filters in find() or $match, as this can undermine the query planner's guarantees that results match the query predicates. For example, hinting at a partial index might lead to incomplete results.
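As a hypothetical sketch of that last point (the collection and values are only illustrative): a partial index contains only the documents matching its filter expression, so forcing it for a broader predicate can silently skip documents:
// the partial index only indexes documents with val > 0.5
db.scratch.createIndex(
  { val: 1 },
  { partialFilterExpression: { val: { $gt: 0.5 } } }
);
// this filter matches all documents, but the hinted partial index
// cannot return those with val <= 0.5
db.scratch.find({ val: { $gte: 0 } }).hint({ val: 1 });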
Covering indexes and "_id" projection
Understanding what is stored in the index entries helps optimize queries to use an index-only scan (covering index).
For example, the following query reads the index on "_id" and projects only "val" and "_id" (which is included by default):
test> db.collection.find(
  { _id: { $ne: null } },
  { val: 1 }
).explain().queryPlanner.winningPlan;
{
  isCached: false,
  stage: 'PROJECTION_SIMPLE',
  transformBy: { val: 1 },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { _id: 1 },
      indexName: '_id_',
      isMultiKey: false,
      isUnique: true,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { _id: [ '[MinKey, null)', '(null, MaxKey]' ] }
    }
  }
}
Because the index on "_id" holds only the key ("_id") and RecordId, it must fetch the document (FETCH) before the projection (PROJECTION_SIMPLE). Even if it is a primary index from the application's point of view, it is physically equivalent to a secondary index.
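As an aside, a sketch of how to check this: if the projection keeps only "_id", the index entry alone is enough and I would expect a covered plan with no FETCH:
test> db.collection.find(
  { _id: { $ne: null } },
  { _id: 1 }
).explain().queryPlanner.winningPlan;
// expect PROJECTION_COVERED directly over the IXSCAN on "_id_", with no FETCH stage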
I can see the same need for a FETCH with another secondary index:
test> db.collection.createIndex( { val: 1 } );
val_1
test> db.collection.find(
  { val: { $gt: 0 } },
  { val: 1 }
).explain().queryPlanner.winningPlan;
{
  isCached: false,
  stage: 'PROJECTION_SIMPLE',
  transformBy: { val: 1 },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { val: 1 },
      indexName: 'val_1',
      isMultiKey: false,
      multiKeyPaths: { val: [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { val: [ '(0, inf.0]' ] }
    }
  }
}
Such a query projects "_id" because it is included by default, so the index on "val" does not cover all projected fields. To avoid the FETCH, I need to exclude "_id" from the projection explicitly:
test> db.collection.find(
  { val: { $gt: 0 } },
  { val: 1, _id: 0 }
).explain().queryPlanner.winningPlan;
{
  isCached: false,
  stage: 'PROJECTION_COVERED',
  transformBy: { val: 1, _id: 0 },
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { val: 1 },
    indexName: 'val_1',
    isMultiKey: false,
    multiKeyPaths: { val: [] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { val: [ '(0, inf.0]' ] }
  }
}
Another possibility: if I need to project "_id", I can add it to the index definition, making it a covering index for my query:
test> db.collection.createIndex( { val: 1 , _id: 1 } );
val_1__id_1
test> db.collection.find(
{
by Franck Pachot
Percona Database Performance Blog
As enterprise software vendors race toward proprietary cloud ecosystems, some features long relied upon by businesses are being quietly deprecated. One recent example is MongoDB Enterprise Advanced and Atlas dropping support for LDAP authentication, a foundational identity protocol for countless organizations. At Percona, we’re taking a different path. We’ve supported LDAP in Percona Server for MongoDB for […]
by Radoslaw Szulgo
August 06, 2025
Murat Demirbas
This paper from HotStorage'25 presents OrcaCache, a design proposal for a coordinated caching framework tailored to disaggregated storage systems. In a disaggregated architecture, compute and storage resources are physically separated and connected via high-speed networks. These have become increasingly common in modern data centers because they enable flexible resource scaling and improved fault isolation. (Follow the money, as they say!) But accessing remote storage introduces serious latency and efficiency challenges. The paper positions OrcaCache as a solution that mitigates these challenges by orchestrating caching logic across clients and servers. Important note: in the paper's terminology, the server means the storage node, and the client means the compute node.
As we did last week for another paper, Aleksey and I live-recorded our reading/discussion of this paper. We do this to teach the thought-process and mechanics of how experts read papers in real time. Check our discussion video below (please listen at 1.5x, I sound less horrible at that speed). The paper I annotated during our discussion is also available here.
The problem
Caching plays a crucial role in reducing the overheads of disaggregated storage, but the paper claims that current strategies (client-local caching, server-only caching, and independent client-server caching) fall short. Client-local caching is simple and avoids server overhead but underutilizes memory on the server. Server-only caching can reduce backend I/O pressure but comes at the cost of network round-trips and significant server CPU load. Independent client-server caching combines the two but lacks coordination between the caches, leading to data duplication, inefficient eviction and prefetching policies, and causes fairness issues in multi-client environments.
The proposed design
OrcaCache proposes to address these shortcomings by shifting the cache index and coordination responsibilities to the client side. Clients maintain a global view of the cache and communicate directly with the server-side cache using RDMA, which enables bypassing the server CPU in the common case. Server-side components are minimized to a daemon that tracks resource usage and allocates memory based on fairness and pressure.
Discussion
OrcaCache stops short of addressing the core system-level challenges of a realistic multi-client deployment. A single-server, single-client setup is used in the experiments in Figure 1, and also for most of the description in the paper. The paper's solution for dealing with multiple clients is to use a separate namespace for each client, but on the server side this consumes a lot of resources and causes duplication of cached items. There is no mutual benefit or collaboration among clients in this setup.
The paper also mentions how clients could interact with a server-side daemon, how RDMA-based lookups and cache updates would be issued, and how resources might be allocated based on monitored pressure, but many of these mechanisms remain speculative. The authors mention flexible eviction and prefetching but do not explore the complexity of maintaining consistency or fairness across diverse workloads. AI/ML workloads are mentioned or alluded to, but not actually tested in the paper.
In the end, the paper's contribution lies more in reopening a line of thought from 1990s cooperative caching and global memory management research: how to make cache coherence across disaggregated compute and storage both efficient and scalable. The idea OrcaCache leans on is that, rather than burdening the server, it makes the client responsible for coordination, enabled by fast networks and abundant memory.
Also despite the title, there was not much Tango in the paper. It was mostly cache.
by Murat (noreply@blogger.com)
Percona Database Performance Blog
If you’re running MySQL 8.0 databases, you need to know this: Oracle will stop supporting them in April 2026. That means no more security patches, bug fixes, or help when things go wrong. Maybe you’re thinking, “But April 2026 feels far away!“. But once that date hits, every day you keep running MySQL 8.0 makes […]
by David Quilty
Murat Demirbas
This paper from SIGMOD 2016 proposes a transaction healing approach to improve the scalability of Optimistic Concurrency Control (OCC) in main-memory OLTP systems running on multicore architectures. Instead of discarding the entire execution when validation fails, the system repairs only the inconsistent operations to improve throughput in high-contention scenarios.
If this sounds familiar, it's because we recently reviewed the Morty paper from EuroSys 2023, which applied healing ideas to interactive transactions using continuations to support re-execution. This 2016 Transaction Healing paper is scoped to static stored procedures and focuses on integrating healing into OCC.
Key Ideas
OCC works well under low contention because it separates reads from writes and keeps critical sections short (only for validation). But under high contention, especially in workloads with skewed access patterns (like Zipfian distributions), transactions are frequently invalidated by concurrent updates. The naive OCC response of abort and restart leads to wasting CPU cycles and degrading cache locality.
Transaction healing aims to address this problem by observing/betting that most validation failures affect only a subset of a transaction's operations. If only the affected operations can be detected and recovered, the system can avoid redoing the entire transaction. They implement this by leveraging two components.
First, a static analysis phase extracts operation dependencies from the stored procedure a priori. The dependency analysis distinguishes between two types of relations: key-dependencies, where the result of one operation determines the lookup key for another; and value-dependencies, where the value produced by one operation is used in a subsequent one. With this graph in hand, transaction healing can surgically repair any non-serializable operation at runtime.
Second, a runtime access cache, maintained per thread, tracks the behavior of each executed operation (its inputs, outputs, effects, and the memory addresses it accessed) and identifies conflicted parts of a transaction at runtime. The access cache supports this by recording memory addresses (avoiding repeated index lookups) and allowing efficient reuse of unaffected results.
Transaction healing
The healing process is triggered during the validation phase, when an inconsistency is detected in the read/write set. Rather than aborting immediately, the system identifies the earliest affected operation (using its dependency graph), and restores it. If the operation is value-dependent, healing updates its effects based on cached inputs and outputs. If it's key-dependent, a re-execution is necessary since the accessed record may change. The healing propagates forward through the dependency graph, recursively restoring all operations affected by the initial inconsistency.
The healing mechanism is built to preserve serializability. Validation acquires locks in a globally consistent order (e.g., sorted by memory address) to avoid deadlocks. If during healing a lock must be acquired out of order (e.g., due to new dependencies introduced by re-executed operations), the transaction is aborted in order not to risk a deadlock. The paper says this situation is rare due to validation-order optimizations. Despite occasional aborts, transaction healing guarantees forward progress and eventual termination: each transaction's read/write set is finite and every element is validated at most once, which ensures that healing either succeeds or fails definitively.
Evaluation Highlights
They implemented a C++ in-memory database engine, THEDB, to test these ideas. THEDB employs LLVM to perform static dependency analysis on stored procedures and includes support for standard database features like inserts, deletes, and range queries (the latter protected against phantoms via B+-tree versioning, as in Silo). The authors evaluate THEDB on a 48-core AMD machine using two common benchmarks: TPC-C and Smallbank. THEDB is compared against five systems: variants of OCC (including Silo-style), 2PL, a hybrid OCC-2PL approach, and a deterministic partitioned system.
The results show that, under high contention, THEDB significantly outperforms the alternatives, achieving up to 6.2x higher throughput than Silo and approaching the performance of an idealized OCC system with validation disabled. This shows that transaction healing adds minimal overhead and successfully eliminates the restart costs that dominate OCC's performance under load. Moreover, THEDB maintains stable throughput as contention increases (e.g., under more skewed Zipfian distributions), while traditional OCC and Silo degrade rapidly. Scalability is also great up to 48 cores.
Discussion
**** What are the limitations of the static analysis used?
Transaction healing proposed here is limited to stored procedures because it relies on static dependency extraction. Unlike Morty, which handles interactive transactions using runtime continuations, this work cannot deal with dynamic control flow or unknown transaction logic at runtime. As a result, ad-hoc queries revert to standard OCC, where any healing benefit is lost.
On the other hand, there is some subtlety here. Transaction healing does not require read/write sets to be declared in advance as the deterministic systems like Calvin do. Deterministic systems must know the exact records a transaction will access before it begins execution, so they can assign transactions to partitions and establish a global execution order. Transaction healing avoids this rigidity. It doesn't need to know which specific records a transaction will access ahead of time. Instead, it relies on static analysis to extract the structure of the transaction logic, namely which operations depend on which others. These dependencies, such as key or value dependencies between operations, are known statically because the transaction logic is written as a stored procedure. But the actual keys and values involved are discovered dynamically as the transaction executes. The system uses an access cache to record which memory locations were read or written, and validation happens afterward. This flexibility allows transaction healing to support dynamic, cross-partition access patterns without prior declaration.
**** How does this compare with Morty?
Transaction Healing is designed for in-memory OLTP systems running with OCC on multicore machines, where the workload consists of static stored procedures. Morty, in contrast, is built for a distributed geo-replicated system and handles interactive transactions with dynamic control flow. It uses MVTSO, with speculative execution and a priori ordering. Unlike THEDB, Morty allows transactions to read from uncommitted versions, exposing concurrency that traditional systems suppress. It tracks execution through continuation-passing style (CPS) in order to make control dependencies explicit and enable partial re-execution of logic branches. While transaction healing employed LLVM to automatically perform static dependency analysis on stored procedures, Morty did not automate translation of transaction program to CPS program. Finally, since it is distributed and deployed over WAN, Morty integrates concurrency control with replication to reduce latency and uses quorum voting to maintain fault-tolerant correctness without centralized logging.
by Murat (noreply@blogger.com)