A curated list of database news from authoritative sources

August 08, 2025

August 07, 2025

Neurosymbolic AI: Why, What, and How

The paper (2023) argues for integrating two historically divergent traditions in artificial intelligence (neural networks and symbolic reasoning) into a unified paradigm called Neurosymbolic AI. It contends that the path to capable, explainable, and trustworthy artificial intelligence lies in marrying perception-driven neural systems with structure-aware symbolic models.

The authors lean on Daniel Kahneman’s story of two systems in the mind (Thinking, Fast and Slow). Neural networks are the fast ones: pattern-hungry, intuitive, good with unstructured mess. Symbolic methods are the slow ones: careful, logical, good with rules and plans. Neural networks, especially in their modern incarnation as large language models (LLMs), excel at pattern recognition, but fall short in tasks demanding multi-step reasoning, abstraction, constraint satisfaction, or explanation. Conversely, symbolic systems offer interpretability, formal correctness, and composability, but tend to be brittle (not incremental/monotonic), difficult to scale, and poorly suited to noisy or incomplete inputs.

The paper argues that true AI systems must integrate both paradigms, leveraging the adaptability of neural systems while grounding them in symbolic structure. This argument is compelling, particularly in safety-critical domains like healthcare or law, where transparency and adherence to rules are essential.

The paper divides Neurosymbolic AI into two complementary approaches: compressing symbolic knowledge into neural models, and lifting neural outputs into symbolic structures.

The first approach is to embed symbolic structures such as ontologies or knowledge graphs (KGs) into neural vector spaces suitable for integration with neural networks. This can be done via embedding techniques that convert symbolic structures into high-dimensional vectors, or through more direct inductive bias mechanisms that influence a model's architecture. While this enables neural systems to make use of background knowledge, it often loses semantic richness in the process. The neural model benefits from the knowledge, but the end-user gains little transparency, and the symbolic constraints are difficult to trace or modify. Nevertheless, this approach scales well and offers modest improvements in cognitive tasks like abstraction and planning.
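
For a flavor of the embedding route, here is a toy TransE-style scoring function (TransE is one common KG-embedding technique, not named in the paper; the entities and vectors below are random placeholders, not trained):

import numpy as np

# Toy TransE-style scoring: a plausible triple (h, r, t) should satisfy
# h + r ≈ t in the embedding space, so the distance is small.
rng = np.random.default_rng(0)
dim = 8
E = {name: rng.normal(size=dim) for name in ('curie', 'radium', 'polonium')}
R = {'discovered': rng.normal(size=dim)}

def score(head, rel, tail):
    # Higher (less negative) means more plausible after training.
    return -np.linalg.norm(E[head] + R[rel] - E[tail])

# Untrained here, so scores are arbitrary; training would pull true triples up.
print(score('curie', 'discovered', 'radium'))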

The second approach works in the opposite direction. It begins with the neural representation and lifts it into symbolic form. This involves extracting structured symbolic patterns, explanations, or logical chains of reasoning from the output of neural systems. One common approach is through federated pipelines where a large language model decomposes a query into subtasks and delegates those to domain-specific symbolic solvers, such as math engines or search APIs. Another strategy involves building fully differentiable pipelines where symbolic constraints, expert-defined rules, and domain concepts are embedded directly into the neural training process. This allows for true end-to-end learning while preserving explainability and control. These lifting-based systems show the greatest potential: they not only maintain large-scale perception but also achieve high marks in abstraction, analogy, and planning, along with excellent explainability and adaptability.
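
The federated style is easy to picture as a dispatcher. In this toy sketch (entirely illustrative; the subtask format and solver registry are made up, and the LLM decomposition step is stubbed out), typed subtasks are routed to symbolic backends:

# Hypothetical sketch of a federated neurosymbolic pipeline: an LLM (stubbed
# here) decomposes a query into typed subtasks, each routed to a symbolic solver.
from fractions import Fraction

def math_solver(payload):           # stand-in for a math engine
    return Fraction(payload["a"]) + Fraction(payload["b"])

def kg_solver(payload):             # stand-in for a knowledge-graph lookup
    facts = {("curie", "discovered"): "radium"}
    return facts.get((payload["subject"], payload["predicate"]))

SOLVERS = {"math": math_solver, "kg": kg_solver}

def run_pipeline(subtasks):
    # In a real system, `subtasks` would come from an LLM's decomposition step.
    return [SOLVERS[t["kind"]](t["payload"]) for t in subtasks]

print(run_pipeline([
    {"kind": "math", "payload": {"a": "1/3", "b": "1/6"}},
    {"kind": "kg", "payload": {"subject": "curie", "predicate": "discovered"}},
]))   # -> [Fraction(1, 2), 'radium']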

The case study in a mental health application shows promise. The system's ability to map raw social media text to clinical ontologies and generate responses constrained by medical knowledge illustrates the potential of well-integrated symbolic and neural components. However, these examples also hint at the limitations of current implementations: it is not always clear how the symbolic reasoning is embedded or whether the system guarantees consistency under update or multi-agent interaction.

Knowledge graphs versus symbolic solvers

The paper claims that knowledge graphs (KGs) are especially well-suited for this integration—serving as the symbolic scaffolding that supports neural learning. KGs are graph-structured representations of facts, typically in the form of triples: (subject, predicate, object). KGs are praised for their flexibility, updateability, and ability to represent dynamic real-world entities. But the paper then waves off formal logic, especially first-order logic (FOL), as static and brittle. That's not fair. Knowledge graphs are great for facts: "Marie Curie discovered radium". But when it comes to constraint satisfaction or verifying safety, you'll need real logic. The kind with proofs.

First-order logic is only brittle when you try to do too much with it all at once. Modern logic systems (SMT solvers, expressive type systems, modular specs) can be quite robust. The paper misses a chance here. It doesn't mention the rich and growing field where LLMs and symbolic solvers already collaborate (e.g., GPT writes a function and Z3 checks if it's wrong, and logic engines validate that generated plans do not violate physics or safety).

Knowledge graphs and symbolic logic don’t need to fight, as they don't compete like Coke and Pepsi. They are more like peanut butter and jelly. You can use a knowledge graph to instantiate a logic problem. You can generate FOL rules from ontologies. You can use SMT to enforce constraints (e.g., cardinality, ontological coherence). You can even use a theorem prover to validate new triples before inserting them into the graph. You can also run inference rules to expand a knowledge graph deductively.
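
As a minimal sketch of that last idea, here is how an SMT solver could vet a new triple against a cardinality constraint before insertion (using the z3-solver Python package; the entities and the functional-birthplace rule are made up for illustration):

from z3 import (Solver, DeclareSort, Function, BoolSort, Const,
                ForAll, Implies, And, unsat)

# A tiny "ontology": birthplace is functional (each entity has at most one).
Entity = DeclareSort('Entity')
born_in = Function('born_in', Entity, Entity, BoolSort())
marie, paris, warsaw = (Const(n, Entity) for n in ('marie', 'paris', 'warsaw'))

s = Solver()
x, y, z = (Const(n, Entity) for n in ('x', 'y', 'z'))
s.add(ForAll([x, y, z], Implies(And(born_in(x, y), born_in(x, z)), y == z)))
s.add(paris != warsaw)           # distinct cities
s.add(born_in(marie, warsaw))    # triple already in the knowledge graph

# Proposed new triple: (marie, born_in, paris) -- check before inserting.
s.push()
s.add(born_in(marie, paris))
print("reject" if s.check() == unsat else "accept")   # -> reject
s.pop()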

But the paper doesn't explore how lifted neural outputs could feed into symbolic solvers for planning, synthesis, or reasoning. It misses the current push to combine neural generation with symbolic checking, where LLMs propose and verifiers dispose in a feedback loop.

MongoDB indexing internals: .showRecordId() and .hint({$natural:1})

You can understand how MongoDB stores documents internally with simple queries that rely on the physical storage ordering. Some databases store records (called rows or tuples) in heap tables, using their physical location in the data files, such as ROWID in Oracle or CTID in PostgreSQL, to reference those records from index entries. In contrast, databases like MySQL's InnoDB or YugabyteDB store records in the primary key index ("clustered index", or "index organized table"), storing them by the logical order of their primary key values, so that secondary indexes point to these logical locations with the primary key, or an encoded version of it.

MongoDB default collections are similar to heap tables because their documents are stored independently of the primary key ("_id") exposed to the application. Internally, the WiredTiger storage engine organizes collection documents using a B+Tree structure, with an internal RecordId as the key, assigned by MongoDB. This structure resembles a clustered index, but it is clustered on an internal key rather than the primary key.

MongoDB’s approach improves on traditional heap tables, especially for storing variable-size documents, because WiredTiger uses B+Tree nodes for efficient space management, reusing space and splitting pages as needed, rather than relying on settings like PCTFREE or FILLFACTOR to reserve space for updates, or SHRINK/VACUUM operations to defragment after deletes.

To cover all cases: with clustered collections, MongoDB can generate the RecordId from "_id", the primary key exposed to the application ("_id" can be a generated ObjectId or defined by the application at insert time), making storage similar to how some databases organize tables in clustered indexes. So, when looking at the storage internals, there are two keys:

  • "_id" is the application's primary key, a generated surrogate key, or a natural key. It is always indexed, and this unique index, like other secondary indexes, references the document with a RecordId
  • RecordId is the internal key. It can be generated from "_id" (in clustered collections), but is more generally generated as a monotonically increasing 64-bit integer during inserts. It can be considered a physical address by the query layer, but it is not directly mapped to an address in the filesystem because files are B+Tree structures. This offers physical data independence since the primary key generation pattern does not affect the storage organization. However, it is helpful to understand how it functions when reviewing execution plans.
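
A minimal PyMongo sketch of creating such a clustered collection (the collection name is illustrative; clusteredIndex requires MongoDB 5.3 or later):

# Clustered collection: documents are stored in a B+Tree keyed on "_id"
# instead of an internal integer RecordId.
from pymongo import MongoClient

db = MongoClient().test
db.drop_collection("events")
db.create_collection("events", clusteredIndex={"key": {"_id": 1}, "unique": True})
db.events.insert_one({"_id": "2025-08-07#001", "payload": "..."})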

Another perspective is that, aside from clustered collections, all indexes in MongoDB, including the primary key index on "_id", are essentially secondary indexes, similar to those found in heap table databases (such as Db2, Oracle, and PostgreSQL). However, instead of a heap table with a fixed block size and row identification tied to their physical location, MongoDB documents are stored within an internal B+Tree index using the WiredTiger engine. Both approaches have their rationale:

  • SQL databases are primarily optimized for fixed, normalized schemas with small, uniform row lengths, where a PCTFREE or FILLFACTOR can be set according to the expected updates. Storing larger types involves row chaining or slicing (like Oracle LOB chunks or PostgreSQL TOAST)
  • MongoDB is designed for flexible schemas, and collections can contain documents of any size up to the BSON limit. Typically, documents are tens to hundreds of kilobytes, with some larger ones reaching a few megabytes. This flexibility requires adaptable storage management and efforts to minimize fragmentation beyond a small, fixed page size. A B+Tree with a flexible leaf block size is a suitable structure for this purpose.

The document size is flexible, thanks to the storage described above, but the ideal document size is a frequent question. Until I write a blog post on this, here’s a slide: the green area indicates where the most efficient access is; the red side is acceptable for outliers if they don't grow further, but may be a sign of embedding too much; and the blue side works for small documents inserted and queried together, but may also be a sign of unnecessary normalization that should be embedded to avoid scattered lookups at runtime.

An example to understand the internal ordering

To demonstrate how it works, I generate ten documents and insert them asynchronously, so they may be written to the database in a random order:

db.collection.drop();
Array.from({ length: 10 }, (_, i) => {
  db.collection.insertOne({
    _id: `#${String(i).padStart(5, '0')}`,
    val: Math.random()
  });
});

If I query without any filter or sort, the query planner chooses a COLLSCAN, which reads the records in the order of their RecordId, which is the order they were inserted:

test> db.collection.find();

[
  { _id: '#00002', val: 0.07658988613973294 },
  { _id: '#00008', val: 0.39893981577036675 },
  { _id: '#00009', val: 0.5279631881196858 },
  { _id: '#00007', val: 0.8445363162277748 },
  { _id: '#00006', val: 0.01935050813731909 },
  { _id: '#00004', val: 0.0732484258238264 },
  { _id: '#00005', val: 0.7733464850237388 },
  { _id: '#00003', val: 0.3356001641172073 },
  { _id: '#00000', val: 0.8956753135566624 },
  { _id: '#00001', val: 0.4952318922619017 }
]

Keep in mind that I'm working with just a single node here. Sharding and parallel processing might retrieve rows in a different order than how they're stored. You should not rely on any "natural" order. Instead, unless you're conducting this type of investigation, where you're guessing the physical ordering from the query layer, ensure that you use an explicit sort operation to specify the expected order of results.

I can display the RecordId with .showRecordId(), which adds it to the cursor projection:

test> db.collection.find().showRecordId();

[
  { _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
  { _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
  { _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
  { _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
  { _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
  { _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
  { _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
  { _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
  { _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
  { _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]

Documentation: showRecordId()

Forcing an index with a hint

I can force an index with a hint, for example the index on "_id" which was created automatically:

test> db.collection.find().hint( { _id: 1} ).showRecordId();

[
  { _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
  { _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
  { _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
  { _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
  { _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
  { _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
  { _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
  { _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
  { _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
  { _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]

This runs an IXSCAN instead of a COLLSCAN and returns the documents in the order of the index. You can verify it with .explain(), but it is also perceptible from the order of the documents fetched, which follows the order of "_id" rather than the order of insertion as before (also called "natural" order).

Rather than using a hint, I can add a filter, and the query planner chooses the index. A filter like {$gt:MinKey} or {$lt:MaxKey} does not change the result, but changes the execution plan to an IXSCAN:

test> db.collection.find({_id:{$gt:MinKey}}).showRecordId();

[
  { _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
  { _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
  { _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
  { _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
  { _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
  { _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
  { _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
  { _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
  { _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
  { _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]

An inequality filter such as {$ne:null} will also run an IXSCAN, and we observe the results fetched in index order:

test> db.collection.find({_id:{$ne:null}}).showRecordId();

[
  { _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
  { _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
  { _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
  { _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
  { _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
  { _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
  { _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
  { _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
  { _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
  { _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]

This technique adds an unbounded range predicate on the indexed sort field so that the index can be used for the sort in the absence of an equality predicate: MongoDB Equality, Sort, Range (ESR) without Equality (SR)

Forcing a full scan with a hint for natural order

Hints specify the index definition, and you may wonder how to force a full scan instead of the index scan chosen by the query planner. Remember that it's an index on RecordId that stores the documents. So you can hint this internal index using the $natural operator, asking for the natural order of the collection documents:

test> db.collection.find({_id:{$ne:null}}).hint({$natural:1}).showRecordId();

[
  { _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
  { _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
  { _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
  { _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
  { _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
  { _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
  { _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
  { _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
  { _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
  { _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]

The documents are fetched in order of RecordId from a COLLSCAN. The hint syntax allows an ascending or descending option to start at the beginning or end of the collection. I'm showing this to explain how records are stored internally. However, if you need a specific order, you should use sort() and let the query planner decide whether to use the index to avoid a sort operation.

MongoDB is more than a NoSQL database:

  • Like many NoSQL databases, it allows you to query the indexes directly with .hint(), forcing the access path
  • Like all SQL databases, it has a query planner offering data independence, allowing you to declare the collection and expected order with .sort() and let the database optimize the access path.

Avoid combining storage-level instructions, such as .hint(), .min(), or .max(), with declarative query filters in find() or $match, as this can undermine the query planner's guarantees that results match the query predicates. For example, hinting at a partial index might lead to incomplete results.
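
To illustrate the partial-index pitfall, here is a hypothetical PyMongo sketch (the collection and index names are made up):

# Forcing a partial index with .hint() silently drops matching documents.
from pymongo import MongoClient

db = MongoClient().test
db.c.drop()
db.c.insert_many([{"x": i} for i in range(10)])
db.c.create_index("x", name="x_partial",
                  partialFilterExpression={"x": {"$gt": 5}})

print(db.c.count_documents({"x": {"$gt": 0}}))                     # 9: correct
# The hinted partial index only contains documents with x > 5:
print(len(list(db.c.find({"x": {"$gt": 0}}).hint("x_partial"))))   # 4: incomplete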

Covering indexes and "_id" projection

Understanding what is stored in the index entries helps optimize queries to use an index-only scan (covering index).

For example, the following query reads the index on "_id" and projects only "val" and "_id" (which is included by default):

test> db.collection.find(
 { _id: { $ne: null } },
 { val: 1 }
).explain().queryPlanner.winningPlan;

{
  isCached: false,
  stage: 'PROJECTION_SIMPLE',
  transformBy: { val: 1 },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { _id: 1 },
      indexName: '_id_',
      isMultiKey: false,
      isUnique: true,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { _id: [ '[MinKey, null)', '(null, MaxKey]' ] }
    }
  }
}

Because the index on "_id" holds only the key ("_id") and RecordId, it must fetch the document (FETCH) before the projection (PROJECTION_SIMPLE). Even if it is a primary index from the application's point of view, it is physically equivalent to a secondary index.

I can see the same with another secondary index:

test> db.collection.createIndex( { val: 1 } );

val_1

test> db.collection.find(
 { val:{$gt: 0} },
 { val: 1 }
).explain().queryPlanner.winningPlan;

{
  isCached: false,
  stage: 'PROJECTION_SIMPLE',
  transformBy: { val: 1 },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { val: 1 },
      indexName: 'val_1',
      isMultiKey: false,
      multiKeyPaths: { val: [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { val: [ '(0, inf.0]' ] }
    }
  }
}

Such a query projects "_id" because it is included by default, so the index on "val" does not cover all projected fields. To avoid the FETCH, I need to remove "_id" from the projection explicitly:

test> db.collection.find(
 { val:{$gt: 0} },
 { val: 1 , _id: 0 }
).explain().queryPlanner.winningPlan;

{
  isCached: false,
  stage: 'PROJECTION_COVERED',
  transformBy: { val: 1, _id: 0 },
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { val: 1 },
    indexName: 'val_1',
    isMultiKey: false,
    multiKeyPaths: { val: [] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { val: [ '(0, inf.0]' ] }
  }
}

Another possibility: if I need to project "_id", I can add it to the index definition, making it a covering index for my query:

test> db.collection.createIndex( { val: 1 , _id: 1 } );

val_1__id_1

test> db.collection.find(
 { val:{$gt: 0} },
 { val: 1 }
).explain().queryPlanner.winningPlan;

With "_id" now present in the index entries, the winning plan should show PROJECTION_COVERED over an IXSCAN on val_1__id_1, with no FETCH stage, since both projected fields are available from the index.


LDAP Isn’t Going Away, and Neither Is Our Support for Percona Server for MongoDB

As enterprise software vendors race toward proprietary cloud ecosystems, some features long relied upon by businesses are being quietly deprecated. One recent example is MongoDB Enterprise Advanced and Atlas dropping support for LDAP authentication, a foundational identity protocol for countless organizations. At Percona, we’re taking a different path. We’ve supported LDAP in Percona Server for MongoDB for […]

August 06, 2025

Can a Client–Server Cache Tango Accelerate Disaggregated Storage?

This paper from HotStorage'25 presents OrcaCache, a design proposal for a coordinated caching framework tailored to disaggregated storage systems. In a disaggregated architecture, compute and storage resources are physically separated and connected via high-speed networks. These have become increasingly common in modern data centers because they enable flexible resource scaling and improved fault isolation. (Follow the money, as they say!) But accessing remote storage introduces serious latency and efficiency challenges. The paper positions OrcaCache as a solution that mitigates these challenges by orchestrating caching logic across clients and servers. Important note: in the paper's terminology, the server is the storage node and the client is the compute node.

As we did last week for another paper, Aleksey and I live-recorded our reading/discussion of this paper. We do this to teach the thought-process and mechanics of how experts read papers in real time. Check our discussion video below (please listen at 1.5x, I sound less horrible at that speed). The paper I annotated during our discussion is also available here.


The problem

Caching plays a crucial role in reducing the overheads of disaggregated storage, but the paper claims that current strategies (client-local caching, server-only caching, and independent client-server caching) fall short. Client-local caching is simple and avoids server overhead but underutilizes memory on the server. Server-only caching can reduce backend I/O pressure but comes at the cost of network round-trips and significant server CPU load. Independent client-server caching combines the two but lacks coordination between the caches, leading to data duplication, inefficient eviction and prefetching policies, and causes fairness issues in multi-client environments.


The proposed design

OrcaCache proposes to address these shortcomings by shifting the cache index and coordination responsibilities to the client side. Clients maintain a global view of the cache and communicate directly with the server-side cache using RDMA, which enables bypassing the server CPU in the common case. Server-side components are minimized to a daemon that tracks resource usage and allocates memory based on fairness and pressure. 
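
To make the division of labor concrete, here is a toy model of the client-coordinated idea (entirely my own sketch; the paper does not specify interfaces at this level): the client keeps the index of what the server-side cache holds and only takes the slow path on a miss:

# Toy model of a client-maintained cache index over a remote cache:
# lookups consult the local index first, emulating one-sided RDMA reads
# that bypass the server CPU; misses go through the slower server path.
class OrcaClientSketch:
    def __init__(self, backend):
        self.backend = backend        # stand-in for the storage backend
        self.remote_index = {}        # block id -> remote cache slot
        self.remote_cache = {}        # stand-in for server-side cache memory

    def read(self, block_id):
        slot = self.remote_index.get(block_id)
        if slot is not None:
            return self.remote_cache[slot]      # "RDMA" fast path
        data = self.backend[block_id]           # miss: server/backend path
        slot = len(self.remote_cache)
        self.remote_cache[slot] = data          # install in the remote cache
        self.remote_index[block_id] = slot      # and in the client-side index
        return data

client = OrcaClientSketch(backend={0: b"block-0", 1: b"block-1"})
print(client.read(0), client.read(0))   # second read hits the fast path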

Discussion 

OrcaCache stops short of addressing the core system-level challenges of a realistic multi-client deployment. A single-server, single-client setup is used in the experiments in Figure 1, and for most of the description in the paper. The paper's solution for multiple clients is a separate namespace per client, but on the server side this consumes significant resources and causes duplication of cached items. There is no mutual benefit or collaboration among clients in this setup.

The paper also mentions how clients could interact with a server-side daemon, how RDMA-based lookups and cache updates would be issued, and how resources might be allocated based on monitored pressure, but many of these mechanisms remain speculative. The authors mention flexible eviction and prefetching but do not explore the complexity of maintaining consistency or fairness across diverse workloads. AI/ML workloads are alluded to but not actually tested in the paper.

In the end, the paper's contribution lies more in reopening a line of thought from 1990s cooperative caching and global memory management research: how to make cache coherence across disaggregated compute and storage both efficient and scalable. The idea OrcaCache seems to lean on is that rather than burden the server, it makes the client responsible for coordination, enabled by fast networks and abundant memory.

Also, despite the title, there was not much Tango in the paper. It was mostly cache.

MySQL 8.0 End of Life Date: What Happens Next?

If you’re running MySQL 8.0 databases, you need to know this: Oracle will stop supporting them in April 2026. That means no more security patches, bug fixes, or help when things go wrong. Maybe you’re thinking, “But April 2026 feels far away!“. But once that date hits, every day you keep running MySQL 8.0 makes […]

Transaction Healing: Scaling Optimistic Concurrency Control on Multicores

This paper from SIGMOD 2016 proposes a transaction healing approach to improve the scalability of Optimistic Concurrency Control (OCC) in main-memory OLTP systems running on multicore architectures. Instead of discarding the entire execution when validation fails, the system repairs only the inconsistent operations to improve throughput in high-contention scenarios.

If this sounds familiar, it's because we recently reviewed the Morty paper from EuroSys 2023, which applied healing ideas to interactive transactions using continuations to support re-execution. This 2016 Transaction Healing paper is scoped to static stored procedures and focuses on integrating healing into OCC.


Key Ideas

OCC works well under low contention because it separates reads from writes and keeps critical sections short (only for validation). But under high contention, especially in workloads with skewed access patterns (like Zipfian distributions), transactions are frequently invalidated by concurrent updates. The naive OCC response of abort-and-restart wastes CPU cycles and degrades cache locality.

Transaction healing aims to address this problem by observing/betting that most validation failures affect only a subset of a transaction's operations. If only the affected operations can be detected and recovered, the system can avoid redoing the entire transaction. They implement this by leveraging two components.

First, a static analysis phase extracts operation dependencies from the stored procedure a priori. The dependency analysis distinguishes between two types of relations: key-dependencies, where the result of one operation determines the lookup key for another; and value-dependencies, where the value produced by one operation is used in a subsequent one. With this graph in hand, transaction healing can surgically repair any non-serializable operation at runtime.

Second, a runtime access cache, maintained per thread, tracks the behavior of each executed operation (its inputs, outputs, effects, and the memory addresses it accessed) and identifies conflicted parts of a transaction at runtime. The access cache supports this by recording memory addresses (avoiding repeated index lookups) and allowing efficient reuse of unaffected results.


Transaction healing

The healing process is triggered during the validation phase, when an inconsistency is detected in the read/write set. Rather than aborting immediately, the system identifies the earliest affected operation (using its dependency graph), and restores it. If the operation is value-dependent, healing updates its effects based on cached inputs and outputs. If it's key-dependent, a re-execution is necessary since the accessed record may change. The healing propagates forward through the dependency graph, recursively restoring all operations affected by the initial inconsistency.
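
A toy sketch of this forward repair (my own illustration, not the paper's code; the operations and dependency encoding are simplified stand-ins for THEDB's access cache and memory addresses):

# Toy model of transaction healing: on validation failure, re-execute only
# the stale operation and its forward dependents, reusing cached results.
store = {"x": 10, "y": 20}

# Dependency graph extracted "a priori" from the stored procedure:
# w1 is value-dependent on r1; r2 is independent of both.
ops = {
    "r1": {"deps": [], "run": lambda env: store["x"]},
    "r2": {"deps": [], "run": lambda env: store["y"]},
    "w1": {"deps": ["r1"], "run": lambda env: env["r1"] + 1},
}
order = ["r1", "r2", "w1"]

def execute():
    cache = {}
    for op in order:
        cache[op] = ops[op]["run"](cache)       # record results (access cache)
    return cache

def heal(cache, invalidated):
    dirty, cache = set(invalidated), dict(cache)
    for op in order:                            # forward pass in program order
        if op in dirty or any(d in dirty for d in ops[op]["deps"]):
            cache[op] = ops[op]["run"](cache)   # repair this operation
            dirty.add(op)                       # propagate to dependents
    return cache

cache = execute()            # {'r1': 10, 'r2': 20, 'w1': 11}
store["x"] = 42              # a concurrent writer invalidates the read of x
print(heal(cache, {"r1"}))   # r2 reused; r1 and w1 healed to 42 and 43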

The healing mechanism is built to preserve serializability. Validation acquires locks in a globally consistent order (e.g., sorted by memory address) to avoid deadlocks. If during healing a lock must be acquired out of order (e.g., due to new dependencies introduced by re-executed operations), the transaction is aborted in order not to risk a deadlock. The paper says this situation is rare due to validation-order optimizations. Despite occasional aborts, transaction healing guarantees forward progress and eventual termination: each transaction's read/write set is finite and every element is validated at most once, which ensures that healing either succeeds or fails definitively.
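
The globally ordered lock acquisition is simple to sketch (an illustration only; THEDB orders by memory address, here we order by key):

# Acquiring validation locks in one global order prevents deadlock:
# no two threads can each hold a lock that the other is waiting for.
from threading import Lock

locks = {k: Lock() for k in ("a", "b", "c")}

def validate(write_set):
    held = []
    try:
        for key in sorted(write_set):   # global, consistent order
            locks[key].acquire()
            held.append(key)
        return True                     # ... check read/write sets here ...
    finally:
        for key in reversed(held):
            locks[key].release()

print(validate({"c", "a"}))   # acquires "a" then "c"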


Evaluation Highlights

They implemented a C++ in-memory database engine, THEDB, to test these ideas. THEDB employs LLVM to perform static dependency analysis on stored procedures and includes support for standard database features like inserts, deletes, and range queries (the latter protected against phantoms via B+-tree versioning, as in Silo). The authors evaluate THEDB on a 48-core AMD machine using two common benchmarks: TPC-C and Smallbank. THEDB is compared against five systems: variants of OCC (including Silo-style), 2PL, a hybrid OCC-2PL approach, and a deterministic partitioned system.

The results show that, under high contention, THEDB significantly outperforms the alternatives, achieving up to 6.2x higher throughput than Silo and approaching the performance of an idealized OCC system with validation disabled. This shows that transaction healing adds minimal overhead and successfully eliminates the restart costs that dominate OCC's performance under load. Moreover, THEDB maintains stable throughput as contention increases (e.g., under more skewed Zipfian distributions), while traditional OCC and Silo degrade rapidly. Throughput also scales well up to 48 cores.


Discussion

What are the limitations of the static analysis used?

Transaction healing proposed here is limited to stored procedures because it relies on static dependency extraction. Unlike Morty, which handles interactive transactions using runtime continuations, this work cannot deal with dynamic control flow or unknown transaction logic at runtime. As a result, ad-hoc queries revert to standard OCC, where any healing benefit is lost.

On the other hand, there is some subtlety here. Transaction healing does not require read/write sets to be declared in advance, as deterministic systems like Calvin do. Deterministic systems must know the exact records a transaction will access before it begins execution, so they can assign transactions to partitions and establish a global execution order. Transaction healing avoids this rigidity. It doesn't need to know which specific records a transaction will access ahead of time. Instead, it relies on static analysis to extract the structure of the transaction logic, namely which operations depend on which others. These dependencies, such as key or value dependencies between operations, are known statically because the transaction logic is written as a stored procedure. But the actual keys and values involved are discovered dynamically as the transaction executes. The system uses an access cache to record which memory locations were read or written, and validation happens afterward. This flexibility allows transaction healing to support dynamic, cross-partition access patterns without prior declaration.


How does this compare with Morty?

Transaction Healing is designed for in-memory OLTP systems running with OCC on multicore machines, where the workload consists of static stored procedures. Morty, in contrast, is built for a distributed geo-replicated system and handles interactive transactions with dynamic control flow. It uses MVTSO, with speculative execution and a priori ordering. Unlike THEDB, Morty allows transactions to read from uncommitted versions, exposing concurrency that traditional systems suppress. It tracks execution through continuation-passing style (CPS) to make control dependencies explicit and enable partial re-execution of logic branches. While transaction healing employed LLVM to automate static dependency analysis on stored procedures, Morty did not automate the translation of transaction programs to CPS. Finally, since it is distributed and deployed over WAN, Morty integrates concurrency control with replication to reduce latency and uses quorum voting to maintain fault-tolerant correctness without centralized logging.


August 05, 2025

Planning Ahead for PostgreSQL 18: What Matters for Your Organization

PostgreSQL 18 is on the way, bringing a set of improvements that many organizations will find useful. It’s not a revolutionary release, but it does move things in a good direction, especially in performance, replication, and simplifying daily operations. For teams already using PostgreSQL, it’s a good time to look into what’s new. For others […]

Analysing Snapshot Isolation

This paper (PODC'16) presents a clean and declarative treatment of Snapshot Isolation (SI) using dependency graphs. It builds on the foundation laid by prior work, including the SSI paper we reviewed recently, which had already identified that SI permits cycles with two adjacent anti-dependency (RW) edges, the so-called inConflict and outConflict edges. While the SSI work focused on algorithmic results and implementation, this paper focuses more on the theory (this is PODC after all) of defining a declarative dependency-graph-based model for SI. It strips away implementation details such as commit timestamps and lock management, and provides a purely symbolic framework. It also proves a soundness result (Theorem 10), and leverages the model for two practical static analyses: transaction chopping and robustness under isolation-level weakening.


Soundness result and dependency graph model

Let's begin with Theorem 10, which establishes both the soundness and completeness of the dependency graph characterization of SI. The soundness direction states that any dependency graph satisfying the SI condition (i.e., every cycle contains at least two adjacent RW edges) corresponds to a valid SI execution. The completeness direction, which follows from prior work, asserts that every valid SI execution induces such a dependency graph. The proof of soundness is technically involved, requiring the authors to construct valid SI executions from dependency graphs by solving a system of relational constraints that preserve the required visibility and ordering properties.

Building on foundational work by Adya, this model represents executions as graphs whose nodes are transactions and whose edges capture observable transactional dependencies in terms of 3 edge types: write-read (WR), write-write (WW), and the anti-dependency capturing read-write (RW) edges. The SI, Serializability (SER), and Parallel SI (PSI) isolation levels are then defined in terms of the structural properties in these graphs, specifically by the presence or absence of certain cycles. This abstraction supports symbolic reasoning about anomalies like write skew or long fork manifest as specific, checkable subgraphs. Informally, a WR edge from T to S means that S reads T’s write to some object x; a WW edge means that S overwrites T’s write; and a RW edge indicates that S overwrites the value of x read by T, introducing an anti-dependency.


Definition 4 and Figure 1 provide an elegant axiomatization of abstract executions. The visibility relation (VIS) must always be a subset of the commit order (CO), and in the case of Serializability, the two are equal. In my mind, this captures the key conceptual divide between SER and SI: Serializability enforces a total order over committed transactions, whereas SI permits partial orders.

Figure 2 illustrates the anomalies that differentiate SER, SI, and PSI. Figure 2(d) captures the classic write skew anomaly, which SI allows but SER prohibits. This scenario arises when two transactions read disjoint keys and then write disjoint values based on those reads, each unaware of the other's effects. SI permits this since it allows partial visibility so long as snapshots are consistent. On the other hand, the long fork anomaly shown in Figure 2(c) is prohibited by SI but allowed by PSI, which weakens the snapshot guarantees further.


Applications of the model

The second half of the paper shows applications of the model for static analyses. The first application is transaction chopping, where large transactions are split into smaller subtransactions to improve performance. The challenge here is to ensure that the interleaving of chopped pieces does not introduce new behaviors/anomalies that the original monolithic transaction would have prevented. This is captured through spliceability: whether an execution of chopped transactions can be "stitched back" into an execution that would have been legal under SI for the unchopped program. Spliceability is formulated through a chopping graph, which augments standard dependencies with session-local ordering among chopped subtransactions. A cycle in the chopping graph DCG(G) is considered critical if (1) it does not contain two occurrences of the same vertex, (2) it includes a sequence of three edges where a conflict edge is followed by a session (predecessor) edge and then another conflict edge, and (3) any two RW (anti-dependency) edges in the cycle are not adjacent. Such critical cycles represent dependency patterns that cannot be reconciled with the atomicity guarantees expected by the original transaction, and thus cannot be realized under SI. Figures 4, 5 and 6 illustrate how small structural differences in the chop lead to results that are either sound (Figure 6) or unsound (Figure 5 creates a critical cycle). Compared to serializability, SI's more relaxed visibility rules allow for a wider range of safe chops, but care must still be taken to avoid dependency structures that violate snapshot consistency.
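
The critical-cycle test reads directly off these three conditions. A toy checker (my own encoding of the definition; session edges are labeled "S" and a cycle is a list of (vertex, outgoing-edge-label) pairs):

# Toy check of the critical-cycle conditions on a chopping graph cycle.
CONFLICT = {"WR", "WW", "RW"}   # dependency edges; "S" marks session order

def is_critical(cycle):
    verts = [v for v, _ in cycle]
    labels = [l for _, l in cycle]
    n = len(cycle)
    no_repeat = len(set(verts)) == len(verts)                         # (1)
    csc = any(labels[i] in CONFLICT and labels[(i + 1) % n] == "S"
              and labels[(i + 2) % n] in CONFLICT for i in range(n))  # (2)
    adjacent_rw = any(labels[i] == "RW" and labels[(i + 1) % n] == "RW"
                      for i in range(n))                              # (3)
    return no_repeat and csc and not adjacent_rw

# A conflict edge, then a session edge, then a conflict edge, no adjacent RWs:
print(is_critical([("T1a", "RW"), ("T2", "S"), ("T2b", "WR"), ("T3", "WW")]))  # True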


The second application of the dependency graph model is in analyzing robustness across isolation levels. The central question is whether a program behaves identically under SI and a weaker or stronger model. An interesting case here is the relation between SI and Parallel SI (PSI). We covered PSI in our earlier review of Walter (SOSP 2011). PSI weakens SI by discarding the prefix requirement on snapshots: it ensures only that visibility is transitive, not that it forms a prefix of the commit order. Thus, PSI admits behaviors that SI prohibits. Theorem 22 formalizes one such divergence. It shows that if a cycle in the dependency graph contains at least two RW edges and no two of them are adjacent, then this cycle is allowed under PSI but not under SI. This captures the long fork anomaly, in which concurrent writers are seen inconsistently by different readers (each reader forming a different branch of the history).

To illustrate the long fork, consider a cycle where T1 and T2 are concurrent writers, and two readers, T3 and T4, observe them inconsistently.

  • T1 --WR--> T3  
  • T2 --WR--> T4  
  • T3 --RW--> T2  
  • T4 --RW--> T1  

In this scenario, T3 sees T1's write but not T2's, and T4 sees T2's write but not T1's. Both readers construct transitive but incompatible snapshots that fork the timeline. SI prohibits this because it cannot construct prefix-closed snapshots that explain both T3 and T4's observations. But since PSI lacks the prefix constraint, it allows this behavior, while still disallowing anomalies like lost update (through its NOCONFLICT axiom).

Robustness from SI to PSI therefore requires ruling out that specific structural pattern: cycles with multiple RW edges where none are adjacent. If such a cycle appears in the dependency graph, PSI will admit the behavior, while SI will not, and robustness would fail.
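
This structural test is mechanical. A small sketch of the adjacency check over a cycle's edge labels (my own encoding), applied to the long fork above and to classic write skew:

# SI permits a cycle only if it contains at least two *adjacent* RW edges
# (adjacency wraps around the cycle).
def si_permits(edge_labels):
    n = len(edge_labels)
    return any(edge_labels[i] == "RW" and edge_labels[(i + 1) % n] == "RW"
               for i in range(n))

# Long fork: T1 -WR-> T3 -RW-> T2 -WR-> T4 -RW-> T1
long_fork = ["WR", "RW", "WR", "RW"]
print(si_permits(long_fork))    # False: prohibited under SI, allowed by PSI

# Write skew: T1 -RW-> T2 -RW-> T1 (two adjacent RW edges)
write_skew = ["RW", "RW"]
print(si_permits(write_skew))   # True: SI allows write skew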


Discussion

This does invite comparison to the Seeing is Believing (SiB) paper (PODC'17), one of my favorite papers, and its state-centric formulation of isolation guarantees. In SiB, executions are modeled as sequences of global states and snapshots. Transactions observe one of these states and transition the system to a new one. Isolation models are defined in terms of whether there exists a sequence of global states consistent with the observations and effects of each transaction.

While structurally different, the two models are not in conflict. It appears feasible to translate between the dependency graph and state-centric views. The SI model used in this PODC'16 paper already adopts a declarative, axiomatic approach centered on visibility and commit order that is close to SiB.

For static program analysis, the dependency graph model seems to offer advantages. By abstracting away from global states, it allows symbolic reasoning directly over transactional dependencies. This makes it well-suited to analyses like transaction chopping and robustness checking, which rely on detecting structural patterns such as cycles with certain edge configurations. While the SiB model is semantically expressive and well-suited to observational reasoning, it may be less conducive to structural checks like cycle-freedom or anti-dependency adjacency.

Transaction performance 👉🏻 retry with backoff

A benchmark sponsored by EDB, a PostgreSQL company, in 2019 contributed to the myth that MongoDB transactions are slow. Even though the work was done by the reputable OnGres team, the code wasn't properly designed to test MongoDB's scalability. At that time, the feature was new, likely not well-documented, and some demos overlooked the retry logic. In this context, no one is to blame for past publications, but analyzing this benchmark will help prevent the spread of these myths.

MongoDB uses lock-free optimistic concurrency control (OCC): a write fails as soon as it detects a concurrent change to its MVCC (Multi-Version Concurrency Control) snapshot, which requires applications to handle transient errors differently than in a traditional RDBMS with pessimistic locking and wait-on-conflict behavior. The benchmark developers, PostgreSQL experts, likely missed this because they based the benchmark on a MongoDB demo focused on capabilities, not performance, which neglected proper retry handling.
We should disregard this benchmark today, but this blog post series offers an opportunity to analyze its flaws, debunk myths, and educate readers on effective transaction handling in MongoDB applications.

The problematic code in the MongoDB 4.0 demo from 7 years ago was:

def run_transaction_with_retry(functor, session):
    assert (isinstance(functor, Transaction_Functor))
    while True:
        try:
            with session.start_transaction():
                result=functor(session)  # performs transaction
                commit_with_retry(session)
            break
        except (pymongo.errors.ConnectionFailure, pymongo.errors.OperationFailure) as exc:
            # If transient error, retry the whole transaction
            if exc.has_error_label("TransientTransactionError"):
                print("TransientTransactionError, retrying "
                      "transaction ...")
                continue
            else:
                raise

    return result

It was translated to Java in the benchmark code as:

  private void runWithRetry() {
    while (true) {
      try {
        runnable.run();
        break;
      } catch (RetryUserOperationException ex) {
        retryMeter.mark();
        continue;
      }
    }
  }

If you are familiar with optimistic, fail-on-conflict concurrency control, you may recognize a significant issue: there is no wait (backoff) before the retry. With such an infinite loop, high-concurrency access acts like a DDoS attack on the database rather than resolving contention.

A typical retry loop implements exponential backoff, and here is an example:

private void runWithRetry() {  
    final long initialDelayMillis = 5;      // start with 5ms  
    final long maxDelayMillis = 1000;       // max wait of 1s  
    long delay = initialDelayMillis;  

    while (true) {  
        try {  
            runnable.run();  
            break;  
        } catch (RetryUserOperationException ex) {  
            retryMeter.mark();  
            try {  
                // jitter by up to 50% to avoid thundering herd  
                long jitter = (long) (Math.random() * delay / 2);  
                long sleep = delay + jitter;  
                Thread.sleep(sleep);  
            } catch (InterruptedException ie) {  
                Thread.currentThread().interrupt();  
                // Optionally log or handle interruption here  
                throw new RuntimeException("Retry loop interrupted", ie);  
            }  
            delay = Math.min(delay * 2, maxDelayMillis);  
            continue;  
        }  
    }  
}  

This code makes the first retry wait 5–7.5 ms, then 10–15 ms, 20–30 ms, and so on, up to 1000–1500 ms. If you use Spring Data, you can simply annotate your @Transactional method with:

    @Retryable(
        value = RetryUserOperationException.class,
        maxAttempts = 10,
        backoff = @Backoff(
            delay = 5,        // first delay in ms
            maxDelay = 1000,  // max delay in ms
            multiplier = 2,   // exponential
            random = true     // adds jitter
        )
    )
    @Transactional
    public void runUserOperation() { // hypothetical method holding the transactional work
        // ...
    }

MongoDB employs a similar approach for auto-commit single-document transactions, transparently, so that it appears as if the application is waiting for a lock to be acquired. However, it cannot automatically cancel and retry explicit transactions where the application might perform non-transactional work, such as writing to a file, pushing to a queue, or sending an email. For transactions involving multiple statements, no database can automatically retry the process. The application itself must handle retries.

In PostgreSQL, a conflict might surface as a serialization error at higher isolation levels, and even under the Read Committed isolation level, deadlocks can still occur. PostgreSQL locks data while writing during a transaction, using two-phase locking, and typically waits for the lock to be released. In this case, the impact of an inefficient retry loop is minimal.
However, MongoDB is optimized for high concurrency, allowing it to avoid holding locks between database calls. Instead of waiting, it detects write conflicts instantly and raises a retriable error. Therefore, implementing an efficient retry mechanism is essential.

As I mentioned earlier, there's no one to blame for the benchmark's flaws, as it was created when transactions in MongoDB were relatively new and perhaps not well documented. The problem is that people still reference this benchmark without understanding what was wrong: the poor performance was due to unnecessary retries, because no backoff was implemented in the retry loop.
The authors of the benchmark did look for documentation to explain this behavior, and what they found likely contributed to their decision not to implement backoff in the application, mistakenly thinking the waiting was handled by the database:

Since the probability of collision increases (possibly exponentially) with the effective number of transactions processed, it follows that MongoDB is more eager to retry transactions. This is consistent with the expectation set on MongoDB’s documentation about transactions and locking, which states that “by default, transactions waits up to 5 milliseconds to acquire locks required by the operations in the transaction. If the transaction cannot acquire its required locks within the 5 milliseconds, the transaction aborts”. This behavior can be changed by setting the maxTransactionLockRequestTimeoutMillis parameter.

What is called "lock" here is different from what SQL databases call "lock" with two-phase locking transactions where locks are acquired for the duration of the transaction. MongoDB is lock-free in that sense, using optimistic concurrency control rather than locking. What is called "lock" here is more similar to what SQL databases call "latch" or "lightweight locks", which are short duration and do not span multiple database calls. For such wait, five milliseconds is a good default. But this is not what the benchmark experienced.
Such timeout would raise the following exception: Unable to acquire lock ... within a max lock request timeout of '5ms' milliseconds.
What the benchmark catches in the retry loop is a write conflict, that happens before trying to acquire such short lock: Command failed with error 112 (WriteConflict): 'Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.'

Such a write conflict has nothing to do with maxTransactionLockRequestTimeoutMillis: the operation doesn't even try to acquire such a lock, because there is nothing left to write once transaction isolation (the 'I' in 'ACID') can no longer be preserved. When reading, it has detected that its read snapshot, the state as of the beginning of the transaction, has been modified by another transaction. It doesn't wait, because the snapshot would be stale once the other transaction commits; it immediately returns to the application. The application must compensate or roll back what it did during the transaction, wait a small amount of time (the exponential backoff), and retry.
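A minimal mongosh sketch reproduces this fail-fast behavior (the database and collection names are made up for the illustration):

// Setup: make sure the document exists, outside any transaction.
db.getSiblingDB("test").demo.updateOne({ _id: 1 }, { $set: { n: 0 } }, { upsert: true });

// Two sessions run concurrent transactions touching the same document.
const s1 = db.getMongo().startSession();
const s2 = db.getMongo().startSession();
s1.startTransaction();
s2.startTransaction();

// The first transaction updates the document:
s1.getDatabase("test").demo.updateOne({ _id: 1 }, { $inc: { n: 1 } });

// The second update detects that its snapshot has been modified and fails
// immediately with error 112 (WriteConflict) instead of waiting on a lock:
s2.getDatabase("test").demo.updateOne({ _id: 1 }, { $inc: { n: 1 } });

// The application must abort, back off, and retry:
s2.abortTransaction();
s1.commitTransaction();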

In PostgreSQL, when operating under the Read Committed isolation level, it can wait because it allows reading a state that may not be consistent with the transaction's start time. If a concurrent transaction commits during this time, PostgreSQL simply continues to read the committed data, mixing data from different states. This is not permitted in higher isolation levels, like serializable, and a transient error must be raised, like in MongoDB, to guarantee ACID properties. However, PostgreSQL uses a pessimistic locking approach and waits to see whether the other transaction commits, which allows it to retry immediately once the conflict is resolved. This is why the retry logic without backoff does not have the same consequences.

You may wonder why MongoDB doesn't implement waiting like PostgreSQL does. PostgreSQL is designed for a single-writer instance, which cannot scale horizontally, making it simple to use a wait queue in shared memory. However, when sharding PostgreSQL with the Citus extension, this design breaks down, leading to eventual consistency for cross-shard reads. In contrast, MongoDB is built for horizontal scalability and opts for optimistic concurrency control instead of a distributed wait queue, providing consistent cross-shard reads across nodes (when the read concern is set to majority).

I prefer not to link the benchmark paper to avoid helping search engines or LLM crawlers find outdated content, but it is easy to find. The benchmark code is available in a repository. I prefer to link to the MongoDB transaction documentation instead. Now you know where the myth about slow transactions comes from: incorrect understanding of MongoDB lock-free ACID transactions.

There's more to say about this benchmark. In the benchmark code, there's a hotspot on the "audit" table which is indexed for the PostgreSQL definition, but not for the MongoDB definition. This is visible as MongoDB logs slow queries by default:

mongo-1 | {"t":{"$date":"2025-08-05T21:31:22.655+00:00"},"s":"I", "c":"WRITE", "id":51803, "ctx":"conn31"
,"msg":"Slow query","attr":{"type":"update","isFromUserConnection":true,"ns":"postgres.audit","collectionType":"normal"
,"command":{"q":{"schedule_id":4778,"day":{"$date":"2025-08-05T00:00:00.000Z"}},"u":{"$set":{"date":{"$date":"2025-08-05T21:31:22.533Z"}},"$inc":{"seats_occupied":1}},"multi":false,"upsert":true}
,"planSummary":"COLLSCAN","planningTimeMicros":255,"keysExamined":0,"docsExamined":5962,"nMatched":1,"nModified":1,"nUpserted":0,"keysInserted":0,"keysDeleted":0,"numYields":0,"planCacheShapeHash":"99470B66","queryHash":"99470B66","planCacheKey":"031BFB16"
,"locks":{"MultiDocumentTransactionsBarrier":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":5}}},"flowControl":{"acquireCount":1},"readConcern":{"level":"snapshot","provenance":"clientSupplied"},"storage":{"data":{"txnBytesDirty":128}},"cpuNanos":32601533,"remote":"172.18.0.6:42292","queues":{"execution":{"admissions":1},"ingress":{"admissions":1}},"workingMillis":121,"durationMillis":121}}

To improve performance and scalability, create indexes and avoid hotspots. If hotspots are unavoidable, fail fast by performing the operations subject to write conflicts early in the transaction, rather than at the end as is done here. The data model should allow critical transactions to be single-document, avoiding the need for normalization across multiple tables, but this benchmark uses the same normalized data model on both databases. Finally, no real application performs a business transaction like this: reserving a flight seat, recording the payment, and incrementing an audit counter all in one database transaction. You don't want to hold locks on database state while waiting for a payment validation that typically depends on an external service.
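For example (field names taken from the slow-query log above; the exact index is my suggestion, not the benchmark's), an index matching the update's filter would turn the COLLSCAN into an IXSCAN:

db.audit.createIndex({ schedule_id: 1, day: 1 })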

August 04, 2025

Why MongoDB skips indexes when flattening or renaming sub-document fields in $project before $match aggregation pipeline

MongoDB is a general-purpose database that offers more than just a JSON datastore. It is recognized for its efficient indexing and aggregation capabilities within flexible schemas. The query planner optimizes the aggregation pipeline to filter data early and effectively utilize indexes.
Although this process looks similar to what SQL query planners do, MongoDB stands out by supporting advanced indexing on nested objects and arrays.

It's important to note that some optimizer transformations applicable to tabular data in relational databases may not work the same way with documents. Keeping this in mind helps in understanding why an index is employed or not, as the query planner must ensure consistent results across all possible documents in the flexible schema.

As an example, a projection renaming a JSON dotted path before filtering can alter its semantics and hinder index usage.

Example

I store customer contact information like this:

db.customers.insertMany([
  {
    "customer_id": "C001",
    "contact": {
      "email": "marla.singer@fightclub.com",
      "phone": "555-1010"
    }
  }
]);

{
  acknowledged: true,
  insertedIds: { '0': ObjectId('6890a5530400e901b3d4b0e1') }
}

The subdocument offers flexibility, accommodating customers who may have multiple emails, phone numbers or other contact information. However, I prefer to begin with a simple document that does not include arrays.

To enable efficient lookup by email, I create the following index:

db.customers.createIndex({ "contact.email": 1 })

contact.email_1

In a normalized SQL database, you must establish upfront whether each user can have only one email address. If so, the email should be stored in a column of the customers table, or in a separate "contacts" table with a unique foreign key to the customers table.

In contrast, MongoDB's document model is flexible: you do not need to choose between One-to-One and One-to-Many relationships once and for all future insertions. The data model remains accurate even if a customer gains multiple contacts or a contact has several email addresses, avoiding significant refactoring when business rules change.

If you want to flatten the information in your aggregation pipeline, you may rename the contact's email to a simple top-level "email" field in a $project stage. You can achieve this by writing the following projection in an aggregation pipeline:

  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }

You may additionally search for one email and execute the following aggregation pipeline:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "marla.singer@fightclub.com" }
  }
])

[
  {
    _id: ObjectId('6890a5530400e901b3d4b0e1'),
    customer_id: 'C001',
    email: 'marla.singer@fightclub.com'
  }
]

Intuitively, this seems reasonable, and you test it on your data sample, assuming one email per customer. However, you then realize that your index is not used:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "marla.singer@fightclub.com" }
  }
]).explain().stages[0]['$cursor'].queryPlanner.winningPlan

{
  isCached: false,
  stage: 'PROJECTION_DEFAULT',
  transformBy: { _id: true, customer_id: true, email: '$contact.email' },
  inputStage: { stage: 'COLLSCAN', direction: 'forward' }
}

The $match operation was not pushed down to the $project stage, resulting in a COLLSCAN instead of an IXSCAN.

If you are used to SQL databases and tabular data, you might assume that the projection only renames "contact.email" to "email". This could lead to an expectation that the query planner would transform the expression $match: { "email": ... } to $match: { "contact.email": ... } and use the index on "contact.email".
However, this is not the case in a document database. The reason lies in the potential existence of arrays and embedded documents. Renaming dotted paths to a top-level field can alter the semantics based on the data's structure.

To use your index, you must start with the $match stage and $project later:

db.customers.aggregate([
  {
    $match: { "contact.email": "marla.singer@fightclub.com" }
  },
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
]).explain().queryPlanner.winningPlan

{
  isCached: false,
  stage: 'PROJECTION_DEFAULT',
  transformBy: { _id: true, customer_id: true, email: '$contact.email' },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { 'contact.email': 1 },
      indexName: 'contact.email_1',
      isMultiKey: false,
      multiKeyPaths: { 'contact.email': [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: {
        'contact.email': [ '["marla.singer@fightclub.com", "marla.singer@fightclub.com"]' ]
      }
    }
  }
}

The index was used, and the result appears to be the same. However, the two pipelines are semantically different: results can diverge when array fields are present, which is why the query planner doesn't perform the transformation.

With MongoDB's flexible schema, I can insert sub-documents with different shapes, like an array of multiple contacts, or even arrays of arrays:

db.customers.insertMany([
  {
    customer_id: "C002",
    contact: [
      { email: "robert.paulson@fightclub.com" },
      { phone: "555-2020" }
    ]
  },
  {
    customer_id: "C003",
    contact: [
      { email: ["narrator@fightclub.com", "tyler.durden@fightclub.com"] },
      { email: "jack@fightclub.com" },
    ]
  }
]);

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId('6890a74d0400e901b3d4b0e2'),
    '1': ObjectId('6890a74d0400e901b3d4b0e3')
  }
}

MongoDB accommodates changing business requirements and increasingly complex data structures throughout the application lifecycle. In C002, I introduced an array of contact elements, including email addresses and phone numbers. In C003, I added multiple values for a single element, enabling users to store numerous emails under a single "contact.email" path.

The projection of "contact.email" to the top level still allows the filter to be applied to array elements in C002. However, the projection of C003 results in an array of arrays.

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
])

[
  {
    _id: ObjectId('6890a5530400e901b3d4b0e1'),
    customer_id: 'C001',
    email: 'marla.singer@fightclub.com'
  },
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e2'),
    customer_id: 'C002',
    email: [ 'robert.paulson@fightclub.com' ]
  },
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

With such a projection, $match: { "email": "robert.paulson@fightclub.com" } will find one document that has one item equal to "robert.paulson@fightclub.com". However, $match: { "email": "tyler.durden@fightclub.com" } will find no document: the projection of C003 is an array with one item that is itself an array, so it is returned by $match: { "email": [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ] } but not by $match: { "email": 'tyler.durden@fightclub.com' } alone.

With the projection first, I get a result for C002:


db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "robert.paulson@fightclub.com" }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e2'),
    customer_id: 'C002',
    email: [ 'robert.paulson@fightclub.com' ]
  }
]

However, still with the projection first, I get no result for C003:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "tyler.durden@fightclub.com" }
  }
]);

(no result)

To get C003 I need to match one of the first-level array items fully:


db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ] }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: 'jack@fightclub.com' }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

This is different from a query that filters first, using a dotted path:

db.customers.aggregate([
  {
    $match: { "contact.email": 'tyler.durden@fightclub.com' }
  },
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

When building pipelines or views in MongoDB, avoid renaming dotted fields (e.g., "contact.email" to "email") if you intend to filter on the flattened field in downstream stages, especially in views or multi-stage aggregations. MongoDB's query optimizer cannot safely push a $match down before such a $project, because with arrays, embedded documents, and arrays-of-arrays the rewritten predicate could match different documents.

This differs from SQL, where the schema is fixed and column renaming in views or subqueries is safe. With MongoDB’s flexible schema and support for arrays at any level, such rewrite optimizations are not always correct.

For best performance (index usage), always filter on the original dotted path before projecting/renaming it. This may require placing filtering stages early, or restructuring the pipeline to avoid flattening nested fields you want to use for indexed lookups.

Integrating Citus with Patroni: Sharding and High Availability Together

Citus is a robust PostgreSQL extension that aids in scaling data distribution and provides a solid sharding mechanism. It enriches features like distributed tables, reference tables, columnar storage, schema-based sharding, etc. We have already covered the basics of Citus and the initial setup part in some earlier blog posts: How To Scale a Single-Host PostgreSQL […]

How hard is it to migrate AWAY from Convex?

The video walks through an experiment in “de-lock-in-ifying” a small Convex app: starting with the basic TanStack Start template, the author recreates Convex queries, mutations and actions as TanStack Start server functions; swaps Convex’s reactive data layer for React Query (with manual cache invalidation); and replaces Convex’s built-in cloud database with a self-hosted Postgres instance accessed via Drizzle ORM—eventually wrapping Drizzle in a Convex-style API so most original code can be copy-pasted. They also bolt on transactions, discuss substitutes for other Convex features (file storage, realtime, auth, scheduling, search, etc.), and note that exporting Convex data is straightforward. The upshot: you can migrate off Convex without huge code changes, but you trade Convex’s “batteries-included” simplicity for extra infrastructure to manage—so the easiest escape hatch is still running Convex in self-hosted mode.

August 02, 2025

MongoDB High Availability: Replica Set in a Docker Lab

MongoDB guarantees consistent and durable write operations through write-ahead logging, which protects data from instance crashes by flushing the journal to disk upon commit. It also protects against network partitions and storage failures with synchronous replication to a quorum of replicas. Replication and failover are built in and do not require external tools or extensions. To set up a replica set, start three mongod instances as members of the same replica set using the --replSet option with the same name. To initiate the replica set, connect to one of the nodes and specify all members along with their priorities for becoming primary in the Raft-like election.

To experiment with replication, I run it in a lab with Docker Compose, where each node is a container. However, the network and disk latencies are too small compared to real deployments. I use Linux utilities tc and strace to inject some artificial latencies and test the setup in terms of latency, consistency, and resilience.

For this post, I write to the primary and read from each node to explain the write concern and its consequences for latency. Take this as an introduction. The examples don't show all the details, which also depend on read concerns, sharding, and resilience to failures.

Replica Set

I use the following Dockerfile to add some utilities to the MongoDB image:

FROM mongodb/mongodb-community-server
USER root
RUN apt-get update && apt-get install -y iproute2 strace

I start 3 replicas with the following Docker Compose service:

  mongo:
    build: .
    volumes:
      - .:/scripts:ro
    # inject 100ms network latency and 50ms disk sync latency 
    cap_add:
      - NET_ADMIN   # for tc
      - SYS_PTRACE  # for strace
    command: |
     bash -xc '
     tc qdisc add dev eth0 root netem delay 100ms ;
     strace -e inject=fdatasync:delay_enter=50000 -f -Te trace=fdatasync -o /dev/null mongod --bind_ip_all --replSet rs0 --logpath /var/log/mongod
     '
    deploy:
      replicas: 3

The command injects a 100ms network latency with tc qdisc add dev eth0 root netem delay 100ms (this requires the NET_ADMIN capability). The MongoDB server is started under strace (which requires the SYS_PTRACE capability), injecting a delay of 50000 microseconds (delay_enter=50000) on each call to fdatasync.

I declared a service to initiate the replica set:

  init-replica-set:
    build: .
    depends_on:
      mongo:
        condition: service_started
    entrypoint: |
      bash -xc '
        sleep 3 ; 
        mongosh --host mongo --eval "
         rs.initiate( {_id: \"rs0\", members: [
          {_id: 0, priority: 3, host: \"${COMPOSE_PROJECT_NAME}-mongo-1:27017\"},
          {_id: 1, priority: 2, host: \"${COMPOSE_PROJECT_NAME}-mongo-2:27017\"},
          {_id: 2, priority: 1, host: \"${COMPOSE_PROJECT_NAME}-mongo-3:27017\"}]
         });
        ";
        sleep 1
      '

Read after Write application

I use a service to run the client application:

  client:
    build: .
    depends_on:
      init-replica-set:
        condition: service_completed_successfully
    volumes:
      - .:/scripts:ro
    entrypoint: |
      bash -xc '
        mongosh --host mongo -f /scripts/read-and-write.js
      '

The read-and-write.js script connects to each node with a direct connection, labeled 1️⃣, 2️⃣, and 3️⃣, and also connects to the replica set, labeled 🔢, which writes to the primary and can read from secondary nodes:

const connections = {    
  "🔢": 'mongodb://rs-mongo-1:27017,rs-mongo-2:27017,rs-mongo-3:27017/test?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=true&w=majority&journal=true',    
  "1️⃣": 'mongodb://rs-mongo-1:27017/test?directConnection=true&connectTimeoutMS=900&serverSelectionTimeoutMS=500&socketTimeoutMS=300',    
  "2️⃣": 'mongodb://rs-mongo-2:27017/test?directConnection=true&connectTimeoutMS=900&serverSelectionTimeoutMS=500&socketTimeoutMS=300',    
  "3️⃣": 'mongodb://rs-mongo-3:27017/test?directConnection=true&connectTimeoutMS=900&serverSelectionTimeoutMS=500&socketTimeoutMS=300',    
};    

After defining the connection strings, the script attempts to establish separate connections to each MongoDB node in the replica set, as well as a connection using the replica set URI that can send reads to secondaries. It continuously retries connections until at least one node responds and a primary is detected. The script keeps references to all active connections.

Once the environment is ready, the script enters an infinite loop to perform and monitor read and write operations. On each loop iteration, it first determines the current primary node. It then writes a counter value, which is a simple incrementing integer, to the primary node by updating a document identified by the client’s hostname. After performing the write call, it reads the same document from all nodes—primary, secondaries, and the replica set URI—recording the value retrieved from each and the time it took for the read to return.

For every read and write, the script logs details, including the value read or written, the node that handled the operation, the time it took, and whether the results match expectations. It uses checkmarks to indicate success and issues mismatch warnings if a value is stale. If an operation fails (such as when a node is temporarily unavailable), the script automatically attempts to reconnect to that node in the background for future operations.
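For reference, the core loop looks roughly like this simplified sketch (reconnection logic and the exact logging format are omitted, and the collection name is illustrative; the real script is in the repo below):

// Open one connection per label from the connections map above.
const dbs = Object.fromEntries(
  Object.entries(connections).map(([label, uri]) => [label, new Mongo(uri).getDB("test")])
);
const me = "demo-client"; // the real script uses the container hostname
let counter = 0;
while (true) {
  counter++;
  let t = Date.now();
  // Write the counter to the primary through the replica set connection:
  dbs["🔢"].rw.updateOne({ _id: me }, { $set: { value: counter } }, { upsert: true });
  print(`Write ${counter} to 🔢 (${Date.now() - t}ms)`);
  // Read the same document back from every connection and compare:
  for (const [label, d] of Object.entries(dbs)) {
    t = Date.now();
    const doc = d.rw.findOne({ _id: me });
    const ok = doc && doc.value === counter ? "✅" : "🚫";
    print(`Read ${doc && doc.value} from ${label} ${ok}(${Date.now() - t}ms)`);
  }
}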

I made all this available in the following repo:

https://github.com/FranckPachot/lab-mongodb-replicaset/tree/blog-202507-mongodb-high-availability-replicaset-in-a-docker-lab

Just start it with:


docker compose up --build

Write Concern: majority - wait for network and disk

The connection string specifies w=majority.

Once initialized, each line shows the value that is written to the replica set connection 🔢 and read from each connection 🔢, 1️⃣, 2️⃣, 3️⃣:

Here is a sample output:

client-1            | 2025-07-08T20:19:01.044Z Write 19 to 🔢 ✅(  358ms) Read 19 from 🔢 ✅(  104ms) 19 from 1️⃣ ✅(  105ms) 19 from 2️⃣ ✅(  105ms) 19 from 3️⃣ ✅(  105ms) client e0edde683498
client-1            | 2025-07-08T20:19:02.111Z Write 20 to 🔢 ✅(  357ms) Read 20 from 🔢 ✅(  104ms) 20 from 1️⃣ ✅(  104ms) 20 from 2️⃣ ✅(  105ms) 20 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:03.179Z Write 21 to 🔢 ✅(  357ms) Read 21 from 🔢 ✅(  103ms) 21 from 1️⃣ ✅(  104ms) 21 from 2️⃣ ✅(  103ms) 21 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:04.244Z Write 22 to 🔢 ✅(  357ms) Read 22 from 🔢 ✅(  103ms) 22 from 1️⃣ ✅(  103ms) 22 from 2️⃣ ✅(  104ms) 22 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:05.310Z Write 23 to 🔢 ✅(  357ms) Read 23 from 🔢 ✅(  105ms) 23 from 1️⃣ ✅(  105ms) 23 from 2️⃣ ✅(  104ms) 23 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:06.377Z Write 24 to 🔢 ✅(  357ms) Read 24 from 🔢 ✅(  105ms) 24 from 1️⃣ ✅(  105ms) 24 from 2️⃣ ✅(  104ms) 24 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:07.443Z Write 25 to 🔢 ✅(  357ms) Read 25 from 🔢 ✅(  104ms) 25 from 1️⃣ ✅(  104ms) 25 from 2️⃣ ✅(  104ms) 25 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:08.508Z Write 26 to 🔢 ✅(  357ms) Read 26 from 🔢 ✅(  104ms) 26 from 1️⃣ ✅(  104ms) 26 from 2️⃣ ✅(  104ms) 26 from 3️⃣ ✅(  105ms) client e0edde683498

The program verifies that the read gets the latest write (✅), but keep in mind that this is not guaranteed. The default write concern is 'majority', which is a durability guarantee: it ensures that a write operation is recorded in the journal, on persistent storage, on a majority of replicas. However, it does not wait for the write to be applied to the database and become visible to reads. The goal here is to measure the latency involved in acknowledging durability.

With an artificial latency of 100ms on the network and 50ms on the disk, we observe about 100ms to reach a node for both read and write operations.
For writes, the majority write concern adds another 250ms:

  • 100ms for a secondary to pull the write operation (oplog)
  • 50ms to sync the journal to disk on the secondary
  • 100ms for the secondary to update the sync state to the primary

The total duration is 350ms. It also includes syncing to disk on the primary, which occurs in parallel with the replication.

MongoDB replication differs from many databases in that it employs a mechanism similar to Raft to achieve consistency across multiple nodes. However, changes are pulled by the secondary nodes rather than pushed by the primary. The primary node waits for a commit state, indicated by a Hybrid Logical Clock timestamp, sent by the secondary.

Write Concern: 0 - do not wait for durability

Another difference from traditional databases is that the client driver is part of the consensus protocol. To demonstrate this, I changed w=majority to w=0, so as not to wait for any acknowledgment of the write call, and restarted the client with five replicas of it:


 docker compose up --scale client=5

The write is faster, not waiting on the network or disk, but the value that is read is stale:

client-5            | 2025-07-08T20:48:50.823Z Write 113 to 🔢 🚫(    1ms) Read 112 from 🔢 🚫(  103ms) 113 from 1️⃣ ✅(  103ms) 112 from 2️⃣ 🚫(  103ms) 112 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:48:50.824Z Write 113 to 🔢 🚫(    1ms) Read 112 from 🔢 🚫(  104ms) 113 from 1️⃣ ✅(  104ms) 112 from 2️⃣ 🚫(  104ms) 112 from 3️⃣ 🚫(  104ms) client 787c2676d17e
client-2            | 2025-07-08T20:48:51.459Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  105ms) 114 from 1️⃣ ✅(  104ms) 113 from 2️⃣ 🚫(  105ms) 113 from 3️⃣ 🚫(  104ms) client 9fd577504268
client-1            | 2025-07-08T20:48:51.520Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  105ms) 114 from 1️⃣ ✅(  105ms) 113 from 2️⃣ 🚫(  104ms) 113 from 3️⃣ 🚫(  104ms) client e0edde683498
client-4            | 2025-07-08T20:48:51.522Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  103ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7
client-5            | 2025-07-08T20:48:51.530Z Write 114 to 🔢 🚫(    0ms) Read 113 from 🔢 🚫(  103ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:48:51.532Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  104ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client 787c2676d17e
client-2            | 2025-07-08T20:48:52.168Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client 9fd577504268
client-4            | 2025-07-08T20:48:52.230Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:48:52.229Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  104ms) 115 from 1️⃣ ✅(  104ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client e0edde683498
client-5            | 2025-07-08T20:48:52.237Z Write 115 to 🔢 🚫(    2ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:48:52.240Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client 787c2676d17e
client-2            | 2025-07-08T20:48:52.876Z Write 116 to 🔢 🚫(    1ms) Read 115 from 🔢 🚫(  103ms) 116 from 1️⃣ ✅(  104ms) 115 from 2️⃣ 🚫(  104ms) 115 from 3️⃣ 🚫(  103ms) client 9fd577504268
client-4            | 2025-07-08T20:48:52.936Z Write 116 to 🔢 🚫(    1ms) Read 115 from 🔢 🚫(  103ms) 116 from 1️⃣ ✅(  104ms) 115 from 2️⃣ 🚫(  103ms) 115 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7

The write occurs immediately, succeeding as soon as it is buffered on the driver. While this doesn't guarantee the durability of the acknowledged writes, it does avoid the costs associated with any network latency. In scenarios such as IoT, prioritizing throughput is crucial, even if it means accepting potential data loss during failures.

Because the write is acknowledged immediately but still has to be replicated and applied on the other nodes, I read stale values (indicated by 🚫), except when the time to read was higher than the time to replicate and apply, and there's no guarantee of that.

Write Concern: 1, journal: false

I adjusted the write concern to w=1, which means that the system will wait for acknowledgment from the primary node. By default, this acknowledgment ensures that the journal recording the write operation is saved to persistent storage. However, I disabled it by setting journal=false, allowing the write latency to be reduced to just the network time to the primary, which is approximately 100ms:

client-2            | 2025-07-08T20:50:08.756Z Write 10 to 🔢 (  104ms) Read 10 from 🔢 (  105ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  104ms) 10 from 3️⃣ (  104ms) client 9fd577504268
client-4            | 2025-07-08T20:50:08.949Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  105ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  106ms) 10 from 3️⃣ (  105ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:50:08.952Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  104ms) 10 from 1️⃣ (  104ms) 10 from 2️⃣ (  104ms) 10 from 3️⃣ (  105ms) client e0edde683498
client-3            | 2025-07-08T20:50:08.966Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  104ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  104ms) 10 from 3️⃣ (  104ms) client 787c2676d17e
client-5            | 2025-07-08T20:50:08.970Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  105ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  105ms) 10 from 3️⃣ (  105ms) client e0e3c8b1bafd
client-2            | 2025-07-08T20:50:09.569Z Write 11 to 🔢 (  103ms) Read 11 from 🔢 (  104ms) 11 from 1️⃣ (  104ms) 11 from 2️⃣ (  104ms) 11 from 3️⃣ (  104ms) client 9fd577504268
client-4            | 2025-07-08T20:50:09.762Z Write 11 to 🔢 (  104ms) Read 10 from 🔢 🚫(  105ms) 11 from 1️⃣ (  106ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  105ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:50:09.765Z Write 11 to 🔢 (  103ms) Read 11 from 🔢 (  107ms) 10 from 1️⃣ 🚫(  104ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  106ms) client e0edde683498
client-3            | 2025-07-08T20:50:09.778Z Write 11 to 🔢 (  105ms) Read 11 from 🔢 (  104ms) 11 from 1️⃣ (  105ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  104ms) client 787c2676d17e
client-5            | 2025-07-08T20:50:09.782Z Write 11 to 🔢 (  103ms) Read 11 from 🔢 (  105ms) 11 from 1️⃣ (  104ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  105ms) client e0e3c8b1bafd
client-2            | 2025-07-08T20:50:10.381Z Write 12 to 🔢 (  103ms) Read 11 from 🔢 🚫(  105ms) 11 from 1️⃣ 🚫(  105ms) 12 from 2️⃣ (  105ms) 12 from 3️⃣ (  105ms) client 9fd577504268
client-1            | 2025-07-08T20:50:10.578Z Write 12 to 🔢 (  104ms) Read 12 from 🔢 (  106ms) 12 from 1️⃣ (  105ms) 12 from 2️⃣ (  105ms) 12 from 3️⃣ (  106ms) client e0edde683498
client-4            | 2025-07-08T20:50:10.579Z Write 12 to 🔢 (  104ms) Read 12 from 🔢 (  106ms) 12 from 1️⃣ (  106ms) 12 from 2️⃣ (  105ms) 12 from 3️⃣ (  105ms) client a6c1eaab69a7
client-5            | 2025-07-08T20:50:10.594Z Write 12 to 🔢 (11751ms) Read 11 from 🔢 🚫(  106ms) 12 from 1️⃣ (  106ms) 11 from 2️⃣ 🚫(  106ms) 11 from 3️⃣ 🚫(  105ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:50:10.592Z Write 12 to 🔢 (11753ms) Read 11 from 🔢 🚫(  105ms) 12 from 1️⃣ (  105ms) 11 from 2️⃣ 🚫(  105ms) 11 from 3️⃣ 🚫(  105ms) client 787c2676d17e

It is important to understand the consequences of failure. The change is written to the filesystem buffers but may not have been fully committed to disk since fdatasync() is called asynchronously every 100 milliseconds. This means that if the Linux instance crashes, up to 100 milliseconds of acknowledged transactions could be lost. However, if the MongoDB instance fails, there is no data loss, as the filesystem buffers remain intact.

Write Concern: 1, journal: true

Still with w=1, but the default journal=true, an fdatasync() is run before the acknowledgment of the write, to guarantee durability on that node. With my injected latency, it adds 50 milliseconds:

client-1            | 2025-07-08T20:52:34.922Z Write 48 to 🔢 ✅(  155ms) Read 48 from 🔢 ✅(  105ms) 48 from 1️⃣ ✅(  105ms) 47 from 2️⃣ 🚫(  105ms) 48 from 3️⃣ ✅(  105ms) client e0edde683498
client-3            | 2025-07-08T20:52:35.223Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  105ms) 49 from 2️⃣ 🚫(  105ms) 50 from 3️⃣ ✅(  105ms) client 787c2676d17e
client-2            | 2025-07-08T20:52:35.276Z Write 49 to 🔢 ✅(  155ms) Read 49 from 🔢 ✅(  104ms) 49 from 1️⃣ ✅(  105ms) 48 from 2️⃣ 🚫(  105ms) 49 from 3️⃣ ✅(  105ms) client 9fd577504268
client-5            | 2025-07-08T20:52:35.377Z Write 49 to 🔢 ✅(  155ms) Read 49 from 🔢 ✅(  105ms) 49 from 1️⃣ ✅(  104ms) 48 from 2️⃣ 🚫(  105ms) 49 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd
client-4            | 2025-07-08T20:52:35.430Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  105ms) 49 from 2️⃣ 🚫(  105ms) 50 from 3️⃣ ✅(  105ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:52:35.785Z Write 49 to 🔢 ✅(  154ms) Read 49 from 🔢 ✅(  103ms) 49 from 1️⃣ ✅(  103ms) 48 from 2️⃣ 🚫(  103ms) 49 from 3️⃣ ✅(  103ms) client e0edde683498
client-3            | 2025-07-08T20:52:36.086Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  104ms) 51 from 1️⃣ ✅(  105ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client 787c2676d17e
client-2            | 2025-07-08T20:52:36.140Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  105ms) 50 from 1️⃣ ✅(  104ms) 49 from 2️⃣ 🚫(  104ms) 50 from 3️⃣ ✅(  105ms) client 9fd577504268
client-5            | 2025-07-08T20:52:36.241Z Write 50 to 🔢 ✅(  155ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  103ms) 49 from 2️⃣ 🚫(  103ms) 50 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd
client-4            | 2025-07-08T20:52:36.294Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  102ms) 51 from 1️⃣ ✅(  103ms) 50 from 2️⃣ 🚫(  103ms) 51 from 3️⃣ ✅(  103ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:52:36.645Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  103ms) 50 from 1️⃣ ✅(  103ms) 49 from 2️⃣ 🚫(  103ms) 50 from 3️⃣ ✅(  103ms) client e0edde683498
client-3            | 2025-07-08T20:52:36.950Z Write 52 to 🔢 ✅(  154ms) Read 52 from 🔢 ✅(  104ms) 52 from 1️⃣ ✅(  103ms) 51 from 2️⃣ 🚫(  103ms) 52 from 3️⃣ ✅(  104ms) client 787c2676d17e
client-2            | 2025-07-08T20:52:37.003Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  105ms) 51 from 1️⃣ ✅(  105ms) 50 from 2️⃣ 🚫(  105ms) 51 from 3️⃣ ✅(  104ms) client 9fd577504268
client-5            | 2025-07-08T20:52:37.103Z Write 51 to 🔢 ✅(  155ms) Read 51 from 🔢 ✅(  103ms) 51 from 1️⃣ ✅(  104ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd
client-4            | 2025-07-08T20:52:37.155Z Write 52 to 🔢 ✅(  155ms) Read 52 from 🔢 ✅(  104ms) 52 from 1️⃣ ✅(  104ms) 51 from 2️⃣ 🚫(  104ms) 52 from 3️⃣ ✅(  103ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:52:37.508Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  104ms) 51 from 1️⃣ ✅(  104ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client e0edde683498

In summary, MongoDB allows applications to balance performance (lower latency) and durability (resilience to failures) rather than relying on a one-size-fits-all configuration that waits even when the business requirements do not demand it. For any given setup, the choice must consider the business requirements as well as the infrastructure: resilience of compute and storage services, local or remote storage, and network latency between nodes. In a lab, injecting network and disk latency helps simulate scenarios that illustrate the consequences of reading from secondary nodes or recovering from a failure.

To fully understand how it works, I recommend checking your understanding against the documentation on Write Concern and practicing in a lab. The defaults may vary per driver and version, and the consequences may not be visible without a high load or a failure. In current versions, MongoDB favors data protection, with the write concern defaulting to "majority" and journaling to true (writeConcernMajorityJournalDefault), but if you set w:1, journaling defaults to false.
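As a quick check, recent MongoDB versions let you query the cluster-wide defaults from mongosh:

// Returns the default read and write concerns currently in effect.
db.adminCommand({ getDefaultRWConcern: 1 })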

Enhancing PostgreSQL Security: How to Encrypt the pgBackRest Repository

This post was originally written in 2023 and was updated in 2025. Encryption protects sensitive information by converting it into an unreadable format unless the correct passphrase or decryption key is supplied. In this post, we’ll look at how to encrypt the pgBackRest repository. pgBackRest is the backup solution widely used for PostgreSQL database backups, […]

PostgreSQL Database Security: External Server-Based Authentication

This post was originally written in 2023 and was updated in 2025. Protecting sensitive information in PostgreSQL starts with strong authentication. It’s the process of verifying who’s trying to connect, whether through usernames and passwords, tokens, or one-time codes. For DBAs, that means not only choosing the right method but also monitoring and auditing logs […]

August 01, 2025

Postgres 18 beta2: large server, Insert Benchmark, part 2

I repeated the benchmark for one of the workloads used in a recent blog post on Postgres 18 beta2 performance. The workload used 1 client and 1 table with 50M rows that fits in the Postgres buffer pool. In the result I explain here, one of the benchmark steps was run for ~10X more time. Figuring out how long to run the steps in the Insert Benchmark is always a work in progress -- I want to test more things, so I don't want to run steps for too long, but there will be odd results if the run times are too short.

tl;dr

  • up to 2% less throughput on range queries in the qr100 benchmark step. This is similar to what I saw in my previous report.
  • up to 12% more throughput on the l.i2 benchmark step in PG beta1 and beta2. This is much better than what I saw in my previous report.

Details

Details on the benchmark are in my previous post.

The benchmark is explained here and was run for one workload -- 1 client, cached.

  • run with 1 client, 1 table and a cached database
  • load 50M rows in step l.i0, do 160M writes in step l.i1 and 40M in l.i2. Note that here the l.i1 and l.i2 steps run for ~10X longer than in my previous post.
The benchmark steps are:

  • l.i0
    • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y million rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z million rows are inserted and deleted per table.
    • Wait for N seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of N is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base), where $me is the result for some version and $base is the result for Postgres 17.4.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.
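As an illustration with made-up numbers: if Postgres 18 beta2 sustains 980 range queries/s on qr100 while Postgres 17.4 sustains 1000, then rQPS = 980 / 1000 = 0.98, which lands in the grey band.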

Results: 1 client, cached

Normally I summarize the summary but I don't do that here to save space.

But the tl;dr is:
  • up to 2% less throughput on range queries in the qr100 benchmark step. This is similar to what I saw in my previous report.
  • up to 12% more throughput on the l.i2 benchmark step in PG beta1 and beta2. This is much better than what I saw in my previous report.