a curated list of database news from authoritative sources

December 16, 2024

Utilizing highly synchronized clocks in distributed databases

This master's thesis from Lund University, Sweden, explores how CockroachDB's transactional performance can be improved by using tightly synchronized clocks. The paper addresses two questions: how to integrate high-precision clock synchronization into CockroachDB, and what the resulting impact on performance is. Given publicly available clock synchronization technologies such as Amazon Time Sync Service and ClockBound, the researchers (Jacob and Fabian) argue that the traditional skepticism about the reliability of clocks in distributed systems is outdated.


CockroachDB vs Spanner approaches

CockroachDB uses loosely synchronized NTP clocks and achieves linearizability by using Hybrid Logical Clocks (HLC) and relying on a static maximum offset (max_offset = 500 ms) to account for clock skew. However, this approach has limitations, particularly in handling transactional conflicts within a predefined uncertainty interval. When a transaction reads a value with a timestamp falling within this interval, it triggers an uncertainty restart to ensure consistency. These restarts, categorized as necessary (caused by actual clock skew) or unnecessary (caused by latency), increase transaction latency. By using tight clock synchronization to shrink the uncertainty intervals, it is possible to achieve performance gains.

To investigate this further, the paper compares CockroachDB's approach with Google's Spanner, which uses the TrueTime API. TrueTime provides bounded timestamps that guarantee the actual time falls within an interval. This enables Spanner to ensure linearizability using commit-wait, which delays transaction commits until the timestamp bounds guarantee consistency. That is, a transaction must wait until tmax, the upper bound of the transaction's timestamp, has passed before committing and releasing its locks. Since tmax is an upper bound on timestamps generated across all nodes, the changes made by the transaction are guaranteed to be visible across the entire cluster, achieving linearizability.
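To make the commit-wait rule concrete, here is a minimal Go sketch (my own illustration, not Spanner's or CockroachDB's code). It assumes a hypothetical boundedNow() that returns an interval guaranteed to contain the true time, and spins until the upper bound of the commit timestamp has certainly passed before the locks are released.

```go
package main

import "time"

// Bound is a bounded timestamp: the true time is guaranteed to lie in
// [Earliest, Latest]. Hypothetical type, for illustration only.
type Bound struct {
	Earliest, Latest time.Time
}

// boundedNow stands in for a TrueTime/ClockBound-style clock service.
func boundedNow() Bound {
	now := time.Now()
	eps := 500 * time.Microsecond // assumed clock error bound
	return Bound{Earliest: now.Add(-eps), Latest: now.Add(eps)}
}

// commitWait waits until the upper bound (tmax) of the commit timestamp has
// certainly passed, so the transaction's writes are visible cluster-wide
// before its locks are released.
func commitWait(commitTS Bound) {
	for boundedNow().Earliest.Before(commitTS.Latest) {
		time.Sleep(50 * time.Microsecond)
	}
	// safe to release locks and acknowledge the commit here
}

func main() {
	ts := boundedNow() // commit timestamp assigned at commit time
	commitWait(ts)     // typically a sub-millisecond wait with tight clocks
}
```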

To adopt tightly synchronized clocks for CockroachDB, the paper considers two main approaches:

  1. Adopting a commit-wait model: This approach, inspired by Google Spanner, involves waiting for a certain period after a transaction acquires all necessary locks, to ensure that its changes are visible across the entire cluster. However, the implementation complexity was deemed too significant for this project, and the approach was not pursued.
  2. Dynamically adjusting uncertainty intervals: This approach dynamically shrinks the uncertainty intervals by leveraging the bounded timestamps provided by high-precision clock synchronization services like Amazon Time Sync. By adjusting these intervals based on the actual clock error bound, the number of unnecessary transaction restarts can be significantly reduced, leading to improved performance (a rough sketch of this idea follows the list).
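As a rough sketch of the second approach (again my own illustration, not the thesis code): assume a ClockBound/TrueClock-style service that returns an earliest/latest pair, and derive each read's uncertainty window from that pair instead of from the static 500 ms max_offset.

```go
package main

import (
	"fmt"
	"time"
)

// Bound is a bounded timestamp from a ClockBound/TrueClock-style service:
// the true time lies within [Earliest, Latest]. Hypothetical type.
type Bound struct {
	Earliest, Latest time.Time
}

func boundedNow() Bound {
	now := time.Now()
	eps := 700 * time.Microsecond // assumed clock error bound, for illustration
	return Bound{Earliest: now.Add(-eps), Latest: now.Add(eps)}
}

const staticMaxOffset = 500 * time.Millisecond // CockroachDB's default max_offset

// uncertaintyUpperBound returns the timestamp below which a read may trigger
// an uncertainty restart. With the static offset this is readTS + 500 ms;
// with bounded timestamps it shrinks to the clock's current error bound.
func uncertaintyUpperBound(readTS time.Time, dynamic bool) time.Time {
	if !dynamic {
		return readTS.Add(staticMaxOffset)
	}
	b := boundedNow()
	return readTS.Add(b.Latest.Sub(b.Earliest)) // interval width = 2 * eps
}

func main() {
	readTS := time.Now()
	// A value whose timestamp falls within the uncertainty window forces a
	// restart; a smaller window means fewer unnecessary restarts.
	fmt.Println("static window: ", uncertaintyUpperBound(readTS, false).Sub(readTS))
	fmt.Println("dynamic window:", uncertaintyUpperBound(readTS, true).Sub(readTS))
}
```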


TrueClock and ClockBound   

The authors implemented the second approach by modifying CockroachDB to use a TrueTime-like API backed by ClockBound. They ran into practical challenges during implementation. Initial tests revealed that retrieving bounded timestamps from ClockBound introduced significant latency (50 microseconds) compared to standard system clock readings (60 nanoseconds). To address this, they ported the open-source ClockBound daemon to a Go library called TrueClock, which allowed them to include it directly in CockroachDB (itself written in Go). This removed the need for a datagram request on every clock reading and reduced clock-reading latency from 50 microseconds to 250 nanoseconds, a negligible overhead for database operations.


Evaluation

To test their hypothesis and evaluate the transactional performance gains, the researchers modified CockroachDB to replace the static max_offset with dynamic values calculated from ClockBound's real-time synchronization error bounds. They deployed a CockroachDB cluster of three replicas across three availability zones in the eu-north-1 (Stockholm) region. As predicted, this change significantly reduced the uncertainty interval size, improving the database's ability to process transactions efficiently. The uncertainty intervals shrank by a factor of more than 500 compared to the standard configuration, with maximum bounds of 1.4 milliseconds versus the default 500 milliseconds.

Experiments showed significant performance improvements, with read latency reduced by up to 47% and write latency by up to 43%. The authors attribute these gains partly to reduced latch contention. In CockroachDB, latches are used to serialize operations, even for reads. Shorter uncertainty intervals allow latches to be released sooner, reducing contention and the time writes must wait for reads to complete. Interestingly, although only reads are directly affected by uncertainty restarts, reducing their latency also indirectly benefited write performance due to lower contention.


Conclusion

This work emphasizes the importance of embracing high-precision clocks as they become increasingly accessible in production environments. By dynamically adapting to real-time synchronization precision instead of making static, pessimistic assumptions about clock skew, the experiments showed improved performance even with the read latches still in place.

This highlights the potential of integrating modern clock synchronization methods into distributed databases. The results suggest that tighter synchronization not only improves transactional throughput but also offers a feasible path to achieving stronger consistency models like linearizability without significant overhead. The guiding principle should be to use clock synchronization for performance, not for correctness.



December 14, 2024

In search of a faster SQLite

Researchers at the University of Helsinki and Cambridge attempted to build a faster SQLite using modern programming paradigms like io_uring and disaggregated storage. They demonstrate up to a 100x reduction in tail latency. These are my notes.

December 13, 2024

What are important data systems problems, ignored by research?

In November, I had the pleasure of attending the Dutch-Belgian DataBase Day, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some papers that can serve as motivation and starting points for research in these areas.

One significant yet understudied problem raised by multiple panelists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are ubiquitous. For instance, Amazon Redshift recently reported that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings make up a substantial majority of real-world data by volume. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, despite the impact of strings on real-world query performance and storage consumption, surprisingly little research has been dedicated to efficient database-specific string compression (there are some exceptions).

Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks.

While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.

While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as network connection handling (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though communication bottlenecks frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.

One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.

December 11, 2024

Best of Metadata in 2024

I can't believe we wasted another good year. It is time to reflect back on the best posts at Metadata blog in 2024. (I think you guys should tip me just because I didn't call this post "Metadata wrapped".)

Distributed systems posts

Transactional storage for geo-replicated systems (SOSP11): I like this paper because it asked the right questions and introduced parallel snapshot isolation. No individual part is novel (vector clocks, csets), but their composition and application to WAN web applications is. Walter showed how to limit WAN coordination while still supporting useful applications.

An Hourglass Architecture for Distributed Systems (SRDS24 Keynote): This work successfully bridges theoretical research and practical implementation of large-scale distributed systems in the Facebook/Meta control plane. The shared log abstraction separates the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the database (SMR) layer to focus on higher-level concerns without dealing with the complexities of consensus protocols.
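The shared log API mentioned above is small enough to sketch as a Go interface. The method names below are my own paraphrase of the four operations (append, check tail, read, trim), not the actual API from the keynote.

```go
package sharedlog

// Entry is an opaque record appended to the shared log.
type Entry []byte

// Log captures the shared-log abstraction: the database layer appends and
// reads entries, while the log layer handles consensus internally.
type Log interface {
	// Append durably adds an entry and returns its position in the log.
	Append(e Entry) (pos int64, err error)
	// Tail returns the position just past the last durable entry.
	Tail() (int64, error)
	// Read returns the entries in the half-open range [from, to).
	Read(from, to int64) ([]Entry, error)
	// Trim garbage-collects the log prefix before pos.
	Trim(pos int64) error
}
```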

Linearizability: A Correctness Condition for Concurrent Objects (TOPLAS90): This is a foundational paper on linearizability. I gave this paper a critical read to point out things it did well versus some shortcomings.

Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple (PAPOC24): This paper brings together the work and ideas around two-phase commit and consensus protocols. It is thought-provoking to consider the tradeoffs between the two.

FlexiRaft: Flexible Quorums with Raft (CIDR23): The paper describes how they applied Raft to MySQL replication and used flexible quorums in the process. This is not a technically deep paper, but it was interesting to see a practical application of the flexible quorums idea to Raft rather than Paxos. The most technically interesting part is that adopting flexible quorums in Raft (rather than Paxos) requires imposing an extra requirement on quorums in order to guarantee Leader Completeness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term."

Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database (SIGMOD24): Amazon MemoryDB is a fully managed in-memory database service that leverages Redis's performance strengths while overcoming its durability limitations. It uses Redis as the in-memory data processing engine but offloads persistence to an AWS-internal transaction log service (internally known as the journal). This decoupled architecture provides in-memory performance with microsecond reads and single-digit millisecond writes, while ensuring durability across availability zones (AZs), 99.99% availability, and strong consistency in the face of failures.

Fault-Tolerant Replication with Pull-Based Consensus in MongoDB (NSDI21): Raft provides fault-tolerant state machine replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication, but MongoDB uses a pull-based replication scheme, which caused challenges when integrating MongoDB's SMR with Raft. The paper focuses on examining and solving these challenges, and on explaining the resulting MongoSMR protocol.

Tunable Consistency in MongoDB (VLDB19): This paper discusses the tunable consistency models in MongoDB and how MongoDB's speculative execution model and data rollback protocol efficiently enable a spectrum of consistency levels.


Databases posts

I participated in reading groups that covered two database books in 2024, and blogged about the chapters. I had 7 posts related to the Transaction Processing book by Gray and Reuter and 14 posts related to the Designing Data Intensive Applications book. Both books were great for developing a good understanding of databases.

Scalable OLTP in the Cloud: What’s the BIG DEAL? (CIDR24): In this paper Pat Helland argues that the answer lies in the joint responsibility of the database and the application. The BIG DEAL splits the scaling responsibilities between the two: *scalable DBs* don’t coordinate across disjoint TXs updating different keys, and *scalable apps* don’t concurrently update the same key. So, snapshot isolation is a BIG DEAL!

Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases (OSDI23): Chardonnay provides strictly serializable general read-write transactions via 2PC+2PL in-memory quickly for a single-datacenter deployment. It also provides lock-free (contention-free) serializable read-only transactions from snapshots (Chardonnay is multiversion in that sense) that are taken every epoch (10ms).

Looking back at Postgres (2019): This article covers Postgres's origin story and provides a nice retrospective and context about its development and features.

DBSP: Automatic Incremental View Maintenance for Rich Query Languages (VLDB23): DBSP is a simplified version of differential dataflow. It assumes linear synchronous time, and in return provides powerful compositional properties. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This allows independent optimization of query components, enabling efficient and modular query processing. 

Understanding the Performance Implications of Storage-Disaggregated Databases (SIGMOD24): This paper conducts a comprehensive study of the performance implications of storage-disaggregated databases. The work addresses several critical performance questions that had been obscured by the closed-source nature of these systems.


AI posts

Auto-WLM: machine learning enhanced workload management in Amazon Redshift (SIGMOD23): Auto-WLM is a machine-learning-based automatic workload manager currently used in production in Amazon Redshift. This turned out to be a practical, applied data systems paper rather than a deep learning or advanced machine learning paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, possibly the first for a database system in production at scale.

The demise of coding is greatly exaggerated: This is my response to NVIDIA CEO Jensen Huang's remarks: "Over the course of the last 10 years, 15 years, almost everybody who sits on a stage like this would tell you that it is vital that your children learn computer science, and everybody should learn how to program. And in fact, it’s almost exactly the opposite."

Looming Liability Machines (LLMs): The use of LLMs for automatic root cause analysis (RCA) of cloud incidents spooked me viscerally. I am not suggesting a Butlerian Jihad against LLMs. But I am worried that we are too enticed by them. OK, let's use them, but maybe we shouldn't open the fort doors to let them in. I worry about automation surprise, systemic failures, and getting lazy by trading thinking for superficial answers.


TLA+ posts

Exploring the NaiadClock TLA+ model in TLA-Web: I have been impressed by the usability of TLA-Web from Will Schultz. Recently I have been using it for my TLA+ modeling of MongoDB catalog protocols internally, and found it very useful to explore and understand behavior. This got me thinking that TLA-Web would be really useful when exploring and understanding an unfamiliar spec I picked up on the web.

TLA+ modeling of a single replicaset transaction modeling: For some time I had been playing with transaction modeling, and most recently with replicaset modeling by way of a single log. While playing with these, I realized I could build something cool on top of these building blocks. I just finished building a snapshot-isolated transaction model that sits on top of a replicaset log. This is also a high-level approximation of MongoDB-style snapshot-isolated transactions on a single replicaset.

TLA+ modeling of MongoDB logless reconfiguration: This is a walkthrough of the TLA+ specs for the MongoDB logless reconfiguration protocol we've reviewed here.


Miscellaneous posts

The titles in this section are pretty self-explanatory, and this is mostly nontechnical writing. So dig into anything that attracts your interest.

Advice to the young

Why I blog

Know yourself

700

TLA+ conference 2024

SRDS Day 1

HPTS'24 Day 1, part 1

The checklist manifesto (Dr. Atul Gawande, 2009)



Previous years in review




Best of metadata in 2021


Best of metadata in 2020


Best of metadata in 2019


Best of metadata in 2018


Research, writing, and career advice


December 10, 2024

Index for Designing Data Intensive Applications (DDIA) book

The DDIA book is a great textbook, because it is not written like a textbook, but more like a guidebook.

Textbooks are generally bland and boring. Textbooks written by professors even more so, because those are often written to impress other professors and to flaunt academic flair. Few textbooks take teaching as the primary goal.

The DDIA book has clear writing, and it is pragmatic. It is as if your smart colleague is filling you in on the fundamentals as well as the intricacies (dirty secrets) of their field. It is genuine and authentic.

Kudos Kleppmann!


Here are my summaries of the DDIA book chapters. A new version of the book is coming soon, and I look forward to seeing the updated content.


Designing Data Intensive Applications (DDIA) Book, Chp 1. Intro and Chp2. Data Models and Query Languages

DDIA: Chp 3. Storage and Retrieval (Part 1)

DDIA: Chp 3. Storage and Retrieval (Part 2)

DDIA: Chp 4. Encoding and Evolution (Part 1)

DDIA: Chp 4. Encoding and Evolution (Part 2)

DDIA: Chp 5. Replication (Part 1)

DDIA: Chp 5. Replication (Part 2)

DDIA: Chp 6. Partitioning

DDIA: Chp 7. Transactions (Part 1)

DDIA: Chp 7. Transactions (Part 2): Serializability

DDIA: Chp 8. The Trouble with Distributed Systems

DDIA: Chp 9. Consistency and Consensus

DDIA: Chp 10. Batch Processing

DDIA: Chp 11. Stream Processing

DDIA: Chp 12. The Future of Data Systems: TK


C++ exception performance three years later

About three years ago we noticed serious performance problems in C++ exception unwinding. Due to contention on the unwinding path, these became more and more severe the more cores a system had, and unwinding could slow down by orders of magnitude. Due to the constraints of backwards compatibility this contention was not easy to eliminate, and P2544 discussed ways to fix the problem via language changes in C++.

But fortunately people found less invasive solutions. First, Florian Weimer changed glibc to provide a lock-free mechanism for finding the (static) unwind tables of a given shared object, which eliminates the most serious contention for "simple" C++ programs. For example, in a micro-benchmark that calls a function with some computations (100 calls to sqrt per function invocation), and which throws with a certain probability, we previously had very poor scalability with increasing core count. With his patch we now see the following performance development with gcc 14.2 on a dual-socket EPYC 7713 (runtime in ms):


threads        1    2    4    8   16   32   64  128
0% failure    29   29   29   29   29   29   29   42
0.1% failure  29   29   29   29   29   29   29   32
1% failure    29   30   30   30   30   30   32   34
10% failure   36   36   37   37   37   37   47   65

This is more or less perfect. 128 threads are a bit slower, but that is to be expected, as one EPYC only has 64 cores. With higher failure rates unwinding itself becomes slower, but that is still acceptable here. Thus most C++ programs are just fine.

For our use case that is not enough, though. We dynamically generate machine code at runtime, and we want to be able to pass exceptions through generated code. The _dl_find_object mechanism of glibc is not used for JITed code; instead, libgcc maintains its own lookup structure. Historically this was a simple list with a global lock, which of course had terrible performance. But through a series of patches we managed to change libgcc to use a lock-free b-tree for maintaining the dynamic unwinding frames. Using a similar experiment to the one above, but now with JIT-generated code (using LLVM 19), we get the following:


threads        1    2    4    8   16   32   64  128
0% failure    32   38   48   64   48   36   59   62
0.1% failure  32   32   32   32   32   48   62   68
1% failure    41   40   40   40   53   69   80   83
10% failure  123  113  103  116  128  127  131  214

The numbers have more noise than for statically generated code, but the overall observation is the same: unwinding now scales with the number of cores, and we can safely use C++ exceptions even on machines with large core counts.

So is everything perfect now? No. First, only gcc has a fully scalable frame lookup mechanism. clang has its own implementation, and as far as I know it still does not scale properly due to a global lock in DwarfFDECache. Note that at least on many Linux distributions clang uses libgcc by default, thus the problem is not immediately obvious there, but a pure llvm/clang build will have scalability problems. And second, unwinding through JITed code is quite a bit slower, which is unfortunate. But admittedly the problem is less severe than shown here; the benchmark with JITed code simply has one more stack frame to unwind due to the way static code and JITed code interact. And it makes sense to prioritize static unwinding over dynamic unwinding frames, as most people never JIT-generate code.

Overall we are now quite happy with the unwinding mechanism. The bottlenecks are gone, and performance is fine even with high core counts. It is still not appropriate for high failure rates, something like P0709 would be better for that, but we can live with the status quo.



DDIA: Chapter 11 - Stream Processing


Daily batch processes introduce significant latency, since input changes are reflected in the output only after a day. For fast-paced businesses, this is too slow. To reduce delays, stream processing runs more frequently (e.g., every second) or continuously, handling events as they happen.

In stream processing, a record is typically called an event—a small, immutable object containing details of an occurrence, often with a timestamp. Polling for new events becomes costly when striving for low-latency continuous processing. Frequent polling increases overhead as most requests return no new data. Instead, systems should notify consumers when new events are available. Messaging systems handle this by pushing events from producers to consumers.

Direct messaging systems require application code to handle message loss and assume producers and consumers are always online, limiting fault tolerance. Message brokers (or message queues) improve reliability by acting as intermediaries. Producers write messages to the broker, which clients (consumers) read. Examples include RabbitMQ, ActiveMQ, Azure Service Bus, and Google Cloud Pub/Sub.

Partitioned logs store messages as append-only logs, assigning offsets to messages within each partition. Log-based brokers like Apache Kafka, Amazon Kinesis Streams, and Twitter’s DistributedLog prioritize high throughput and message ordering. Google Cloud Pub/Sub offers a similar architecture but exposes a JMS-style API. Log-based brokers suit high-throughput scenarios with fast, ordered message processing. In contrast, JMS/AMQP brokers work better when messages are expensive to process, order isn't critical, and parallel processing is needed.
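As a toy illustration of the partitioned-log model (not any particular broker's API): each partition is an append-only sequence, a message's offset is simply its position in that sequence, and a consumer resumes from its last offset in each partition, which is what preserves per-partition ordering.

```go
package main

import "fmt"

// partition is an append-only sequence of messages; a message's offset is
// simply its index in the slice.
type partition struct {
	messages []string
}

func (p *partition) append(msg string) int64 {
	p.messages = append(p.messages, msg)
	return int64(len(p.messages) - 1) // offset of the new message
}

// consumer remembers the next offset to read per partition, so it can
// resume where it left off while preserving per-partition order.
type consumer struct {
	next map[int]int64
}

func (c *consumer) poll(partID int, p *partition) []string {
	from := c.next[partID]
	batch := p.messages[from:]
	c.next[partID] = int64(len(p.messages))
	return batch
}

func main() {
	p := &partition{}
	p.append("order created")
	p.append("order paid")

	c := &consumer{next: map[int]int64{}}
	fmt.Println(c.poll(0, p)) // [order created order paid]
	p.append("order shipped")
	fmt.Println(c.poll(0, p)) // [order shipped]
}
```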


Databases and Streams

Dual writes, where applications update multiple systems (e.g., a database, search index, and cache) concurrently or sequentially, are error-prone. Race conditions can occur, leading to inconsistent states (see Fig. 11-4). A better approach is to designate one system as the leader (e.g., the database) and make others, like the search index, its followers.

Traditionally, databases treat replication logs as internal details, not public APIs. However, Change Data Capture (CDC) extracts data changes from a database, often as a stream, enabling replication to other systems. For example, a database change log can update a search index in the same order as the changes occur, ensuring consistency (see Fig. 11-5). This makes derived data systems, like search indexes, consumers of the change stream.

CDC designates the source database as the leader, with derived systems as followers. Log-based message brokers like Kafka are ideal for delivering CDC events, as they maintain order. Large-scale CDC implementations include LinkedIn's Databus, Facebook's Wormhole, and Debezium for MySQL. Tools like Kafka Connect offer connectors for various databases.

To manage disk space, CDC uses snapshots in conjunction with logs. A snapshot corresponds to a specific log position, ensuring changes after the snapshot can be applied correctly. Some CDC tools automate this; others require manual handling.
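Here is a small sketch of the snapshot-plus-log pattern, with made-up types: a derived store (say, a search index) is seeded from a snapshot taken at a known log offset, then replays only the changes after that offset, in order.

```go
package main

import "fmt"

// Change is a single CDC event from the source database's change log.
type Change struct {
	Offset int64
	Key    string
	Value  string // empty value means the key was deleted
}

// Snapshot is a consistent copy of the data as of a specific log offset.
type Snapshot struct {
	AsOfOffset int64
	Data       map[string]string
}

// rebuildIndex seeds a derived store from the snapshot, then replays only
// the changes after the snapshot's offset, in log order.
func rebuildIndex(snap Snapshot, log []Change) map[string]string {
	index := map[string]string{}
	for k, v := range snap.Data {
		index[k] = v
	}
	for _, ch := range log {
		if ch.Offset <= snap.AsOfOffset {
			continue // already reflected in the snapshot
		}
		if ch.Value == "" {
			delete(index, ch.Key)
		} else {
			index[ch.Key] = ch.Value
		}
	}
	return index
}

func main() {
	snap := Snapshot{AsOfOffset: 2, Data: map[string]string{"user:1": "Ada"}}
	log := []Change{
		{Offset: 2, Key: "user:1", Value: "Ada"},   // covered by snapshot
		{Offset: 3, Key: "user:2", Value: "Grace"}, // applied
		{Offset: 4, Key: "user:1", Value: ""},      // delete applied
	}
	fmt.Println(rebuildIndex(snap, log)) // map[user:2:Grace]
}
```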

Event sourcing builds on immutability, recording all changes as an append-only log of events. This approach mirrors accounting, where transactions are never altered but corrected with new entries if mistakes occur. Immutable logs improve auditability, ease recovery from bugs, and preserve historical data for analytics. For instance, a customer adding and removing an item from a shopping cart generates two events. Though the cart's final state is empty, the log records the customer’s interest, which is valuable for analytics.

Event sourcing simplifies concurrency control. Instead of multi-object transactions, a single event encapsulates a user action, requiring only an atomic append to the log. Partitioning the log and application state similarly (e.g., by customer) enables single-threaded processing per partition, eliminating concurrency issues. For cross-partition events, additional coordination is needed.
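The shopping-cart example can be sketched in a few lines of Go (illustrative only): each user action is appended as an immutable event, the current cart state is derived by folding over the log, and the full history stays available for analytics.

```go
package main

import "fmt"

// CartEvent records a single user action; events are only ever appended.
type CartEvent struct {
	Kind string // "added" or "removed"
	Item string
}

// currentCart derives the cart's present state by folding over the event log.
func currentCart(events []CartEvent) map[string]int {
	cart := map[string]int{}
	for _, e := range events {
		switch e.Kind {
		case "added":
			cart[e.Item]++
		case "removed":
			if cart[e.Item] > 0 {
				cart[e.Item]--
				if cart[e.Item] == 0 {
					delete(cart, e.Item)
				}
			}
		}
	}
	return cart
}

func main() {
	// The customer added and then removed an item: the final cart is empty,
	// but the log still records the interest in the item.
	log := []CartEvent{
		{Kind: "added", Item: "laptop"},
		{Kind: "removed", Item: "laptop"},
	}
	fmt.Println(currentCart(log)) // map[]
}
```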


Processing streams

Processing streams supports use cases like monitoring (e.g., fraud detection, trading systems, manufacturing) and involves joins, which combine data across streams and databases. Three join types include stream-stream joins for matching related events within a time window, stream-table joins for enriching events using database changelogs, and table-table joins for producing materialized view updates. Fault tolerance in stream processing invole using techniques like microbatching, checkpointing, transactions, and idempotent writes to ensure reliability in long-running processes.