a curated list of database news from authoritative sources

March 27, 2025

Things that go wrong with disk IO

There are a few interesting scenarios to keep in mind when writing applications (not just databases!) that read and write files, particularly in transactional contexts where you actually care about the integrity of the data and when you are editing data in place (versus copy-on-write for example).

We'll go into a few scenarios where the following can happen:

  • Data you write never actually makes it to disk
  • Data you write gets sent to the wrong location on disk
  • Data you read is read from the wrong location on disk
  • Data gets corrupted on disk

And how real-world data systems think about these scenarios. (They don't always think of them at all!)

If I don't say otherwise I'm talking about behavior on Linux.

The post is largely a review of two papers: Parity Lost and Parity Regained and Characteristics, Impact, and Tolerance of Partial Disk Failures. These two papers also go into the frequency of some of the issues discussed here. These behaviors actually happen in real life!

Thank you to Alex Miller and George Xanthakis for reviewing a draft of this post.

Terminology

Some of these terms are reused in different contexts, and sometimes they are reused because they effectively mean the same thing in a certain configuration. But I'll try to be explicit to avoid confusion.

Sector

The smallest amount of data that can be read and written atomically by hardware. It used to be 512 bytes, but on modern disks it is often 4KiB. There doesn't seem to be any safe assumption you can make about sector size, despite file system defaults (see below). You must check your disks to know.
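
One way to check on Linux is to read the block layer's sysfs attributes. A quick sketch in Node.js (the device name sda is an assumption; substitute your own):

const fs = require('fs');
// Logical and physical block sizes as reported by the Linux block layer.
// A drive can expose 512-byte logical sectors while physically using 4KiB ones.
const dev = 'sda'; // assumption: adjust to your device
for (const attr of ['logical_block_size', 'physical_block_size']) {
  const value = fs.readFileSync(`/sys/block/${dev}/queue/${attr}`, 'utf8').trim();
  console.log(`${attr} = ${value}`);
}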

Block (filesystem/kernel view)

Typically set to the sector size since only this block size is atomic. The default in ext4 is 4KiB.

Page (kernel view)

A disk block that is in memory. Any reads/writes less than the size of a block will read the entire block into kernel memory even if less than that amount is sent back to userland.

Page (database/application view)

The smallest amount of data the system (database, application, etc.) chooses to act on, when it's read or written or held in memory. The page size is some multiple of the filesystem/kernel block size (including the multiple being 1). SQLite's default page size is 4KiB. MySQL's default page size is 16KiB. Postgres's default page size is 8KiB.

Things that go wrong

The data didn't reach disk

By default, file writes succeed when the data is copied into kernel memory (buffered IO). The man page for write(2) says:

A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.

If you don't call fsync on Linux the data isn't necessarily durably on disk, and if the system crashes or restarts before the disk writes the data to non-volatile storage, you may lose data.

With O_DIRECT, file writes succeed when the data is copied to at least the disk cache. Alternatively you could open the file with O_DIRECT|O_SYNC (or O_DIRECT|O_DSYNC) and forgo fsync calls.
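
As a concrete illustration, here is a minimal Node.js sketch of the buffered-write path with an explicit fsync; the file name and record are placeholders:

const fs = require('fs');

// write() returning only means the data is in the page cache (buffered IO).
const fd = fs.openSync('journal.log', 'a'); // hypothetical file
fs.writeSync(fd, 'some record\n');
// Only after fsync returns successfully can the write be considered durable.
fs.fsyncSync(fd);
fs.closeSync(fd);
// Alternatively, open with fs.constants.O_SYNC (or O_DSYNC) to make each write synchronous.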

fsync on macOS doesn't force the drive to flush its cache to stable storage; for that you need fcntl with F_FULLFSYNC.

If you're confused, read Userland Disk I/O.

Postgres, SQLite, MongoDB, MySQL fsync data before considering a transaction successful by default. RocksDB does not.

The data was fsynced but fsync failed

fsync isn't guaranteed to succeed. And when it fails you can't tell which write failed. It may not even be a failure of a write to a file that your process opened:

Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible.

Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).

Don't be 2018-era Postgres.

The only way to have known which exact write failed would be to open the file with O_DIRECT|O_SYNC (or O_DIRECT|O_DSYNC), though this is not the only way to handle fsync failures.
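
A common stance is to treat a failed fsync as "durability unknown" and recover from a known-good state (for example, crash and replay a log) rather than retrying fsync and carrying on. A sketch of that stance in Node.js (function and names are illustrative):

const fs = require('fs');

function durableAppend(fd, buf) {
  fs.writeSync(fd, buf);
  try {
    fs.fsyncSync(fd);
  } catch (err) {
    // After a failed fsync, dirty pages may already have been dropped, so a
    // later fsync can "succeed" without the data ever reaching disk. Don't
    // retry and carry on; surface the error and recover from a known-good state.
    console.error('fsync failed, on-disk state is unknown:', err);
    process.exit(1);
  }
}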

The data was corrupted

If you don't checksum your data on write and check the checksum on read (as well as doing periodic scrubbing, a la ZFS), you will never be aware if and when the data gets corrupted, and you will have to restore (who knows how far back in time) from backups if and when you notice.

ZFS, MongoDB (WiredTiger), MySQL (InnoDB), and RocksDB checksum data by default. Postgres and SQLite do not (though databases created from Postgres 18+ will).

You should probably turn on checksums on any system that supports it, regardless of the default.
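
If the system you use doesn't do this for you, the idea is simple enough to bolt on at the application's page level. A minimal sketch using Node's crypto (any strong hash or CRC-32C would do):

const crypto = require('crypto');

// Store checksum(page) alongside the page on write; verify it on every read.
function checksum(page) {
  return crypto.createHash('sha256').update(page).digest();
}

function verifyPage(page, storedChecksum) {
  if (!checksum(page).equals(storedChecksum)) {
    // Corruption detected now, not years later during a restore.
    throw new Error('page checksum mismatch');
  }
  return page;
}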

The data was partially written

Only when the page size you write = block size of your filesystem = sector size of your disk is a write guaranteed to be atomic. If you need to write multiple sectors of data atomically there is the risk that some sectors are written and then the system crashes or restarts. This behavior is called torn writes or torn pages.

Postgres, SQLite, and MySQL (InnoDB) handle torn writes. Torn writes are by definition not relevant to immutable storage systems like RocksDB (and other LSM Tree or Copy-on-Write systems like MongoDB (WiredTiger)) unless writes (that update metadata) span sectors.

If your system duplicates all writes, as MySQL (InnoDB) does (or as you can do with data=journal in ext4), you may also not have to worry about torn writes. On the other hand, this amplifies writes 2x.
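
The common application-level defense is some form of write-ahead copy: write the new page to a journal (or doublewrite buffer), fsync it, and only then write it in place. A rough sketch, assuming fixed-size pages and ignoring journal bookkeeping and cleanup:

const fs = require('fs');

// If the in-place write tears, recovery replays the intact copy from the journal.
// A real journal would also record the page number and a checksum with the copy.
function writePage(journalFd, dataFd, pageNo, page) {
  fs.writeSync(journalFd, page, 0, page.length);                     // 1. full copy to the journal
  fs.fsyncSync(journalFd);                                           // 2. make the copy durable
  fs.writeSync(dataFd, page, 0, page.length, pageNo * page.length);  // 3. in-place write
  fs.fsyncSync(dataFd);                                              // 4. journal copy can now be discarded
}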

The data didn't reach disk, part 2

Sometimes fsync succeeds but the data isn't actually on disk because the disk is lying. This behavior is called lost writes or phantom writes. You can be resilient to phantom writes by always reading back what you wrote (expensive) or versioning what you wrote.

Databases and file systems generally do not seem to handle this situation.
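
A read-back check is conceptually simple, with the caveat that without O_DIRECT the read may be served from the page cache rather than the disk. A sketch (names are illustrative):

const fs = require('fs');

function writeAndVerify(fd, page, position) {
  fs.writeSync(fd, page, 0, page.length, position);
  fs.fsyncSync(fd);
  const readBack = Buffer.alloc(page.length);
  fs.readSync(fd, readBack, 0, readBack.length, position);
  if (!readBack.equals(page)) {
    throw new Error('lost write: read-back does not match what was written');
  }
}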

The data was written to the wrong place, read from the wrong place

If you aren't including where data is supposed to be on disk as part of the checksum or page itself, you risk being unaware that you wrote data to the wrong place or that you read from the wrong place. This is called misdirected writes/reads.

Databases and file systems generally do not seem to handle this situation.
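
If you do want to defend against it, one approach is to fold the page's intended location into the checksum, so a page written to or read from the wrong place fails verification like any other corruption. A sketch, reusing the checksum idea from above:

const crypto = require('crypto');

// The checksum covers the page number we intended, so a misdirected write or
// read shows up as an ordinary checksum mismatch on the next verification.
function locatedChecksum(pageNo, page) {
  const loc = Buffer.alloc(8);
  loc.writeBigUInt64LE(BigInt(pageNo));
  return crypto.createHash('sha256').update(loc).update(page).digest();
}
// On reading pageNo: recompute locatedChecksum(pageNo, page) and compare it
// to the checksum stored with the page.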

Further reading

In increasing levels of paranoia (laudatory) follow ZFS, Andrea and Remzi Arpaci-Dusseau, and TigerBeetle.

March 26, 2025

Up and running with Apache OFBiz and Amazon Aurora DSQL

In this post, we show you a worked example of taking an existing application that works on PostgreSQL databases and adapting it to work with an Aurora DSQL database. In addition to adapting for the previously mentioned aspects, we also address some data type incompatibilities and work around some limits that currently exist in Aurora DSQL.

MongoDB Vector Search Index, with local Atlas and Ollama

For this demo, I'm using the Eurovision songs available on Kaggle, which contain lyrics in their original language and translated into English.

Sample data

I downloaded and uncompressed the files:


wget -c -O eurovision-song-lyrics.zip https://www.kaggle.com/api/v1/datasets/download/minitree/eurovision-song-lyrics
  unzip -o eurovision-song-lyrics.zip 
  rm -f    eurovision-song-lyrics.zip

MongoDB Atlas and Shell

I install MongoDB Atlas CLI and start a local instance:

curl https://fastdl.mongodb.org/mongocli/mongodb-atlas-cli_1.41.1_linux_arm64.tar.gz | 
 tar -xzvf - &&
 alias atlas=$PWD/mongodb-atlas-cli_1.41.1_linux_arm64/bin/atlas

atlas deployments setup atlas --type local --port 27017 --force

This runs MongoDB Atlas in a Docker container:

I also installed MongoDB Shell to connect and run JavaScript and Node.js:

curl https://downloads.mongodb.com/compass/mongosh-2.4.2-linux-arm64.tgz | 
 tar -xzvf - && 
 alias mongosh=$PWD/mongosh-2.4.2-linux-arm64/bin/mongosh                                                                      

Ollama Large Language Model

I need a local LLM model to generate embeddings from lyrics. To avoid relying on external services, I will use Ollama with the nomic-embed-text model, which I install locally:


curl -fsSL https://ollama.com/install.sh | sh

ollama pull nomic-embed-text

npm install ollama

mongosh

I've installed the Ollama module for Node.js and started MongoDB Shell.

Load data into MongoDB

I load the files into a eurovision collection:


const fs = require('fs/promises');
// Read the downloaded JSON file and insert each entry as a document
async function loadJsonToMongoDB() {
 const fileContent = await fs.readFile('eurovision-lyrics-2023.json', 'utf8');
 const jsonData = JSON.parse(fileContent);
 const documents = Object.values(jsonData);
 await db.eurovision.insertMany(documents);
 console.log(`Inserted ${documents.length} documents`);
}

db.eurovision.drop();
loadJsonToMongoDB();
db.eurovision.countDocuments();

Generate embeddings

I update the MongoDB collection to add an embedding to each document, generated from the lyrics with Ollama:


const ollama = require("ollama"); // Ollama Node.js client

// Calculate embeddings
async function calculateEmbeddings(collection) {
  try {
    // Process each document and update the embedding
    const cursor = collection.find();
    let counter = 0;
    for await (const doc of cursor) {
      // Call the embedding API
      const data = {
        model: 'nomic-embed-text',
        prompt: doc["Lyrics translation"]
      };
      const { embedding } = await ollama.default.embeddings(data);
      // Update the document with the new embedding
      await collection.updateOne(
        { _id: doc._id },
        { $set: { embedding: embedding } }
      );
      counter++;
      console.log(`${counter}: Added embeddings for ${doc.Year} ${doc.Country}`);
    }
  } catch (error) {
    console.error('Error:', error);
  }
}
calculateEmbeddings(db.eurovision);

This takes some time:

MongoDB Vector Search Index

I create a vector search index (Ollama nomic-embed-text, like BERT, has 768 dimensions):


db.eurovision.createSearchIndex(
   "vectorSearchOnLyrics",
   "vectorSearch",
   {
      fields: [
         {
            type: "vector",
            numDimensions: 768,
            path: "embedding",
            similarity: "cosine"
         }
      ]
   }
)

db.eurovision.getSearchIndexes()

Don't forget the name of the index: it is used when querying, and a wrong index name simply results in no results.

Aggregation pipeline with vector search

Here is the function I'll use to query with a prompt, converting the prompt to vector embedding with the same model:


const ollama = require("ollama"); // Ollama Node.js client

async function vectorSearch(collection, prompt) {
  try {
    // Get the embedding for the prompt
    const data = {
      model: 'nomic-embed-text',
      prompt: prompt,
    };
    const { embedding } = await ollama.default.embeddings(data);
    // Perform a vector search in aggregation pipeline
    const results = collection.aggregate([
      {
      "$vectorSearch": {
        "index": "vectorSearchOnLyrics",
        "path": "embedding",
        "queryVector": embedding,
        "numCandidates": 10,
        "limit": 5
      }
      },{
      "$project": {
        "Year": 1,
        "Country": 1,
        "Artist": 1,
        "Song": 1,
        "Language": 1,
        "score": { "$meta": "vectorSearchScore" }
       }
     }
     ]);
    // Display the result
    results.forEach(doc => {
      console.log(`${doc.score.toFixed(2)} ${doc.Year} ${doc.Country} ${doc.Artist} ${doc.Song} (${doc.Language})`);
    });
  } catch (error) {
    console.error('Error during vector search:', error);
  }
}

Here are a few prompts I tried, in different languages, to find songs of which I remember some bits:


vectorSearch(db.eurovision,
 "un nino y un pájaro"
);

vectorSearch(db.eurovision,
 "a wax doll singing"
);

vectorSearch(db.eurovision,
 "Un chico llamado Mercy"
);

vectorSearch(db.eurovision,
 "lyrics were about being unique and not like the others"
);

Here are my results; the songs I had in mind appeared in the first or second position:

Conclusion

This article discusses implementing vector search in a local MongoDB setup using Ollama and the Eurovision song lyrics dataset. We populated the database with vectorized data embedded in the documents and created a vector search index for retrieving songs based on semantic similarity.

Storing embeddings with data, rather than in a separate database, is advantageous because embeddings are generated from document fields and can be indexed like any other fields. MongoDB stores arrays natively with the document model. The next step will involve generating embeddings without moving data out of the database, by integrating Voyage AI into MongoDB.

Enhancing Keyhole: Pulling More Metrics from MongoDB’s FTDC

Everyone knows that MongoDB has FTDC (Full-Time Diagnostic Data Capture), which helps MongoDB engineers analyze server behavior, tune parameters, and conduct forensic work when issues occur within their clusters. Here at Percona, we’ve been using the Keyhole tool for a while, and it’s great! It’s very useful for tuning parameters or analyzing what happened at […]

Vibe code tools, not toys.

How I vibe coded an internal anomaly detection system that had previously taken me months to build and deploy.

March 25, 2025

Transition a pivot query that includes dynamic columns from SQL Server to PostgreSQL

When assisting customers with migrating their workloads from SQL Server to PostgreSQL, we often encounter a scenario where the PIVOT function is used extensively for generating dynamic reports. In this post, we show you how to use the crosstab function, provided by PostgreSQL’s tablefunc extension, to implement functionality similar to SQL Server’s PIVOT function, offering greater flexibility.

Integrate natural language processing and generative AI with relational databases

In this post, we present an approach to using natural language processing (NLP) to query an Amazon Aurora PostgreSQL-Compatible Edition database. The solution presented in this post assumes that an organization has an Aurora PostgreSQL database. We create a web application framework using Flask for the user to interact with the database. JavaScript and Python code act as the interface between the web framework, Amazon Bedrock, and the database.

March 24, 2025

Cedar: A New Language for Expressive, Fast, Safe, and Analyzable Authorization

This paper (2024) presents Cedar, AWS's new authorization policy language. By providing a clear declarative way to manage access control policies, Cedar addresses the common limitations of embedding authorization logic directly into application code: problems with correctness, auditing, and maintenance. Cedar introduces a domain-specific language (DSL) to express policies that are separate from application code, and improves readability and manageability. In that sense, this is like aspect-oriented programming but for authorization policy.

The language is designed with four main objectives: expressiveness, performance, safety, and analyzability. Cedar balances these goals by supporting role-based (RBAC), attribute-based (ABAC), and relationship-based (ReBAC) access control models.


Policy Structure and Evaluation

Cedar policies consist of three primary components: effect, scope, and conditions. The effect can either be "permit" or "forbid", defining whether access is granted or denied. The scope specifies the principal (user), action, and resource. Conditions provide additional constraints using entity attributes.

Entities in Cedar represent users, resources, etc. Each entity belongs to a specific type, such as User, Team, or List. Entities are uniquely identified by a combination of their type and a string identifier (e.g., User::"andrew"). Entity attributes are key-value pairs associated with entities, providing additional information. Attributes can include primitive data types like strings or numbers, collections like lists or sets, or references to other entities. For example, a List entity might have attributes like owner, readers, editors, and tasks, where owner references a User entity, and readers references a set of Team entities. Entities and their attributes form a hierarchical structure, which Cedar uses to evaluate policies efficiently.


Cedar ensures safety by applying a deny-by-default model. If no "permit" policy is applicable, access is denied. Additionally, "forbid" policies always take precedence over "permit" policies.
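
In other words, the decision rule is: deny unless some applicable policy permits, and any applicable forbid wins. A small sketch of just that rule (not Cedar's implementation; applies() stands in for the scope and condition checks described above):

// effect is 'permit' or 'forbid'; applies(policy, request) is assumed to
// evaluate the policy's scope and conditions against the request.
function authorize(policies, request, applies) {
  const applicable = policies.filter(p => applies(p, request));
  if (applicable.some(p => p.effect === 'forbid')) return 'Deny';  // forbid overrides permit
  if (applicable.some(p => p.effect === 'permit')) return 'Allow';
  return 'Deny';                                                   // deny by default
}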

A schema defines entity types, attributes, and valid actions, and Cedar's policy validator uses optional typing and static capabilities to catch potential errors, like accessing non-existent attributes. Another key feature is policy templates, which allow developers to define reusable policy patterns with placeholders. These placeholders are instantiated with specific entities or attributes at runtime, reducing redundancy and simplifying policy management.


Symbolic Analysis and Proofs

Cedar supports deeper policy analysis through symbolic compilation. Symbolic compilation reduces policies to SMT (Satisfiability Modulo Theories) formulas, enabling automated reasoning about policy behavior. This allows checking for policy equivalence, identifying inconsistencies, and verifying security invariants.

Checking policy equivalence is particularly useful for ensuring that policy refactoring or updates do not introduce unintended behavior. By comparing two versions of a policy set, Cedar can determine if they produce identical authorization decisions for all possible requests. This is crucial for organizations with frequent policy updates to ensure no permissions are inadvertently granted or revoked.

By compiling to SMT, Cedar leverages the rapid advancements in SMT solvers over the past two decades. Improvements in solver algorithms, better heuristics for decision procedures, and more efficient memory management have significantly enhanced SMT solver performance. Tools like Z3 and CVC5 are now capable of solving complex logical formulas quickly, making real-time policy analysis feasible. These advancements enable Cedar to perform sophisticated policy checks with minimal overhead.

Cedar is formalized in the Lean programming language and uses its proof assistant to establish key properties like correctness of the authorization algorithm, sound slicing, and validation soundness. The compiler's encoding is proved to be decidable, sound, and complete. This result, the first of its kind for a non-trivial policy language, is made possible by Cedar's careful control over expressiveness and by leveraging invariants ensured by Cedar's policy validator.


Implementation and Evaluation

Cedar is implemented in Rust and open-sourced at https://github.com/cedar-policy/. The implementation is extensively tested using differential random testing to ensure correctness.


Cedar was evaluated against two prominent open-source, general-purpose authorization languages, OpenFGA and Rego, using three example sets of policies. The evaluation demonstrated that Cedar's policy slicing significantly reduces evaluation time. For large policy sets, Cedar achieved a 28.7x to 35.2x speedup over OpenFGA and a 42.8x to 80.8x speedup over Rego.

Unlike OpenFGA and Rego, which show scaling challenges with increased input sizes, Cedar maintains consistently low evaluation latency. In a simulated Google Drive authorization scenario, Cedar handled requests in a median time of 4.0 microseconds (𝜇s) with 5 Users/Groups/Documents/Folders, increasing only to 5.0𝜇s with 50 Users/Groups/Documents/Folders. Similarly, in a GitHub-like authorization scenario, Cedar maintained a median performance of around 11.0𝜇s across all input ranges. Even at the 99th percentile (p99), Cedar's response times remained low, with under 10𝜇s for Google Drive and under 20𝜇s for GitHub. Further validating its real-world applicability, Cedar is deployed at scale within Amazon Web Services, providing secure and efficient authorization for large and complex systems.



MongoDB TTL and Disk Storage

In a previous blog post, I explained how MongoDB TTL indexes work and their optimization to avoid fragmentation during scans. However, I didn’t cover the details of on-disk storage. A recent Reddit question is the occasion to explore this aspect further.

Reproducible example

Here is a small program that inserts documents in a loop, with a timestamp and some random text:

// random string to be not too friendly with compression
function getRandomString(length) {
  const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  let result = '';
  const charactersLength = characters.length;
  for (let i = 0; i < length; i++) {
    result += characters.charAt(Math.floor(Math.random() * charactersLength));
  }
  return result;
}
// MongoDB loop for inserting documents with a random string
while (true) {
  const doc = { text: getRandomString(1000), ts: new Date() };
  db.ttl.insertOne(doc);
  insertedCount++;
}

Before executing this, I ran a background function to display statistics every minute, including the number of records, memory usage, and disk size.

// Prints stats every minute
 let insertedCount = 0;
 let maxThroughput = 0;
 let maxCollectionCount = 0;
 let maxCollectionSize = 0;
 let maxStorageSize = 0;
 let maxTotalIndexSize = 0;
 setInterval(async () => {
      const stats = await db.ttl.stats();
      const throughput = insertedCount / 10; // assuming measure over 10 seconds
      const collectionCount = stats.count;
      const collectionSizeMB = stats.size / 1024 / 1024;
      const storageSizeMB = stats.storageSize / 1024 / 1024;
      const totalIndexSizeMB = stats.totalIndexSize / 1024 / 1024;
      maxThroughput = Math.max(maxThroughput, throughput);
      maxCollectionCount = Math.max(maxCollectionCount, collectionCount);
      maxCollectionSize = Math.max(maxCollectionSize, collectionSizeMB);
      maxStorageSize = Math.max(maxStorageSize, storageSizeMB);
      maxTotalIndexSize = Math.max(maxTotalIndexSize, totalIndexSizeMB);
      console.log(`Collection Name: ${stats.ns}
   Throughput:        ${throughput.toFixed(0).padStart(10)} docs/min (max: ${maxThroughput.toFixed(0)} docs/min)
   Collection Size:   ${collectionSizeMB.toFixed(0).padStart(10)} MB (max: ${maxCollectionSize.toFixed(0)} MB)
   Number of Records: ${collectionCount.toFixed(0).padStart(10)}     (max: ${maxCollectionCount.toFixed(0)} docs)
   Storage Size:      ${storageSizeMB.toFixed(0).padStart(10)}    MB (max: ${maxStorageSize.toFixed(0)} MB)
   Total Index Size:  ${totalIndexSizeMB.toFixed(0).padStart(10)} MB (max: ${maxTotalIndexSize.toFixed(0)} MB)`);
      insertedCount = 0;
 }, 60000); // every minute

I created the collection with a TTL index, which automatically expires data older than five minutes:

// TTL expire after 5 minutes
db.ttl.drop();
db.ttl.createIndex({ ts: 1 }, { expireAfterSeconds: 300 });

I let this run to see how the storage size evolves. Note that this was run on MongoDB 7.0.16 (without the auto-compact job that appeared in 8.0).

Output after 3 hours

The consistent insert rate, combined with TTL expiration, keeps the number of documents in the collection relatively stable. Deletions occur every minute, ensuring that the overall document count remains consistent.

I observe the same from the statistics I log every minute:

The storage size also remains constant, at 244MB for the table and 7MB for the indexes. The size of the files increased for the first six minutes and then remained constant:

This is sufficient to show that the deletion and insertion do not have a fragmentation effect that would require additional consideration. About 25% is marked as available for reuse and is effectively reused.
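
The reusable space can be checked directly: the collection stats expose WiredTiger's block-manager counters, so the percentage above can be computed with a quick check in mongosh (field names as exposed by db.collection.stats() with WiredTiger):

const s = db.ttl.stats();
const fileSize = s.wiredTiger["block-manager"]["file size in bytes"];
const reusable = s.wiredTiger["block-manager"]["file bytes available for reuse"];
print(`storage: ${(fileSize / 1024 / 1024).toFixed(0)} MB, ` +
      `reusable: ${(reusable / 1024 / 1024).toFixed(0)} MB ` +
      `(${(100 * reusable / fileSize).toFixed(0)}%)`);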

It is possible to force a compaction, to temporarily reclaim more space, but it is not necessary:

Collection Name: test.ttl
   Throughput:              3286 docs/min (max: 3327 docs/min)
   Collection Size:          170 MB (max: 198 MB)
   Number of Records:     171026     (max: 198699 docs)
   Storage Size:             244    MB (max: 244 MB)
   Total Index Size:           7 MB (max: 7 MB)

Collection Name: test.ttl
   Throughput:              3299 docs/min (max: 3327 docs/min)
   Collection Size:          170 MB (max: 198 MB)
   Number of Records:     170977     (max: 198699 docs)
   Storage Size:             244    MB (max: 244 MB)
   Total Index Size:           6 MB (max: 7 MB)

Collection Name: test.ttl
   Throughput:              3317 docs/min (max: 3327 docs/min)
   Collection Size:          170 MB (max: 198 MB)
   Number of Records:     170985     (max: 198699 docs)
   Storage Size:             244    MB (max: 244 MB)
   Total Index Size:           6 MB (max: 7 MB)

test> db.runCommand({ compact: 'ttl' });
{ bytesFreed: 49553408, ok: 1 }

Collection Name: test.ttl
   Throughput:              1244 docs/min (max: 3327 docs/min)
   Collection Size:          150 MB (max: 198 MB)
   Number of Records:     150165     (max: 198699 docs)
   Storage Size:             197    MB (max: 244 MB)
   Total Index Size:           6 MB (max: 7 MB)
Collection Name: test.ttl
   Throughput:              3272 docs/min (max: 3327 docs/min)
   Collection Size:          149 MB (max: 198 MB)
   Number of Records:     149553     (max: 198699 docs)
   Storage Size:             203    MB (max: 244 MB)
   Total Index Size:           6 MB (max: 7 MB)

While this reduced the storage size, it eventually returns to its normal volume. It's typical for a B-tree to maintain some free space, which helps to minimize frequent space allocation and reclamation.

Here is a closer look at the point when I ran the manual compaction:

Conclusion

TTL deletion makes space available for reuse instead of reclaiming it immediately, but this doesn't increase fragmentation. This space is reused automatically to maintain a total size proportional to the document count, with constant free space of about 25% in my case, to minimize the frequent allocations typical of B-tree implementations.
The TTL mechanism operates autonomously, requiring no manual compaction. If you have any doubt, monitor it: MongoDB offers statistics to compare logical and physical sizes at both the MongoDB layer and the WiredTiger storage layer.

Percona XtraBackup 8.4 Pro: Reduce Server Locking by up to 4300X

When performing backups, reducing the amount of time your server is locked can significantly improve performance and minimize disruptions. Percona XtraBackup 8.4 Pro introduces improvements in how DDL (Data Definition Language) locks (aka Backup Locks) are managed, allowing for reduced locking during backups. In this post, we’ll explore the impact of these enhancements. TL;DR (Summary) […]

Reducing PostgreSQL Costs in the Cloud

This post was originally published in March 2023 and was updated in March 2025. If you’re using PostgreSQL in the cloud, there’s a good chance you’re spending more than you need in order to get the results required for your business. Effectively managing PostgreSQL costs in the cloud is crucial, and this post explores practical […]

March 21, 2025

Merging Streams of Convex data

New convex-helpers are available now for fetching streams of documents, merging them together, filtering them out, and paginating the results. With these helpers, you can replicate patterns you may know from SQL: UNION ALL, JOIN, DISTINCT, GROUP BY, and WHERE clauses where index fields are skipped.

Less ceremony, more shipping

Deployments in data finally get their due. With tb deploy, live schema migrations happen painlessly and automatically.

March 20, 2025

Percona Monitoring and Management 2: Clarifying the End-of-Life and Transition to PMM 3

At Percona, we’re committed to providing you with the best database monitoring and management tools. With the release of Percona Monitoring and Management 3 (PMM 3), we’re now entering a critical phase in the lifecycle of PMM 2. We understand that PMM 2 remains a vital tool for many of our users, and we want […]

ParallelRaft: Out-of-Order Executions in PolarFS

This paper (2021) dives deeper in the Parallel Raft protocol introduced with PolarFS.

PolarFS (VLDB'18) is a distributed file system with ultra-low latency and high availability, developed by Alibaba. Its variant of the Raft consensus protocol, ParallelRaft, relaxes Raft's strict serialization constraints. ParallelRaft allows state machines to commit and execute log entries out of order.

This study provides a formal specification of ParallelRaft in TLA+, proves its correctness, and identifies consistency issues caused by ghost log entries. To address these issues, the refined ParallelRaft-CE protocol limits parallelism during leader transitions.


Introduction

Raft is a tight-ass. It enforces sequential commitment and execution of log entries. No funny business. Multi-Paxos is a bit rebellious. It allows out-of-order commitment, but it still insists on ordered execution. In order not to leave any performance on the table, PolarFS's ParallelRaft rebels completely: it not only allows out-of-order commitment, but out-of-order execution as well. 

How does ParallelRaft keep consistency of state machine replication across replicas? The trick is to permit out-of-order execution if commands are conflict-free. It is the oldest trick in the book, the generalized Paxos idea. Each command contains a Logical Block Address (LBA). Commands accessing non-overlapping LBAs are conflict-free and can execute in parallel (i.e., out of order). Overlapping commands must execute in log order.

Ok, still this is a big claim. Concurrency is a wild beast! Did you expect this to be simple? What about leader turnovers? There is a lot to consider there. This paper complains that the original PolarFS paper was too quick to call it a day without specifying the protocol completely or providing an open-source implementation. They suspect that, in the presence of leader turnovers, the ParallelRaft protocol is vulnerable to inconsistency induced by ghost log entries.

To investigate, they introduce an intermediate protocol, ParallelRaft-SE (Sequential Execution), which allows out-of-order commitment but mandates sequential execution. The point of ParallelRaft-SE is to establish a refinement mapping to Multi-Paxos. They then relax it to get ParallelRaft-CE (Concurrent Execution), which supports both out-of-order commitment and execution. They show that ParallelRaft-CE is prone to state inconsistency due to ghost log entries, and they propose to mitigate this issue using stricter leader election rules and log recovery mechanisms, forbidding out-of-order execution during leader takeover.


Out-of-Order Executions and Ghost Log Entries

A key challenge of out-of-order execution is determining conflicts when log entries are missing. How do you track conflicts with holes in the log? ParallelRaft introduces a Look Behind Buffer (LBF) that tracks the LBAs of up to K preceding log entries. If no holes exceeding K exist, conflict detection is feasible.
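
A sketch of that conflict check (names are illustrative, not from the paper's specification): each entry carries its own LBA range plus, in its look behind buffer, the LBA ranges of up to K preceding entries, so the check works even when some of those entries are missing locally:

// Commands on non-overlapping LBA ranges are conflict-free and may execute
// out of order; overlapping ones must wait for log order.
function overlaps(a, b) {
  return a.start <= b.end && b.start <= a.end;
}
// entry.lba is this command's LBA range; entry.lookBehind holds the LBA
// ranges of its K predecessors (filled in by the leader when appending).
function canExecuteOutOfOrder(entry) {
  return !entry.lookBehind.some(prevRange => overlaps(prevRange, entry.lba));
}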

However, the paper argues ghost log entries can cause inconsistency as in Figure 1.

In phase 1, leader s1 creates entries 1-4, with only entries 1-2 being committed before s1 fails. When s3 becomes leader in phase 2, it doesn't receive s1's uncommitted entries 3-4 during recovery. It then adds its own entries 3-6, commits entry 6 (which sets y←2), and executes this operation before failing.

In phase 3, new leader s2 unexpectedly receives entry 4 from the now-recovered s1 during recovery. After committing this entry, s2 must execute it (setting y←3). This operation should have executed before entry 6 according to ordering rules, but s3 has already executed y←2.

This "ghost log entries" phenomenon--where s1's uncommitted entries disappear during s3's leadership only to reappear during s2's leadership-- creates inconsistent execution order. The actual execution conflicts with what should have happened based on log position, leading to incorrect conflict determination and system inconsistency.


The proposed ParallelRaft-CE patch

ParallelRaft-CE addresses ghost entries by refining ParallelRaft-SE. The key idea is to track the generation moment of log entries by the sync number, and barrier entries from old terms.

The protocol creates a clear boundary between terms by allowing concurrent operations only for log entries with the same term, while enforcing sequential processing for entries from different terms. It tracks entry generation through the sync number and adds a "date" variable to each log entry (similar to Paxos proposal numbers) to determine recency.

During recovery, the new leader collects log entries from a majority of nodes and selects the latest entries based on their "date" values. This ensures that already committed entries remain in the log, previously uncommitted entries are either properly committed or discarded, and no ghost entries can persist undetected across term changes. By introducing this controlled concurrency model where the sync number is always less than or equal to a node's term, ParallelRaft-CE effectively uses leader election as a sequentialization bottleneck to clean up potential inconsistencies.

ParallelRaft-CE has three key components:

1. Log Synchronization Mechanism: Followers accept only log entries matching their sync number (equivalent to the current term). Entries from different terms are rejected.

2. Leader Election Mechanism: Similar to Raft, but with an additional check to ensure no ghost entries exist. Leaders are responsible for reconciling logs and eliminating inconsistencies.

3. Log Recovery Mechanism: During recovery, the new leader collects logs from a majority of nodes and applies a conflict resolution process. Uncommitted entries from the previous term are either committed or discarded. 

In sequential Raft, leader transition is simple, as the selected leader is already caught up: Logs of nodes in Raft do not have holes, and the log matching property between nodes is guaranteed. But in ParallelRaft, since there are holes, leader election needs to be augmented with recovery as in Paxos leader transitions. This is the price to pay for being relaxed in out-of-order commit and execution.


The ParallelRaft-CE protocol limits out-of-order execution to entries within the same term. Log entries from different terms follow a sequential recovery and commitment process to eliminate ghost entries and guarantee state consistency.

March 19, 2025

Translate SQL into Convex Queries

Here’s a cheatsheet with examples of conversions between SQL queries and Convex queries. This article is geared towards developers (and LLMs) who have familiarity with SQL and want to translate those familiar patterns into Convex queries. You'll learn how to `UNION`, `JOIN`, `DISTINCT`, do `WHERE` clauses, and `SELECT` fields.

MongoDB equivalent for PostgreSQL JSONB operations

This article examines PostgreSQL operations that benefit from GIN indexes, as listed in Built-in GIN Operator Classes.

I created a table containing one million rows, plus three additional rows relevant to my queries. The goal is to demonstrate the benefits of using indexes, allowing for efficient data retrieval without the need to read all rows.

Sample table in PostgreSQL

Here is a table that simulates a document collection, with all attributes in a JSONB column:

CREATE TABLE data (
    id SERIAL PRIMARY KEY,
    details JSONB
);

INSERT INTO data (details)
SELECT jsonb_build_object(
    'stuff', md5(random()::text)
)
FROM generate_series(1, 1000000);

ANALYZE data;

INSERT INTO data (details) VALUES
('{"name": "Alice", "age": 30, "tags": ["developer", "engineer"]}'),
('{"name": "Bob", "age": 25, "tags": ["designer"]}'),
('{"name": "Carol", "age": 40, "tags": ["developer", "manager"]}');

Sample collection on MongoDB

I am creating the same data in a MongoDB collection:

const bulk = db.data.initializeUnorderedBulkOp();
for (let i = 0; i < 1000000; i++) {
    bulk.insert({
        stuff: Math.random().toString(36).substring(2, 15)
    });
}
bulk.execute();

db.data.insertMany([
    { "name": "Alice", "age": 30, "tags": ["developer", "engineer"] },
    { "name": "Bob", "age": 25, "tags": ["designer"] },
    { "name": "Carol", "age": 40, "tags": ["developer", "manager"] }
]);

@> (jsonb, jsonb): Contains

The following queries search for data where the tag array contains a value:

PostgreSQL:

SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;

MongoDB:

db.data.find({ tags: { $all: ["developer"] } })
;

Without an index, this runs a SeqScan on PostgreSQL:

postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on data (actual time=144.002..144.004 rows=2 loops=1)
   Filter: (details @> '{"tags": ["developer"]}'::jsonb)
   Rows Removed by Filter: 1000001
   Buffers: shared hit=5715 read=4595
 Planning Time: 0.065 ms
 Serialization: time=0.008 ms  output=1kB  format=text
 Execution Time: 144.236 ms

and a COLLSCAN on MongoDB:

test> print( 
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 2,
  executionTimeMillis: 437,
  totalKeysExamined: 0,
  totalDocsExamined: 1000003,
  executionStages: {
    stage: 'COLLSCAN',
    filter: { tags: { '$eq': 'developer' } },
    nReturned: 2,
    executionTimeMillisEstimate: 391,
    works: 1000004,
    advanced: 2,
    direction: 'forward',
    docsExamined: 1000003
  }
}

I create a jsonb_path_ops GIN index on PostgreSQL:

CREATE INDEX idx_details ON data USING GIN (details jsonb_path_ops)
;

and a sparse multikey index on MongoDB for the array of tags:

db.data.createIndex({ tags: 1 } , { sparse: true })
;

PostgreSQL uses the GIN index to find the two index entries, through a Bitmap Index Scan, and then the two rows, where the Recheck Cond didn't have to filter anything more:

postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Bitmap Heap Scan on data (actual time=0.019..0.021 rows=2 loops=1)
   Recheck Cond: (details @> '{"tags": ["developer"]}'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_details (actual time=0.010..0.010 rows=2 loops=1)
         Index Cond: (details @> '{"tags": ["developer"]}'::jsonb)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.079 ms
 Serialization: time=0.005 ms  output=1kB  format=text
 Execution Time: 0.041 ms

MongoDB efficiently reads two index entries and retrieves two documents without the need for bitmaps. This method preserves the index order, which is advantageous when handling multiple rows.

test> print( 
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 2,
  executionTimeMillis: 0,
  totalKeysExamined: 2,
  totalDocsExamined: 2,
  executionStages: {
    stage: 'FETCH',
    nReturned: 2,
    executionTimeMillisEstimate: 0,
    works: 3,
    advanced: 2,
    docsExamined: 2,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 2,
      executionTimeMillisEstimate: 0,
      works: 3,
      advanced: 2,
      keyPattern: { tags: 1 },
      indexName: 'tags_1',
      isMultiKey: true,
      multiKeyPaths: { tags: [ 'tags' ] },
      isSparse: true,
      isPartial: false,
      direction: 'forward',
      indexBounds: { tags: [ '["developer", "developer"]' ] },
      keysExamined: 2,
      seeks: 1,
      dupsTested: 2
    }
  }
}

@? (jsonb, jsonpath): JSON Path Match

The following query searches for developers, as well as non-developers younger than 35:

On PostgreSQL:

SELECT * FROM data 
 WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;

On MongoDB:

db.data.find({
  $or: [
    { tags: { $elemMatch: { $eq: "developer" } } },
    { age: { $lt: 35 } }
  ]
});

Without an additional index on "age", MongoDB chooses a COLLSCAN:

test> print( 
db.data.find({
  $or: [
    { tags: { $elemMatch: { $eq: "developer" } } },
    { age: { $lt: 35 } }
  ]
}).explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 3,
  executionTimeMillis: 585,
  totalKeysExamined: 0,
  totalDocsExamined: 1000003,
  executionStages: {
    isCached: false,
    stage: 'SUBPLAN',
    nReturned: 3,
    executionTimeMillisEstimate: 547,
    works: 1000004,
    advanced: 3,
    needTime: 1000000,
    inputStage: {
      stage: 'COLLSCAN',
      filter: {
        '$or': [
          { tags: { '$elemMatch': [Object] } },
          { age: { '$lt': 35 } }
        ]
      },
      nReturned: 3,
      executionTimeMillisEstimate: 517,
      works: 1000004,
      advanced: 3,
      needTime: 1000000,
      direction: 'forward',
      docsExamined: 1000003
    }
  }
}

PostgreSQL uses the GIN index, but not efficiently, as it reads all index entries and then discards them during the recheck:

postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data 
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on data (actual time=582.323..582.327 rows=3 loops=1)
   Recheck Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
   Rows Removed by Index Recheck: 1000000
   Heap Blocks: exact=10310
   Buffers: shared hit=14703
   ->  Bitmap Index Scan on idx_details (actual time=123.755..123.755 rows=1000003 loops=1)
         Index Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
         Buffers: shared hit=4393
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.117 ms
 Serialization: time=0.009 ms  output=1kB  format=text
 Execution Time: 582.575 ms

The solution, in both databases, is to add an index on "age".

An expression-based index in PostgreSQL:

CREATE INDEX idx_age ON data ( ((details->>'age')::int) )
;

A regular index on MongoDB, but sparse as I don't need to index missing values:

db.data.createIndex({ age: 1 }, { sparse: true} );

Here is the new execution plan in MongoDB, which combines the multikey index on "tags" with the regular index on "age":

print( 
db.data.find({
  $or: [
    { tags: { $elemMatch: { $eq: "developer" } } },
    { age: { $lt: 35 } }
  ]
}).explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 3,
  executionTimeMillis: 0,
  totalKeysExamined: 4,
  totalDocsExamined: 5,
  executionStages: {
    isCached: false,
    stage: 'SUBPLAN',
    nReturned: 3,
    executionTimeMillisEstimate: 0,
    works: 6,
    advanced: 3,
    inputStage: {
      stage: 'FETCH',
      nReturned: 3,
      executionTimeMillisEstimate: 0,
      works: 6,
      advanced: 3,
      docsExamined: 3,
      alreadyHasObj: 2,
      inputStage: {
        stage: 'OR',
        nReturned: 3,
        executionTimeMillisEstimate: 0,
        works: 6,
        advanced: 3,
        dupsTested: 4,
        dupsDropped: 1,
        inputStages: [
          {
            stage: 'FETCH',
            filter: { '$or': [Array] },
            nReturned: 2,
            executionTimeMillisEstimate: 0,
            works: 3,
            advanced: 2,
            docsExamined: 2,
            alreadyHasObj: 0,
            inputStage: {
              stage: 'IXSCAN',
              nReturned: 2,
              executionTimeMillisEstimate: 0,
              works: 3,
              advanced: 2,
              keyPattern: [Object],
              indexName: 'tags_1',
              isMultiKey: true,
              multiKeyPaths: [Object],
              isUnique: false,
              isSparse: true,
              isPartial: false,
              indexVersion: 2,
              direction: 'forward',
              indexBounds: [Object],
              keysExamined: 2,
              seeks: 1,
              dupsTested: 2,
              dupsDropped: 0
            }
          },
          {
            stage: 'IXSCAN',
            nReturned: 2,
            executionTimeMillisEstimate: 0,
            works: 3,
            advanced: 2,
            keyPattern: { age: 1 },
            indexName: 'age_1',
            isMultiKey: false,
            multiKeyPaths: { age: [] },
            isUnique: false,
            isSparse: true,
            isPartial: false,
            indexVersion: 2,
            direction: 'forward',
            indexBounds: { age: [Array] },
            keysExamined: 2,
            seeks: 1,
            dupsTested: 0,
            dupsDropped: 0
          }
        ]
      }
    }
  }
}

I've detailed this OR expansion in a previous blog post:

PostgreSQL cannot do the same and you need to write the query differently:

postgres=# explain (analyze, costs off, buffers, serialize text)
-- Query for using GIN index on "details" column
SELECT * FROM data 
WHERE details @? '$?(@.tags[*] == "developer")'
UNION
-- Query for using B-tree index on "age" column
SELECT * FROM data 
WHERE (details->>'age')::int < 35
;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 HashAggregate (actual time=0.134..0.220 rows=3 loops=1)
   Group Key: data.id, data.details
   Batches: 1  Memory Usage: 1561kB
   Buffers: shared hit=9
   ->  Append (actual time=0.023..0.035 rows=4 loops=1)
         Buffers: shared hit=9
         ->  Bitmap Heap Scan on data (actual time=0.022..0.024 rows=2 loops=1)
               Recheck Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
               Heap Blocks: exact=1
               Buffers: shared hit=5
               ->  Bitmap Index Scan on idx_details (actual time=0.012..0.012 rows=2 loops=1)
                     Index Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
                     Buffers: shared hit=4
         ->  Bitmap Heap Scan on data data_1 (actual time=0.009..0.010 rows=2 loops=1)
               Recheck Cond: (((details ->> 'age'::text))::integer < 35)
               Heap Blocks: exact=1
               Buffers: shared hit=4
               ->  Bitmap Index Scan on idx_age (actual time=0.008..0.008 rows=2 loops=1)
                     Index Cond: (((details ->> 'age'::text))::integer < 35)
                     Buffers: shared hit=3
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.160 ms
 Serialization: time=0
                                    
                                    
                                    
                                    
                                

MongoDB equivalent for PostgreSQL JSONB operations

This article examines the PostgreSQL JSONB operators that benefit from GIN indexes, as listed in Built-in GIN Operator Classes, and shows their MongoDB equivalents.

I created a table containing one million rows, plus three additional rows relevant to my queries. The goal is to demonstrate the benefits of using indexes, allowing for efficient data retrieval without the need to read all rows.

Sample table in PostgreSQL

Here is a table that simulates a document collection, with all attributes in a JSONB column:

CREATE TABLE data (
    id SERIAL PRIMARY KEY,
    details JSONB
);

INSERT INTO data (details)
SELECT jsonb_build_object(
    'stuff', md5(random()::text)
)
FROM generate_series(1, 1000000);

ANALYZE data;

INSERT INTO data (details) VALUES
('{"name": "Alice", "age": 30, "tags": ["developer", "engineer"]}'),
('{"name": "Bob", "age": 25, "tags": ["designer"]}'),
('{"name": "Carol", "age": 40, "tags": ["developer", "manager"]}');
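
As a quick sanity check (my addition, not part of the original setup), the table should now hold 1,000,003 rows, only three of which carry an "age" key:

-- expect total_rows = 1000003 and rows_with_age = 3
SELECT count(*)                                AS total_rows,
       count(*) FILTER (WHERE details ? 'age') AS rows_with_age
FROM data;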

Sample collection on MongoDB

I am creating the same data in a MongoDB collection:

const bulk = db.data.initializeUnorderedBulkOp();
for (let i = 0; i < 1000000; i++) {
    bulk.insert({
        stuff: Math.random().toString(36).substring(2, 15)
    });
}
bulk.execute();

db.data.insertMany([
    { "name": "Alice", "age": 30, "tags": ["developer", "engineer"] },
    { "name": "Bob", "age": 25, "tags": ["designer"] },
    { "name": "Carol", "age": 40, "tags": ["developer", "manager"] }
]);
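
The same kind of hedged sanity check on the MongoDB side:

// expect 1000003 documents in total, 3 of them with an "age" field
db.data.countDocuments();
db.data.countDocuments({ age: { $exists: true } });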

@> (jsonb, jsonb): Contains

The following queries search for data where the "tags" array contains a given value:

PostgreSQL:

SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;

MongoDB:

db.data.find({ tags: { $all: ["developer"] } })
;
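
Note (my addition): with a single value, $all is equivalent to a plain equality match on the array field, which matches documents where any element equals that value; this is also how the filter appears in the execution plans below:

// Equivalent formulation: equality on an array field matches any element
db.data.find({ tags: "developer" });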

Without an index, this runs a SeqScan on PostgreSQL:

postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on data (actual time=144.002..144.004 rows=2 loops=1)
   Filter: (details @> '{"tags": ["developer"]}'::jsonb)
   Rows Removed by Filter: 1000001
   Buffers: shared hit=5715 read=4595
 Planning Time: 0.065 ms
 Serialization: time=0.008 ms  output=1kB  format=text
 Execution Time: 144.236 ms

and a COLLSCAN on MongoDB:

test> print( 
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 2,
  executionTimeMillis: 437,
  totalKeysExamined: 0,
  totalDocsExamined: 1000003,
  executionStages: {
    stage: 'COLLSCAN',
    filter: { tags: { '$eq': 'developer' } },
    nReturned: 2,
    executionTimeMillisEstimate: 391,
    works: 1000004,
    advanced: 2,
    direction: 'forward',
    docsExamined: 1000003
  }
}

I create a jsonb_path_ops GIN index on PostgreSQL:

CREATE INDEX idx_details ON data USING GIN (details jsonb_path_ops)
;

and a sparse multi-key index on MongoDB for the array of tags:

db.data.createIndex({ tags: 1 }, { sparse: true })
;
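
A note on the opclass choice (my addition): jsonb_path_ops builds a smaller, faster GIN index but supports only the containment and JSON-path operators (@>, @?, @@). If the key-existence operators (?, ?|, ?&) were also needed, the default opclass would be required; a minimal sketch, not used in the rest of this post:

-- Hypothetical alternative: default jsonb_ops opclass, larger but also supports ?, ?| and ?&
CREATE INDEX idx_details_default ON data USING GIN (details);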

PostgreSQL uses the GIN index to find the two index entries, through a Bitmap Index Scan, and then fetches the two rows; the Recheck Cond doesn't have to filter anything more:

postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Bitmap Heap Scan on data (actual time=0.019..0.021 rows=2 loops=1)
   Recheck Cond: (details @> '{"tags": ["developer"]}'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_details (actual time=0.010..0.010 rows=2 loops=1)
         Index Cond: (details @> '{"tags": ["developer"]}'::jsonb)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.079 ms
 Serialization: time=0.005 ms  output=1kB  format=text
 Execution Time: 0.041 ms

MongoDB efficiently reads two index entries and retrieves the two documents without the need for bitmaps. This preserves the index order, which is advantageous when handling multiple rows (see the sketch after the execution plan below).

test> print( 
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
  executionSuccess: true,
  nReturned: 2,
  executionTimeMillis: 0,
  totalKeysExamined: 2,
  totalDocsExamined: 2,
  executionStages: {
    stage: 'FETCH',
    nReturned: 2,
    executionTimeMillisEstimate: 0,
    works: 3,
    advanced: 2,
    docsExamined: 2,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 2,
      executionTimeMillisEstimate: 0,
      works: 3,
      advanced: 2,
      keyPattern: { tags: 1 },
      indexName: 'tags_1',
      isMultiKey: true,
      multiKeyPaths: { tags: [ 'tags' ] },
      isSparse: true,
      isPartial: false,
      direction: 'forward',
      indexBounds: { tags: [ '["developer", "developer"]' ] },
      keysExamined: 2,
      seeks: 1,
      dupsTested: 2
    }
  }
}
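
To sketch why the preserved index order can matter (my own illustration with a hypothetical compound index; the rest of this post keeps the single-field indexes): an equality match on the multi-key "tags" field with a trailing sort key may return documents already ordered by that key, which you can verify by checking the plan for the absence of a blocking SORT stage.

// Hypothetical compound index: equality on "tags", ordering on "age"
db.data.createIndex({ tags: 1, age: 1 }, { sparse: true });
// Documents for one tag, ideally returned in index order on "age"
db.data.find({ tags: "developer" }).sort({ age: 1 }).explain('executionStats');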

@? (jsonb, jsonpath): JSON Path Match

The following query searches for developers, as well as non-developers younger than 35:

On PostgreSQL:

SELECT * FROM data 
 WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;

On MongoDB:

db.data.find({
  $or: [
    { tags: { $elemMatch: { $eq: "developer" } } },
    { age: { $lt: 35 } }
  ]
});

Without an additional index on "age", MongoDB chooses a COLLSCAN:

test> print( 
db.data.find({
  $or: [
    { tags: { $elemMatch: { $eq: "developer" } } },
    { age: { $lt: 35 } }
  ]
}).explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 3,
  executionTimeMillis: 585,
  totalKeysExamined: 0,
  totalDocsExamined: 1000003,
  executionStages: {
    isCached: false,
    stage: 'SUBPLAN',
    nReturned: 3,
    executionTimeMillisEstimate: 547,
    works: 1000004,
    advanced: 3,
    needTime: 1000000,
    inputStage: {
      stage: 'COLLSCAN',
      filter: {
        '$or': [
          { tags: { '$elemMatch': [Object] } },
          { age: { '$lt': 35 } }
        ]
      },
      nReturned: 3,
      executionTimeMillisEstimate: 517,
      works: 1000004,
      advanced: 3,
      needTime: 1000000,
      direction: 'forward',
      docsExamined: 1000003
    }
  }
}

PostgreSQL uses the GIN index, but not efficiently: the GIN index cannot resolve the range predicate on "age", so the Bitmap Index Scan returns all entries and they are removed later by the recheck on the heap:

postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data 
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on data (actual time=582.323..582.327 rows=3 loops=1)
   Recheck Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
   Rows Removed by Index Recheck: 1000000
   Heap Blocks: exact=10310
   Buffers: shared hit=14703
   ->  Bitmap Index Scan on idx_details (actual time=123.755..123.755 rows=1000003 loops=1)
         Index Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
         Buffers: shared hit=4393
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.117 ms
 Serialization: time=0.009 ms  output=1kB  format=text
 Execution Time: 582.575 ms

The solution, in both databases, is to add an index on "age".

An expression-based index in PostgreSQL:

CREATE INDEX idx_age ON data ( ((details->>'age')::int) )
;
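
One caveat (my addition): PostgreSQL only considers this expression index when the query uses the exact same expression. A hedged counter-example that would typically not use idx_age because the cast differs:

-- The numeric cast does not match the indexed expression (details->>'age')::int
SELECT * FROM data WHERE (details->>'age')::numeric < 35;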

A regular index on MongoDB, made sparse since I don't need to index missing values:

db.data.createIndex({ age: 1 }, { sparse: true });

Here is the new execution plan in MongoDB, which combines the multi-key index on "tags" with the regular index on "age":

print( 
db.data.find({
  $or: [
    { tags: { $elemMatch: { $eq: "developer" } } },
    { age: { $lt: 35 } }
  ]
}).explain('executionStats').executionStats 
);
{
  executionSuccess: true,
  nReturned: 3,
  executionTimeMillis: 0,
  totalKeysExamined: 4,
  totalDocsExamined: 5,
  executionStages: {
    isCached: false,
    stage: 'SUBPLAN',
    nReturned: 3,
    executionTimeMillisEstimate: 0,
    works: 6,
    advanced: 3,
    inputStage: {
      stage: 'FETCH',
      nReturned: 3,
      executionTimeMillisEstimate: 0,
      works: 6,
      advanced: 3,
      docsExamined: 3,
      alreadyHasObj: 2,
      inputStage: {
        stage: 'OR',
        nReturned: 3,
        executionTimeMillisEstimate: 0,
        works: 6,
        advanced: 3,
        dupsTested: 4,
        dupsDropped: 1,
        inputStages: [
          {
            stage: 'FETCH',
            filter: { '$or': [Array] },
            nReturned: 2,
            executionTimeMillisEstimate: 0,
            works: 3,
            advanced: 2,
            docsExamined: 2,
            alreadyHasObj: 0,
            inputStage: {
              stage: 'IXSCAN',
              nReturned: 2,
              executionTimeMillisEstimate: 0,
              works: 3,
              advanced: 2,
              keyPattern: [Object],
              indexName: 'tags_1',
              isMultiKey: true,
              multiKeyPaths: [Object],
              isUnique: false,
              isSparse: true,
              isPartial: false,
              indexVersion: 2,
              direction: 'forward',
              indexBounds: [Object],
              keysExamined: 2,
              seeks: 1,
              dupsTested: 2,
              dupsDropped: 0
            }
          },
          {
            stage: 'IXSCAN',
            nReturned: 2,
            executionTimeMillisEstimate: 0,
            works: 3,
            advanced: 2,
            keyPattern: { age: 1 },
            indexName: 'age_1',
            isMultiKey: false,
            multiKeyPaths: { age: [] },
            isUnique: false,
            isSparse: true,
            isPartial: false,
            indexVersion: 2,
            direction: 'forward',
            indexBounds: { age: [Array] },
            keysExamined: 2,
            seeks: 1,
            dupsTested: 0,
            dupsDropped: 0
          }
        ]
      }
    }
  }
}

I've detailed this OR expansion in a previous blog post.

PostgreSQL cannot do the same, so you need to write the query differently, with a UNION:

postgres=# explain (analyze, costs off, buffers, serialize text)
-- Query for using GIN index on "details" column
SELECT * FROM data 
WHERE details @? '$?(@.tags[*] == "developer")'
UNION
-- Query for using B-tree index on "age" column
SELECT * FROM data 
WHERE (details->>'age')::int < 35
;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 HashAggregate (actual time=0.134..0.220 rows=3 loops=1)
   Group Key: data.id, data.details
   Batches: 1  Memory Usage: 1561kB
   Buffers: shared hit=9
   ->  Append (actual time=0.023..0.035 rows=4 loops=1)
         Buffers: shared hit=9
         ->  Bitmap Heap Scan on data (actual time=0.022..0.024 rows=2 loops=1)
               Recheck Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
               Heap Blocks: exact=1
               Buffers: shared hit=5
               ->  Bitmap Index Scan on idx_details (actual time=0.012..0.012 rows=2 loops=1)
                     Index Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
                     Buffers: shared hit=4
         ->  Bitmap Heap Scan on data data_1 (actual time=0.009..0.010 rows=2 loops=1)
               Recheck Cond: (((details ->> 'age'::text))::integer < 35)
               Heap Blocks: exact=1
               Buffers: shared hit=4
               ->  Bitmap Index Scan on idx_age (actual time=0.008..0.008 rows=2 loops=1)
                     Index Cond: (((details ->> 'age'::text))::integer < 35)
                     Buffers: shared hit=3
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.160 ms
 Serialization: time=0
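
Since UNION deduplicates the two branches with a HashAggregate on the whole rows, a possible refinement (my sketch, not from the original post) is UNION ALL with an exclusion predicate in the second branch, so no deduplication step is needed:

-- Hedged alternative: the branches are disjoint, so UNION ALL returns no duplicates
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer")'
UNION ALL
SELECT * FROM data
WHERE (details->>'age')::int < 35
  AND NOT details @? '$?(@.tags[*] == "developer")'
;

Whether this is faster depends on the selectivity of each branch; it mainly avoids the HashAggregate shown above.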

MongoDB Atlas On-Premise: Challenges and Flexible Alternatives

If you’re running enterprise applications, you might be facing a common dilemma: cloud solutions offer great features, but regulatory constraints and data sovereignty laws might prevent you from using them fully. In fact, about 60% of enterprises struggle with this exact problem. While cloud computing delivers scalability and convenience, your organization might require on-premises or […]

Scheduled scaling of Amazon Aurora Serverless with Amazon EventBridge Scheduler

In this post, we demonstrate how you can implement scheduled scaling for Aurora Serverless using Amazon EventBridge Scheduler. By proactively adjusting minimum Aurora Capacity Units (ACUs), you can achieve faster scaling rates during peak periods while maintaining cost efficiency during low-demand times.