A curated list of database news from authoritative sources

June 22, 2025

faster $lookup after $group in MongoDB aggregation pipeline

In the previous post, One million $lookup challenge, I said that $lookup is efficient at the end of an aggregation pipeline, not before the aggregation. I was referencing another post, Comparison of JOINS 👉🏻 aggregation pipeline and CTEs, where I simply inverted the join ($lookup) and the aggregation ($group) to fix a bad query. But what if you have to look up from millions of documents without an aggregation, as in the One million $lookup challenge?

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

A many-to-one relationship should embed values directly, eliminating the need for lookups within a document model. However, I diverged from this recommendation to build a demo illustrating that lookups are acceptable when dealing with a limited number of documents, such as after aggregation.

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dim" ,
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, the operation completes in ten seconds. While the Index Nested Loop Join on one million documents with a scalar "ref" takes some time, performance can be improved by avoiding multiple lookups for the same value.

$group, $lookup, $unwind: 3 seconds

I group by the "ref" values prior to the join to minimize lookup operations, ensuring each value is processed only once:

x=db.fact.aggregate([
  { // one group per "ref" with an array of "facts" 
    $group: {
      _id: "$ref",
      facts: { $push: "$$ROOT" },
    }
  },
  { // join from the group
    $lookup: {
      from:          "dim",
      localField:    "_id",
      foreignField:  "_id",
      as:            "dim"
    }
  },
  { // un-group back to one document per fact
    $unwind: "$facts"
  },
  { // project to original fields
    $project: {
      _id:           "$facts._id",
      ref:           "$facts.ref",
      value:         "$facts.value",
      dim:           "$dim",
    }
  }
]).explain("executionStats");
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

This takes three seconds to run. With only one index lookup per distinct value, one thousand instead of one million, the lookup stage becomes much faster. Although grouping and unwinding add some time, if a $group is already present, as is typical in an aggregation pipeline, performing the lookup after the grouping is simply more efficient than doing it beforehand.

June 21, 2025

June 20, 2025

Implement a rollback strategy for Amazon Aurora PostgreSQL upgrades using Amazon RDS Blue/Green deployments

Amazon Aurora PostgreSQL-Compatible Edition supports managed blue/green deployments to help reduce downtime and minimize risk during updates. Even with thorough planning and testing in non-production environments, unexpected issues can emerge after a version upgrade. In these cases, having a rollback plan is essential to quickly restore service stability. While the managed Blue/Green deployment feature doesn’t currently include built-in rollback functionality, you can implement alternative solutions for version management. In this post, we show how you can manually set up a rollback cluster using self-managed logical replication to maintain synchronization with the newer version after an Amazon RDS Blue/Green deployment switchover.

MySQL Orchestrator Failover Behavior During Replication Lag

Managing farms of MySQL servers in a replication environment is much easier with the help of a MySQL orchestrator tool, which ensures a smooth transition when an ad hoc failover or a planned/graceful switchover comes into action. Several configuration parameters play a crucial role in controlling and influencing failover behavior. In this […]

June 19, 2025

How an AWS customer in the learning services industry migrated and modernized SAP ASE to Amazon Aurora PostgreSQL

In this post, we explore how a leading AWS customer in the learning services industry successfully modernized its legacy SAP ASE environment by migrating to Amazon Aurora PostgreSQL-Compatible Edition. Partnering with AWS, the customer engineered a comprehensive migration strategy to transition from a proprietary system to an open source database while providing high availability, performance optimization, and cost-efficiency.

One million $lookup challenge

If you have read my previous post, $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values from thousands of documents. $lookup is useful at the end of an aggregation pipeline, not before the aggregation (examples in Comparison of JOINS 👉🏻 aggregation pipeline and CTEs), on a million-document collection. However, such a collection should not require a join, as documents are designed to aggregate multiple related objects, unlike relational databases that normalize business data into multiple tables.

In a many-to-one relationship, it is common to embed fields in a document model, even when they are duplicated. Normalization plays a crucial role in relational databases to prevent these duplicates, as RDBMS were designed for interactive users executing SQL statements, where a missed update can lead to data integrity issues. While triggers can help manage updates to duplicated values and prevent anomalies, they introduce new challenges because they operate behind the update statement.

When updates originate from well-reviewed and tested programs, it is manageable to modify data in multiple locations, particularly when such updates are infrequent. Let's illustrate joins and the absence of joins with a simple test.

To join multiple documents with a small lookup table, you can cache the lookup table in your application. In this post, I tested several methods for retrieving a value from a lookup table: using a collection, a map, and an array. I integrated these methods into an aggregation pipeline, but keep in mind that this can also be accomplished within the application itself.

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData" ,
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in ten seconds. The query planner chooses an Index Nested Loop Join because there is an index. Without an index it could use a hash join.

Map to object and $getField: 61 seconds

To avoid the lookup, I read the dimension table into an object with one field per value, the field name being the "_id", and get the value with $getField:


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )


x=db.fact.aggregate([
    {  
        $addFields: {  
            dimValue: {  
                $getField: {  
                    field: { $toString: "$ref" },  
                    input: dimMap  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in one minute. Accessing a field by name is not an optimal operation, it is O(n), so this is a viable solution only for very small lookup tables.

Map to $switch branches: 23 seconds

Instead of using that map, I build a $switch statement to use in the aggregation pipeline.


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )

const switchBranches = Object.entries(dimMap).map(([id, value]) => ({  
    case: { $eq: ["$ref", parseInt(id)] },  
    then: value  
}));  
print( switchBranches )

x=db.fact.aggregate([  
    {  
        $addFields: {  
            dimValue: {  
                $switch: {  
                    branches: switchBranches,  
                    default: null  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in twenty seconds, and given that it puts the logic into the query, it is acceptable only for small lookup tables.

Map to array and $arrayElemAt: 1 second

Instead of a map, I use an array where the index is the "_id". As I have no guarantee that the "_id" values are sequential with no gaps, I build a sparse array that I fill with the existing values.


// Get the maximum ID
const maxId = db.dim.aggregate([
 {$group:{_id:null,max:{$max:"$_id"}}}
]).toArray()[0].max;  
// Create a sparse array for all values
const dimValues = new Array(maxId + 1).fill(null);  
// store the values at the right ID
db.dim.find({},{_id:1,value:1}).forEach(
 d => dimValues[d._id] = d.value
);  
print(dimValues)

//
x=db.fact.aggregate([  
    { $addFields: { dimValue: { $arrayElemAt: [dimValues, "$ref"] } } }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

This is fast and runs in one second. However, it works only when the lookup identifiers are under control, ideally starting from one in a gap-free sequence.

Embed rather than join (denormalization)

Finally, as recommended for a document model (Model One-to-Many Relationships with Embedded Documents), I duplicate the dimension value into each fact document. I run this update with an aggregation pipeline.


const startTime = new Date(); 
db.fact.aggregate([  
    {  
        $lookup: {  
            from: "dim",  
            localField: "ref",  
            foreignField: "_id",  
            as: "dimData"  
        }  
    },  
    {  
        $out: "fact"  
    }  
])  

const endTime = new Date(); 
const executionTime = (endTime - startTime) / 1000;  
print(`Update execution time: ${executionTime} seconds`);

This should be executed once, and then only the updated dimension values should be synchronized. This update took 16 seconds on my data.
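
When a dimension value does change, a minimal sketch of the synchronization, using this demo's field names, could look like the following (the embedded copy lives in the "dimData" array created by the $lookup/$out step above):

// Update the dimension document, then propagate the change to the facts that reference it
db.dim.updateOne({ _id: 42 }, { $set: { value: 0.5 } });
db.fact.updateMany(
  { ref: 42 },
  { $set: { "dimData.$[].value": 0.5 } }  // rewrite the embedded copy in every referencing fact
);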

To compare, I can simply read the document and project the embedded value:

x=db.fact.aggregate([  
    {  
        $project: {  
            _id: 1,  
            ref: 1,  
            value: 1,  
            dimData: 1  // simply project the embedded dimension value  
        }  
    }  
]).explain("executionStats");  
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

This query takes around 0.5 seconds on my data. This approach is advisable unless you are dealing with frequently updated lookup tables. Additionally, in MongoDB, a single compound index can cover all fields within a document. Typically, when filtering on a dimension, lookup, or reference table, the filter is applied to a business field rather than the internal "_id".
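
To illustrate that point, here is a minimal sketch, assuming the dimension document was embedded as "dimData" by the previous step: the fact collection can be indexed and filtered on the embedded business value, without joining back to the "dim" collection.

// Hypothetical index on the embedded business value (field names follow this demo's schema)
db.fact.createIndex({ "dimData.value": 1 });
// Filter facts directly on the embedded dimension value rather than on the dimension "_id"
db.fact.find({ "dimData.value": { $gt: 0.99 } }, { _id: 0, ref: 1, value: 1 });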

Conclusion

I have tested my example using various cardinalities for both the fact table and the dimension lookup table. Below is the raw data.

  dim      fact    lookup  $getField   $switch  $arrayElemAt    update  single doc
   10      1000    0.008s     0.002s    0.001s        0.001s     0.08s          0s
  100      1000    0.008s     0.006s    0.005s        0.001s    0.078s          0s
 1000      1000    0.011s     0.062s    0.033s        0.001s    0.082s          0s
10000      1000    0.013s     0.754s    0.067s        0.003s     0.08s          0s
   10     10000    0.075s     0.021s    0.016s        0.012s    0.199s      0.005s
  100     10000    0.078s     0.066s    0.055s        0.013s    0.191s      0.005s
 1000     10000    0.105s      0.62s    0.292s        0.013s    0.229s      0.005s
10000     10000    0.104s      6.94s    0.305s        0.015s    0.237s      0.005s
   10    100000    0.738s     0.215s    0.171s        0.129s    1.306s      0.052s
  100    100000    0.781s     0.673s    0.571s        0.131s    1.359s      0.052s
 1000    100000    1.044s     6.259s     2.71s        0.141s    1.756s      0.054s
10000    100000    1.068s    73.205s    2.702s        0.144s    1.769s      0.059s
   10   1000000    7.583s     2.199s    1.761s        1.332s   12.524s      0.559s
  100   1000000    7.992s     6.634s    5.741s        1.346s    13.03s      0.557s
 1000   1000000   10.551s    62.385s     26.4s        1.398s   16.771s      0.557s
10000   1000000   10.794s   742.086s   26.039s        1.437s   17.008s      0.578s
   10  10000000   76.225s    22.127s   17.795s       13.196s  124.922s      5.789s
  100  10000000   80.828s    67.602s   57.981s       13.695s  131.738s      5.714s
 1000  10000000  106.194s   622.382s  267.555s       14.054s  168.854s      5.778s
10000  10000000  107.211s  7351.675s  265.404s       14.046s   171.13s      5.767s

An array accessed with $arrayElemAt is optimized for retrieving a value by position, while the other structures tested here have O(n) complexity. However, the array is fixed when the pipeline is built, which limits its flexibility compared to tables or collections. You may find more suitable structures in your application language; they resemble the hash tables that SQL databases build for hash joins. MongoDB can use a hash join for $lookup when the lookup collection is small, when spilling to disk is permissible, and when there's no index.
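
As a sketch of such an application-side structure, in mongosh a JavaScript Map gives O(1) access by key, much like the hash table a SQL engine builds for a hash join (collection and field names follow this demo's schema):

// Load the dimension once into a hash map keyed by "_id"
const dimHash = new Map();
db.dim.find({}, { _id: 1, value: 1 }).forEach(d => dimHash.set(d._id, d.value));
// Resolve the reference in application code while iterating the facts
db.fact.find().limit(3).forEach(f => print(f.ref, dimHash.get(f.ref)));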

When the lookup table is infrequently updated, applying updates to the embedded values is generally preferable, paying the price once at write and getting faster reads. MongoDB offers developers greater control over data access patterns and cardinalities, rather than relying solely on the query planner, which can lead to plan instability and runaway queries. In contrast, SQL databases cannot implement this flexibility without violating Codd's rules on data independence for relational databases.

A key distinction between MongoDB and SQL databases, including those that use a MongoDB API on top of an RDBMS, is their physical data model capabilities. RDBMS systems prioritize normalization and utilize efficient join algorithms for relational data models. In contrast, MongoDB provides flexible schemas for application objects and supports joins where the join key can be an array ($lookup: more than just a SQL join) as part of an aggregation pipeline. While this may be less efficient for simple many-to-one relationships with scalar values, MongoDB's document data model can often eliminate the need for joins altogether. Additionally, caching lookup values in the application is a viable option.

Looking Ahead: A Confident New Chapter for Percona

Percona has always been more than just a companyβ€”it’s a mission-driven community built on a strong foundation of innovation, resilience, and open source excellence. For nearly two decades, Percona has been a trusted partner to organizations around the world, remaining steadfast in our mission while continuing to grow, improve, and evolve. Today, we mark the […]

June 17, 2025

$lookup: more than just a SQL join

For developers coming from a SQL background to MongoDB, the $lookup operation in an aggregation pipeline resembles a LEFT OUTER JOIN. For instance, the following MongoDB query:

        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }

is often compared to the following SQL statement:

        SELECT *  
        FROM xxa  
        LEFT JOIN xxb
        ON xxa.xxa1 = xxb.xxb1;  

If you assume that these two operations are identical across various databases, you may misinterpret the expected outcomes and their performance:

  1. In MongoDB, any field can be a scalar or an array. The $lookup function operates as an intersection of two sets, unlike SQL's equality operator (=) which operates on two values. It is more similar to PostgreSQL's && operator for arrays.
  2. The results in MongoDB do not duplicate outer rows for each inner row as in SQL. Instead, an array of matching inner documents is projected into the outer document. If an $unwind stage follows the $lookup, it unnests this array into multiple documents, as sketched right after this list.
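
Here is a minimal sketch of that $lookup followed by $unwind, using the same collections and fields as the examples below; without the preserveNullAndEmptyArrays option, outer documents with no match are dropped, like an inner join:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    },
    { $unwind: "$xxb" }  // one output document per matching inner document, like SQL row duplication
]);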

To illustrate the semantics, consider this simple example: I will begin with scalars in two collections ("xxa" and "xxb") and then use arrays in their fields instead of scalars.

Scalar on the outer and inner collections

I define two small collections with some values in common and some which do not match:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: 102 },
  { xxb1: 103 },
]);

The following query "joins" the two with a $lookup:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  { _id: ObjectId('68504d1df99599b54cd4b118'), xxa1: 101, xxb: [] },
  {
    _id: ObjectId('68504d1df99599b54cd4b119'),
    xxa1: 102,
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  }
]

For each document from the outer collection, an array is created. If there is no matching document in the inner collection, the array remains empty. If there is one matching document, it will be included in the array. This looks like a LEFT OUTER JOIN in SQL that returns a structured one-to-many rather than a tabular result with duplicated values. However, this is a special case where we are joining on scalar fields.

Array on the outer, scalar in the inner

I replace the outer collection with some arrays:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d43f99599b54cd4b11c'),
    xxa1: [ 100, 101 ],
    xxb: []
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11d'),
    xxa1: [ 101, 102 ],
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11e'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 },
      { _id: ObjectId('68504d26f99599b54cd4b11b'), xxb1: 103 }
    ]
  }
]

The semantics are similar, except that a document from the inner collection matches as soon as its value exists in the outer array. When no value is contained, the added array is empty. When multiple values from the outer document find a matching document in the inner collection, all those documents are added to the array in the result.

The array in the outer collection acts as a one-to-many reference. In SQL, such a one-to-many reference cannot be stored in a single column; it is transformed into a many-to-one reference on the opposite side, and an index is created on the foreign key to allow navigating in the other direction.

Scalar on the outer, array in the inner

I do the opposite:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: [100,101] },
  { xxb1: [101,102] },
  { xxb1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d5ef99599b54cd4b11f'),
    xxa1: 101,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d5ef99599b54cd4b120'),
    xxa1: 102,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

A document from the inner collection matches as soon as one of its values equals a value from the outer document. If multiple inner documents match, they all appear in the resulting array.

Arrays on both sides

The general case can handle arrays on both sides:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d7ef99599b54cd4b124'),
    xxa1: [ 100, 101 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b125'),
    xxa1: [ 101, 102 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b126'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

In this context, 'matching' refers to the intersection between the set of keys from the outer document and the set of keys found in an inner document.

Representing the same data in a SQL database requires more than two tables, as junction tables are needed in place of the arrays:

-- Main entities  
CREATE TABLE xxa (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

CREATE TABLE xxb (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

-- Junction tables to represent the arrays  
CREATE TABLE xxa_values (  
    xxa_id INT REFERENCES xxa(id),  
    value INT,  
    PRIMARY KEY (xxa_id, value)  
);  

CREATE TABLE xxb_values (  
    xxb_id INT REFERENCES xxb(id),  
    value INT,  
    PRIMARY KEY (xxb_id, value)  
);  

Joining all tables and deduplicating the results serves as the equivalent of a MongoDB lookup:

SELECT   
    xxa.id as xxa_id,  
    ARRAY_AGG(DISTINCT xxa_vals.value) as xxa1,  
    ARRAY_AGG(  
        DISTINCT jsonb_build_object(  
            'id', xxb.id,  
            'xxb1', ARRAY(  
                SELECT value   
                FROM xxb_values   
                WHERE xxb_id = xxb.id   
                ORDER BY value  
            )  
        )  
    ) FILTER (WHERE xxb.id IS NOT NULL) as xxb  
FROM xxa  
LEFT JOIN xxa_values xxa_vals ON xxa.id = xxa_vals.xxa_id  
LEFT JOIN xxb_values xxb_vals ON xxa_vals.value = xxb_vals.value  
LEFT JOIN xxb ON xxb_vals.xxb_id = xxb.id  
GROUP BY xxa.id  
ORDER BY xxa.id;  

The next time you hear that a lookup in MongoDB is the same as a SQL join and you don't understand the result or the performance, remember that it is a different, higher-level operation on a flexible schema.
A simple left outer join in SQL often disappears in MongoDB, as the interconnected objects belong to a single document, like an aggregate in Domain-Driven Design.

Conclusion

I used abstract names for collections and fields in order to explain the behavior, but many modern applications benefit from a small array, where an additional association table would complicate things unnecessarily. In contemporary apps, users are identified by one or more email addresses, resources by tags, blogs by categories, and videos by interests, and many queries look for common interests, categories, or tags.

Percona Software for MongoDB Release Plan Updates

Percona commits to delivering robust, enterprise-grade, and community-driven software for MongoDB. We build our databases and tools to meet evolving data needs. As MongoDB continues to evolve, we’ve updated and clarified our release strategy for the Percona Software for MongoDB suite. This suite includes: Percona Server for MongoDB (PSMDB) Percona Backup for MongoDB (PBM) Percona […]

June 16, 2025

Implement row-level security in Amazon Aurora MySQL and Amazon RDS for MySQL

Row-level security (RLS) is a security mechanism that enhances data protection in scalable applications by controlling access at the individual row level. It enables organizations to implement fine-grained access controls based on user attributes, so users can only view and modify data they’re authorized to access. This post focuses on implementing a cost-effective custom RLS solution using native MySQL features, making it suitable for a wide range of use cases without requiring additional software dependencies. This solution is applicable for both Amazon Relational Database Service (Amazon RDS) for MySQL and Amazon Aurora MySQL-Compatible Edition, providing flexibility for users of either service.

Integrating Repmgr with Barman: Managing Backups During Switchovers

Repmgr is an open source tool suite for managing replication and failover in a cluster of PostgreSQL servers. It enhances PostgreSQL’s built-in hot-standby capabilities with tools to set up standby servers, monitor replication, and perform administrative tasks such as failover or manual switchover operations. If you’re already familiar with Patroni, think of repmgr as a […]

June 15, 2025

Queries on JSON πŸ‘‰πŸ» compound indexes (Equality, Sort, Range)

In a blog post titled New Benchmarks Show Postgres Dominating MongoDB in Varied Workloads, EDB claims that Postgres outperforms MongoDB in document-based data tests. While I generally find such marketing benchmarks useless, they highlight mistakes made by vendors comparing a database where they are experts with one they don't know and are unwilling to learn about. This provides an opportunity for educating on efficient index and query design.

There are four queries in this benchmark, with data loaded from the GitHub Archive for 2015. The data is in JSON, which makes it suitable for testing queries on documents, and the queries are OLAP-style, using aggregation pipelines in MongoDB.

Load documents to a collection

I used the same method to load data in a small lab to reproduce the queries:

for file in http://data.gharchive.org/2015-{01..12}-{01..31}-{0..23}.json.gz
 do
  wget -q -o /dev/null -O - $file |
  gunzip -c |
  mongoimport --collection="github2015" 
 done

Here is an example of one document:

db.github2015.find().limit(1);

[
  {
    _id: ObjectId('684ee281d8d90a383a078112'),
    id: '2489368070',
    type: 'PushEvent',
    actor: {
      id: 9152315,
      login: 'davidjhulse',
      gravatar_id: '',
      url: 'https://api.github.com/users/davidjhulse',
      avatar_url: 'https://avatars.githubusercontent.com/u/9152315?'
    },
    repo: {
      id: 28635890,
      name: 'davidjhulse/davesbingrewardsbot',
      url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'                                                                      
    },
    payload: {
      push_id: 536740396,
      size: 1,
      distinct_size: 1,
      ref: 'refs/heads/master',
      head: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
      before: '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
      commits: [
        {
          sha: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
          author: {
            email: 'da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com',
            name: 'davidjhulse'
          },
          message: 'Altered BingBot.jar\n\nFixed issue with multiple account support',                                                         
          distinct: true,
          url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'                 
        }
      ]
    },
    public: true,
    created_at: '2015-01-01T00:00:00Z'
  }
]

This dataset is ideal for testing a document database because:

  • documents have an average size of three kilobytes, with some up to two megabytes.
  • it features a structured format with sub-documents like "actor" and "payload."
  • it contains arrays, such as "payload.commits", necessitating multi-key or inverted indexes.

I will test the four queries used by the benchmark and include an additional one that queries the array "payload.commit," which the benchmark overlooked. MongoDB's multi-key indexes significantly outperform all SQL databases in this regard (see the Multi-key Indexes series). While many vendor benchmarks limit their test coverage to create a favorable impression, my objective is to transparently showcase indexing best practices.

Single-field indexes

Here are the indexes that were created for the vendor benchmark:

 db.github2015.createIndex( {type:1} )
 db.github2015.createIndex( {"repo.name":1} )
 db.github2015.createIndex( {"payload.action":1} )
 db.github2015.createIndex( {"actor.login":1} )
 db.github2015.createIndex( {"payload.issue.comments":1} )

Seeing the index definitions, I already know why they got better results on PostgreSQL. Complex queries rarely filter or sort on a single field, and a well-designed database should have compound indexes. Without the right compound indexes, PostgreSQL can combine multiple indexes with bitmap scans, which is not ideal, as it cannot cover range filters or sort orders. However, it can be used to reduce the number of indexes created, as they negatively impact the vacuum process.

The document model in MongoDB offers the advantage of having all important fields consolidated within a single document. This allows for the use of a compound index that can effectively handle equality, range, and sort order. These indexes can be applied to both scalar values and arrays.

Benchmark queries

I ran benchmark queries in my lab to educate on indexing practices. I created the appropriate indexes, and the best starting point for documentation is The ESR (Equality, Sort, Range) Guideline.

Query a) Repositories order by most open issues quantity

The query is:

db.github2015.aggregate([  
  { $match: { $and: [ { type: "IssuesEvent"} , { "payload.action" : "opened" } ] }},  
  { $group: { _id: "$repo.name", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

Index according to the ESR guideline:

  • Equality: "type" and "payload.action", with the less selective first (better compression)
  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex(
 { "payload.action": 1, "type": 1, "repo.name": 1 }
)  

In my test, the query scanned nearly three hundred thousand index keys and returned in 700 milliseconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 72361,                                                                                                                     
          executionTimeMillis: 777,                                                                                                             
          totalKeysExamined: 278489,                                                                                                            
          totalDocsExamined: 0,    

Query b) Return git event type order by quantity from major to minor

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

Index according to the ESR guideline:

  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex({ "type": 1 })  

This index was created by the benchmark; however, it is important to know that the query planner will not choose the index without an equality or range predicate, even if it could help with sorting or grouping.

The query should simply add an unbounded range on the sort key:

db.github2015.aggregate([  
  { $match: { "type": { $gt: MinKey}  }},  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

In my test, the query aggregated ten million keys in 4.5 seconds:

        executionStats: {
          executionSuccess: true,
          nReturned: 14,
          executionTimeMillis: 4585,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0, 

Query c) Return the top 10 most active actors

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

Index according to the ESR guideline:

  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex({ "actor.login": 1 })  

As seen above, we need to add an unbounded range:

db.github2015.aggregate([  
  { $match: { "actor.login": { $gt: MinKey}  }}, 
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

In my test, the query aggregated ten million keys in 13 seconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 727312,
          executionTimeMillis: 13498,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0,

If you need to optimize this access pattern further, use the computed design pattern, and increment a login counter in an actor's collection, leveraging the atomic $inc operator.
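
Here is a minimal sketch of that computed pattern; the "actors" collection and its fields are assumptions, not part of the benchmark:

// Maintain a per-actor event counter at write time with the atomic $inc operator
db.actors.updateOne(
  { _id: "davidjhulse" },   // actor login used as the document key (assumption)
  { $inc: { events: 1 } },
  { upsert: true }
);
// The top 10 most active actors then becomes a simple indexed query:
// db.actors.find().sort({ events: -1 }).limit(10)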

Query d) Return repositories that have more than two comments and a specific event type, order by average comments from major to minor

The query is:

db.github2015.aggregate([  
  { $match: { "type": "PushEvent", "payload.issue.comments": { $gt : 2 } } },  
  { $group: { _id: "$repo.name", avg: { $avg: "$payload.issue.comments" } } },  
  { $sort: { avg: -1 } }  
])

Index according to the ESR guideline:

  • Equality: "type"
  • Sort: "repo.name", as grouping is faster on sorted keys
  • Range: "payload.issue.comments"
db.github2015.createIndex(
 { "type": 1, "repo.name": 1, "payload.issue.comments": 1  }
)

I don't know if it was done on purpose for the benchmark, but the dataset has no "PushEvent" with a "payload.issue", so the query returns an empty result.
For this particular case, as there are no keys to sort, it is better to place the range key before the sort key:

db.github2015.createIndex(
 { "type": 1, "payload.issue.comments": 1, "repo.name": 1 }
)  

The index immediately finds that there are no keys for those bounds:


        executionStats: {
          executionSuccess: true,
          nReturned: 0,
          executionTimeMillis: 1,
          totalKeysExamined: 0,
          totalDocsExamined: 0,

Ad-Hoc queries for OLAP

While acceptable performance can be achieved with the right index, it may not be the optimal solution for OLAP use cases. Instead, consider creating a single Atlas Search index to handle all related queries. For further guidance, refer to my previous post: Search Index for Reporting.

Atlas Search Indexes are maintained asynchronously on a dedicated node, ensuring no impact on the operational database. They provide a near-real-time state without the complexity of streaming changes to another database.

Another Query: Recent push events by user's commit

The recent push by a user's commit is a relevant use-case for this dataset. However, the benchmark did not run any queries on "commits" since it is an array. While PostgreSQL supports JSON, it cannot be directly compared to a document database like MongoDB that handles non-scalar fields natively. If you run the benchmark on PostgreSQL, try this query:

SELECT   
 data->'repo'->>'name' as repo_name,  
 data->'payload'->'commits' as commits_info  
FROM "github2015"  
WHERE   
 data->>'type' = 'PushEvent'  
 AND data->'payload'->'commits' @> '[{"author": {"name": "ggolden@umich.edu"}}]'  
ORDER BY   
 data->>'created_at' DESC  
LIMIT 5;  

You can explore various indexes, such as a GIN index, but you will never find one that directly retrieves the five documents needed for the result. If you do, please correct me and show the execution plan in a comment.

On MongoDB, the same index guideline applies:

  • Equality: "type" (low selectivity) and "payload.commits.author.name" (will be multi-key)
  • Sort: "created_at" (a must for pagination query)
db.github2015.createIndex({  
  "type": 1,  
  "payload.commits.author.name": 1,  
  "created_at": -1  
}) 

The query is simple and doesn't even need an aggregation pipeline:

db.github2015.find({      
  "type": "PushEvent",      
  "payload.commits.author.name": "ggolden@umich.edu"      
}, {  
  "repo.name": 1,  
  "payload.commits.author.name": 1,  
  "payload.commits.message": 1,  
  _id: 0  
}).sort({      
  "created_at": -1      
}).limit(5) 

The execution statistics indicate that only 5 documents have been read, which is the minimum required for the results:

  executionStats: {                                                                                                                                                                                          
    executionSuccess: true,                                                                                                                                                                                  
    nReturned: 5,                                                                                                                                                                                            
    executionTimeMillis: 0,                                                                                                                                                                                  
    totalKeysExamined: 5,                                                                                                                                                                                    
    totalDocsExamined: 5,                                                                                                                                                                                    
    executionStages: {                                                                                                                                                                                       

The execution plan illustrates how a single seek has read five index entries by utilizing a multi-key index. It applies the index bounds for both the equality filters and the sort order, ensuring an efficient retrieval process:

  stage: 'IXSCAN',          
  nReturned: 5,          
  executionTimeMillisEstimate: 0,          
...       
  isMultiKey: true,          
  multiKeyPaths: {          
    type: [],          
    'payload.commits.author.name': [ 'payload.commits' ],          
    created_at: []          
  },          
...                                                                                                                                                                                        
  direction: 'forward',
  indexBounds: {
    type: [ '["PushEvent", "PushEvent"]' ],
    'payload.commits.author.name': [ '["ggolden@umich.edu", "ggolden@umich.edu"]' ],
    created_at: [ '[MaxKey, MinKey]' ]
  },
  keysExamined: 5,
  seeks: 1, 
  dupsTested: 5,

Conclusion

Vendor benchmarks can be misleading with their time comparisons and performance claims. However, when analyzed critically, they serve an educational role by highlighting common design mistakes in the database with which they compare to.

In this post, I applied the [The ESR (Equality, Sort, Range) Guideline], looked at the execution plan, and tested queries that are relevant to a document model, to show the power of MongoDB to query JSON data.

Queries on JSON πŸ‘‰πŸ» compound indexes (Equality, Sort, Range)

In a blog post titled New Benchmarks Show Postgres Dominating MongoDB in Varied Workloads, EDB claims that Postgres outperforms MongoDB in document-based data tests. While I generally find such marketing benchmarks useless, they highlight mistakes made by vendors who compare a database where they are experts with one they don't know and have no desire to learn. This provides an opportunity to revisit some basics of efficient index and query design.

There are four queries in this benchmark, with data loaded from the GitHub archive for 2015. This data is in JSON, which makes it suitable for testing queries on documents, and the queries are OLAP-style, using aggregation pipelines in MongoDB.

Load documents to a collection

I used the same method to load data in a small lab to reproduce the queries:

# iterate over the hourly GitHub Archive files for 2015 (non-existent day/hour combinations fail silently)
for file in http://data.gharchive.org/2015-{01..12}-{01..31}-{0..23}.json.gz
 do
  # stream each archive, decompress it, and import the JSON documents into the "github2015" collection
  wget -q -o /dev/null -O - $file |
  gunzip -c |
  mongoimport --collection="github2015"
 done

Here is an example of one document:

db.github2015.find().limit(1);

[
  {
    _id: ObjectId('684ee281d8d90a383a078112'),
    id: '2489368070',
    type: 'PushEvent',
    actor: {
      id: 9152315,
      login: 'davidjhulse',
      gravatar_id: '',
      url: 'https://api.github.com/users/davidjhulse',
      avatar_url: 'https://avatars.githubusercontent.com/u/9152315?'
    },
    repo: {
      id: 28635890,
      name: 'davidjhulse/davesbingrewardsbot',
      url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'                                                                      
    },
    payload: {
      push_id: 536740396,
      size: 1,
      distinct_size: 1,
      ref: 'refs/heads/master',
      head: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
      before: '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
      commits: [
        {
          sha: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
          author: {
            email: 'da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com',
            name: 'davidjhulse'
          },
          message: 'Altered BingBot.jar\n\nFixed issue with multiple account support',                                                         
          distinct: true,
          url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'                 
        }
      ]
    },
    public: true,
    created_at: '2015-01-01T00:00:00Z'
  }
]

This dataset is ideal for testing a document database because:

  • documents have an average size of three kilobytes, with some up to two megabytes.
  • it features a structured format with sub-documents like "actor" and "payload."
  • it contains arrays, such as "payload.commits," necessitating multi-key or inverted indexes.

I will test the four queries used by the benchmark and include an additional one that queries the array "payload.commits," which the benchmark overlooked. MongoDB's multi-key indexes significantly outperform all SQL databases in this regard (see the Multi-key Indexes series). While many vendor benchmarks limit their test coverage to create a favorable impression, my objective is to transparently showcase indexing best practices.

Single-field indexes

Here are the indexes that were created for the vendor benchmark:

 db.github2015.createIndex( {type:1} )
 db.github2015.createIndex( {"repo.name":1} )
 db.github2015.createIndex( {"payload.action":1} )
 db.github2015.createIndex( {"actor.login":1} )
 db.github2015.createIndex( {"payload.issue.comments":1} )

Seeing the index definitions, I already know why they got better results on PostgreSQL for some queries. Complex queries rarely filter or sort on a single field, and a well-designed database should have compound indexes. Without the right compound indexes, PostgreSQL can combine multiple indexes with bitmap scans, which is not ideal, as it cannot cover range filters or sort orders. However, bitmap scans can reduce the number of indexes to create, which matters because each additional index adds to the vacuum workload.

The document model in MongoDB offers the advantage of having all important fields consolidated within a single document. This allows for the use of a compound index that can effectively handle equality, range, and sort order. These indexes can be applied to both scalar values and arrays.

Benchmark queries

I ran the benchmark queries in my lab to illustrate good indexing practices. I created the appropriate indexes; the best starting point in the documentation is The ESR (Equality, Sort, Range) Guideline.

Query a) Repositories order by most open issues quantity

The query is:

db.github2015.aggregate([  
  { $match: { $and: [ { type: "IssuesEvent"} , { "payload.action" : "opened" } ] }},  
  { $group: { _id: "$repo.name", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
]).explain("executionStats")

Index according to the ESR guideline:

  • Equality: "type" and "payload.action", with the less selective first (better compression)
  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex(
 { "payload.action": 1, "type": 1, "repo.name": 1 }
)  

In my test, the query scanned nearly three hundred thousand index keys and returned in under 800 milliseconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 72361,                                                                                                                     
          executionTimeMillis: 777,                                                                                                             
          totalKeysExamined: 278489,                                                                                                            
          totalDocsExamined: 0,    

In the vendor's benchmark results, MongoDB outperformed PostgreSQL for this query, even though it had to fetch half of the documents because the index covered only a single filter. This suggests that MongoDB can be significantly faster than PostgreSQL with the appropriate indexing.

Query b) Return git event type order by quantity from major to minor

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

Index according to the ESR guideline:

  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex({ "type": 1 })  

This index was created by the benchmark; however, it is important to know that the query planner will not consider it without an equality or range predicate on its key prefix, even if it could help with the grouping or sorting (this may be improved in the future if SERVER-13197 is implemented).

The query should simply add an unbounded range on the sort key:

db.github2015.aggregate([  
  { $match: { "type": { $gt: MinKey}  }},  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

In my test, the query aggregated ten million keys in 4.5 seconds:

        executionStats: {
          executionSuccess: true,
          nReturned: 14,
          executionTimeMillis: 4585,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0, 

I enhanced performance significantly by optimizing the query without creating a new index. MongoDB gives users greater control over data access than PostgreSQL, even allowing query planner hints to improve efficiency. In this case, it is sufficient to add a { $gt: MinKey } or { $lt: MaxKey } predicate on the grouping key.
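
To illustrate the hint option (a minimal sketch, not part of the original benchmark), the same aggregation can also be forced onto the { "type": 1 } index by passing a hint in the aggregate options:

db.github2015.aggregate([
  { $match: { "type": { $gt: MinKey } } },
  { $group: { _id: "$type", total: { $sum: 1 } } },
  { $sort: { total: -1 } }
], {
  hint: { "type": 1 }  // ask the planner to use the single-field index on "type"
})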

Query c) Return the top 10 most active actors

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

Index according to the ESR guideline:

  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex({ "actor.login": 1 })  

As seen above, we need to add an unbounded range filter to get the index considered by the query planner:

db.github2015.aggregate([  
  { $match: { "actor.login": { $gt: MinKey}  }}, 
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

In my test, the query aggregated ten million keys in 13 seconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 727312,
          executionTimeMillis: 13498,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0,

If you need to optimize this access pattern further, use the computed design pattern: increment a login counter in an "actors" collection, leveraging the atomic $inc operator, as sketched below.
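
Here is a minimal sketch of that computed pattern, assuming a hypothetical "actors" collection keyed by login (the collection and field names are illustrative, not part of the benchmark). The counter is maintained at write time, so the top 10 becomes a simple indexed read:

// increment the per-actor counter whenever an event is recorded
db.actors.updateOne(
  { _id: "davidjhulse" },        // the actor's login
  { $inc: { events: 1 } },       // atomic increment
  { upsert: true }               // create the counter document on the first event
)

// a descending index on the counter turns the top-10 query into a short index scan
db.actors.createIndex({ events: -1 })
db.actors.find().sort({ events: -1 }).limit(10)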

Query d) Return repositories that have more than two comments and a specific event type, order by average comments from major to minor

The query is:

db.github2015.aggregate([  
  { $match: { "type": "PushEvent", "payload.issue.comments": { $gt : 2 } } },  
  { $group: { _id: "$repo.name", avg: { $avg: "$payload.issue.comments" } } },  
  { $sort: { avg: -1 } }  
])

Index according to the ESR guideline:

  • Equality: "type"
  • Sort: "repo.name", as grouping is faster on sorted keys
  • Range: "payload.issue.comments"
db.github2015.createIndex(
 { "type": 1, "repo.name": 1, "payload.issue.comments": 1  }
)

I don't know if it was done on purpose for the benchmark, but the dataset has no "PushEvent" with a "payload.issue", so the query returns an empty result.
For this particular case, as there are no keys to sort, it is better to place the range key before the sort key:

db.github2015.createIndex(
 { "type": 1, "payload.issue.comments": 1, "repo.name": 1 }
)  

The index immediately finds that there are no keys for those bounds:


        executionStats: {
          executionSuccess: true,
          nReturned: 0,
          executionTimeMillis: 1,
          totalKeysExamined: 0,
          totalDocsExamined: 0,

The query may mistakenly expect a "PushEvent" to be an "IssuesEvent," as it looks for "payload.issue" comments. To eliminate this ambiguity, MongoDB offers schema validation (see "Schema Later" considered harmful).
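
As a minimal sketch of such a rule (the constraint below is my own assumption about the dataset, not something the benchmark defines), a match-expression validator can reject documents where "payload.issue" appears on anything other than an "IssuesEvent":

db.runCommand({
  collMod: "github2015",
  validator: {
    $or: [
      { type: "IssuesEvent" },                 // issue payloads are expected on this event type
      { "payload.issue": { $exists: false } }  // any other event type must not carry one
    ]
  },
  validationLevel: "moderate"  // skip validation of existing documents that already violate the rule
})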

Ad-Hoc queries for OLAP

While acceptable performance can be achieved with the right index, it may not be the optimal solution for OLAP use cases. Instead, consider creating a single Atlas Search index to handle all related queries. For further guidance, refer to my previous post: Search Index for Reporting.

Atlas Search Indexes are maintained asynchronously on a dedicated node, ensuring no impact on the operational database. They provide a near-real-time state without the complexity of streaming changes to another database.
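
As a minimal sketch, assuming an Atlas deployment (or MongoDB 7.0+) where the createSearchIndex() shell helper is available, a single dynamically mapped index can serve such ad-hoc reporting queries (the index name and dynamic mapping are defaults, not a tuned definition):

db.github2015.createSearchIndex(
  "default",
  { mappings: { dynamic: true } }  // index all fields dynamically
)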

Another Query: Recent push events by user's commit

Finding recent push events by a user's commits is a relevant use case for this dataset. However, the benchmark did not run any queries on "commits" because it is an array. While PostgreSQL supports JSON, it cannot be directly compared to a document database like MongoDB that handles non-scalar fields natively. If you run the benchmark on PostgreSQL, try this query:

SELECT   
 data->'repo'->>'name' as repo_name,  
 data->'payload'->'commits' as commits_info  
FROM "github2015"  
WHERE   
 data->>'type' = 'PushEvent'  
 AND data->'payload'->'commits' @> '[{"author": {"name": "ggolden@umich.edu"}}]'  
ORDER BY   
 data->>'created_at' DESC  
LIMIT 5;  

You can explore various indexes, such as a GIN index, but you will never find one that directly retrieves the five documents needed for the result. If you do, please correct me and show the execution plan in a comment.

On MongoDB, the same index guideline applies:

  • Equality: "type" (low selectivity) and "payload.commits.author.name" (will be multi-key)
  • Sort: "created_at" (a must for pagination query)
db.github2015.createIndex({  
  "type": 1,  
  "payload.commits.author.name": 1,  
  "created_at": -1  
}) 

The query is simple and doesn't even need an aggregation pipeline:

db.github2015.find({      
  "type": "PushEvent",      
  "payload.commits.author.name": "ggolden@umich.edu"      
}, {  
  "repo.name": 1,  
  "payload.commits.author.name": 1,  
  "payload.commits.message": 1,  
  _id: 0  
}).sort({      
  "created_at": -1      
}).limit(5) 

The execution statistics indicate that only 5 documents have been read, which is the minimum required for the results:

  executionStats: {                                                                                                                                                                                          
    executionSuccess: true,                                                                                                                                                                                  
    nReturned: 5,                                                                                                                                                                                            
    executionTimeMillis: 0,                                                                                                                                                                                  
    totalKeysExamined: 5,                                                                                                                                                                                    
    totalDocsExamined: 5,                                                                                                                                                                                    
    executionStages: {                                                                                                                                                                                       

The execution plan illustrates how a single seek has read five index entries by utilizing a multi-key index. It applies the index bounds for both the equality filters and the sort order, ensuring an efficient retrieval process:

  stage: 'IXSCAN',          
  nReturned: 5,          
  executionTimeMillisEstimate: 0,          
...       
  isMultiKey: true,          
  multiKeyPaths: {          
    type: [],          
    'payload.commits.author.name': [ 'payload.commits' ],          
    created_at: []          
  },          
...                                                                                                                                                                                        
  direction: 'forward',
  indexBounds: {
    type: [ '["PushEvent", "PushEvent"]' ],
    'payload.commits.author.name': [ '["ggolden@umich.edu", "ggolden@umich.edu"]' ],
    created_at: [ '[MaxKey, MinKey]' ]
  },
  keysExamined: 5,
  seeks: 1, 
  dupsTested: 5,

Conclusion

Benchmarks are often biased because vendors optimize tests for their own databases and neglect the others. While their performance claims and time comparisons have little value, looking at the queries and execution plans can serve an educational role by highlighting common design mistakes made with the databases they compare against.

In this post, I applied The ESR (Equality, Sort, Range) Guideline, examined the execution plans, and tested queries relevant to a document model to demonstrate the power of MongoDB for querying JSON data. After replacing two indexes and improving two queries, you should see MongoDB perform significantly faster than PostgreSQL on those four queries.