A curated list of database news from authoritative sources

June 24, 2025

No pre-filtering in pgvector means reduced ANN recall

AI applications are expanding rapidly, and PostgreSQL is a popular choice among relational databases. The pgvector extension, a third-party add-on, enhances PostgreSQL by introducing a high-dimensional vector data type with similarity operations and search indexing.
Integrating embeddings directly into a general-purpose database eliminates the need for a separate vector database. Typically, approximate searches on embeddings are combined with exact searches on other attributes (SQL columns or document fields) such as metadata, dates, or other dimensions.
PostgreSQL offers various index types, but it has notable limitations when combining them, as we have seen in PostgreSQL JSONB Indexing Limitations with B-Tree and GIN. Likewise, pgvector encounters similar issues.
Some users have moved to MongoDB Atlas Vector Search because it offers pre-filtering capabilities: they were getting incomplete results with PostgreSQL pgvector when filtering on other predicates. To better understand the impact of the lack of pre-filtering in such scenarios, I built this simple demo.

Setup PostgreSQL with pgvector

I started a pgvector container:

docker run --name pgv -d -e POSTGRES_PASSWORD=franck pgvector/pgvector:0.8.0-pg17
docker exec -it pgv psql -U postgres

I enable the extension:

create extension if not exists vector;

Importing a synthetic dataset

I create a function to generate a random vector:

create function random_embedding(dimensions int) returns vector as
$$
select
    array(
        select random()::real
        from generate_series(1, dimensions)
    )::vector
$$ language sql;

I create a table to store embeddings ("embedding") with some metadata ("color"):

create table embeddings_table (
    id        bigserial primary key,
    color     text,
    embedding vector(512)
);

I inserted two million rows, each containing a randomly generated 512-dimensional vector, and assigned one of three colors as metadata:

insert into embeddings_table (embedding,color)
 select random_embedding(512)
  ,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;

I used the random() function to ensure data is evenly distributed. The points are positioned in a 512-dimensional space, with one-third of the rows assigned to each color. This is a synthetic dataset, which makes it easier to understand the result.
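As a quick sanity check (my addition, not part of the original demo), the distribution can be verified with a count per color:

```sql
-- Each color should hold roughly one-third of the two million rows.
select color, count(*) from embeddings_table group by color order by color;
```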

I create a vector index, HNSW (Hierarchical Navigable Small Worlds), on the embeddings, using cosine similarity:

create index i1 on embeddings_table 
 using hnsw ( embedding vector_cosine_ops )
;

Query example

I generated one more random vector to use in my queries:

select random_embedding(512)
\gset

postgres=# select :'random_embedding';
           ?column?
---------------------------------------------------------------------
[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]

The cosine similarity search finds the points in the table whose angle from the origin is closest to that of the reference point I've stored in the :'random_embedding' variable.

I want to query the 15 points that are most similar to this reference point, but only consider the green category.

The following query filters on green rows (where color='green'), calculates the cosine similarity (embedding <=> :'random_embedding') and filters the nearest 15 points (order by nn_cosine limit 15):

select id , color, embedding <=> :'random_embedding' nn_cosine
 from embeddings_table
 where color='green'
 order by nn_cosine limit 15;

I used "nn_cosine" for the nearest neighbor cosine search. In future queries, I'll use "enn_cosine" or "ann_cosine" depending on whether I expect an exact or approximate result, from a full table scan or an index scan.

Embeddings have too many dimensions for us to visualize easily, but here's an analogy in our three-dimensional world. My dataset is like a soft ball pool with red, green, and blue balls, where each ball's position represents the meaning of the data. Cosine similarity search is akin to pointing a laser from the center of the pool to a reference point, which corresponds to the meaning we are looking for, and identifying balls whose positions form the smallest angles with the laser ray. Post-filtering searches all balls, then discards red and blue balls afterward. Pre-filtering considers only green balls when searching around the laser ray.

Exact Nearest Neighbors (ENN) with full scan

First, I disable the index to get an exact result:

postgres=# set enable_indexscan to off;
SET                                                                                                                                                                                                          

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
 from embeddings_table
 where color='green'
 order by enn_cosine limit 15
;

   id    | color |     enn_cosine      
---------+-------+---------------------
 1428352 | green | 0.19814620075833056
  328933 | green |  0.2024464516951111
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
 1328614 | green | 0.20373734017866685
  327802 | green | 0.20464172025637872
  238738 | green |  0.2048113256211399
  969943 | green | 0.20566046923407266
 1924495 | green |  0.2059847615560182
  486043 | green | 0.20615577737388402
 1911601 | green | 0.20652312839386933
 1777339 | green | 0.20658742123960594
  875029 | green | 0.20664456413189736
(15 rows)

Without an index, this query is slow because it computes the distance for every row that satisfies the 'color' predicate, sorts them by this distance, and retrieves the Top-15. The advantage is that the result is exact:

postgres=# explain (analyze, buffers, costs off, summary on)                                                                                                                                                  
 select id , color, embedding <=> :'random_embedding' enn_cosine                                                                                                                                              
 from embeddings_table                                                                                                                                                                                       
 where color='green'                                                                                                                                                                                         
 order by enn_cosine limit 15
;                                                                                                                                                                               
                            QUERY PLAN
-------------------------------------------------------------------                                                                                                                 
 Limit (actual time=1868.024..1878.636 rows=15 loops=1)
   Buffers: shared hit=1989174 read=692354
   ->  Gather Merge (actual time=1868.022..1878.632 rows=15 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1989174 read=692354
         ->  Sort (actual time=1853.062..1853.063 rows=13 loops=3)
               Sort Key: ((embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector))
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=1989174 read=692354
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on embeddings_table (actual time=0.126..1797.436 rows=222222 loops=3)
                     Filter: (color = 'green'::text)
                     Rows Removed by Filter: 444444
                     Buffers: shared hit=1989107 read=692347
 Planning:
   Buffers: shared read=1
 Planning Time: 0.124 ms
 Execution Time: 1878.658 ms

PostgreSQL ran the scan with three processes (two parallel workers plus the leader). The filter on "color" kept 222222 rows per process, 666666 in total, and removed 444444 rows per process, two-thirds of the rows. Each process computed the distance for its rows and performed a top-N sort. Finally, the leader gathered and merged the Top-15 from the worker processes.

Approximate Nearest Neighbors (ANN) with index

I enable the index to get a faster, but approximate, result:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET                                                                                                                                               

postgres=# explain (analyze, buffers, costs off, summary on)                                                                                                                                                  
 select id , color, embedding <=> :'random_embedding' ann_cosine                                                                                                                                              
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15
;                                                                                                                                                                               
                            QUERY PLAN
-------------------------------------------------------------------    
 Limit (actual time=5.605..5.916 rows=11 loops=1)
   Buffers: shared hit=84 read=1470
   ->  Index Scan using i1 on embeddings_table (actual time=5.604..5.912 rows=11 loops=1)
         Order By: (embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector)
         Filter: (color = 'green'::text)
         Rows Removed by Filter: 29
         Buffers: shared hit=84 read=1470
 Planning:
   Buffers: shared read=1
 Planning Time: 0.089 ms
 Execution Time: 5.934 ms

The index was used to retrieve rows in cosine similarity order relative to my reference point (Order By: (embedding <=> '[...]'::vector)), but the scan was limited to 40 candidates (the default hnsw.ef_search). 29 of them were discarded by the metadata filter ((color = 'green'::text)), leaving 11 rows (rows=11). Because of this, there were not enough candidates for the expected result (limit 15) and I got fewer rows than expected:

postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15
;          
   id    | color |     ann_cosine      
---------+-------+---------------------
 1875277 | green |  0.2076671534464677
  222817 | green | 0.21016644773554916
 1375990 | green |  0.2118530398002575
   21538 | green | 0.21207386707694031
 1355350 | green |  0.2121940467579876
  505036 | green | 0.21220934429072225
 1570702 | green | 0.21469847813732257
 1997836 | green | 0.21482420378988654
 1195270 | green | 0.21613844835346685
  634417 | green |  0.2172001587963871
 1995160 | green | 0.21794015870874028
(11 rows)

I used the default hnsw.iterative_scan, which is set to off. Enabling iterative scans makes pgvector continue scanning the index until there are enough candidates to satisfy limit 15.
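As a sketch (assuming pgvector 0.8.0, which introduced these parameters), enabling iterative scans looks like this:

```sql
-- Let the scan resume until enough post-filter candidates are found.
-- 'strict_order' keeps exact distance order; 'relaxed_order' is faster
-- but may return results slightly out of order.
set hnsw.iterative_scan = relaxed_order;
-- Cap on how many tuples an iterative scan may visit (default 20000):
set hnsw.max_scan_tuples = 20000;

select id, color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15;
```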

Compare ENN and ANN

The nearest green neighbor found by the index has a cosine similarity of 0.2076671534464677, but the exact search identified fifteen closer green points. This discrepancy highlights a characteristic of Hierarchical Navigable Small Worlds (HNSW). The index is constructed with multiple layers, beginning with sparse upper layers that serve as entry points. These layers help restrict the search scope as it descends into the dense bottom layer (layer 0). While this layered design accelerates search times, it can also miss closer neighbors in layer 0 that were not explored when descending from the upper layers.
Since all points in my dataset were generated with random() values, it lacks natural clustering. This diminishes routing efficiency and increases the likelihood of missing closer neighbors during approximate searches, so I am illustrating the worst-case scenario.

I checked how many results the approximate search missed by disabling the pgvector index. 23 closer green rows were missed: the nearest neighbor found by the index only appears as the 24th row of the exact result:

postgres=# set enable_indexscan to off;                                                                                                                                                                      
SET                                                                                                                                               

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine                                                                                                                                              
 from embeddings_table
   where color='green' 
   and embedding <=> :'random_embedding' <= 0.2076671534464677
 order by enn_cosine
;   

   id    | color |     enn_cosine      
---------+-------+---------------------
 1428352 | green | 0.19814620075833056
  328933 | green |  0.2024464516951111
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
 1328614 | green | 0.20373734017866685
  327802 | green | 0.20464172025637872
  238738 | green |  0.2048113256211399
  969943 | green | 0.20566046923407266
 1924495 | green |  0.2059847615560182
  486043 | green | 0.20615577737388402
 1911601 | green | 0.20652312839386933
 1777339 | green | 0.20658742123960594
  875029 | green | 0.20664456413189736
  593119 | green |  0.2066683273490607
 1354432 | green | 0.20685417261064953
 1898782 | green | 0.20697419915308368
 1429552 | green | 0.20704169544999784
 1293397 | green | 0.20746811422822542
 1371502 | green | 0.20746937923342468
  998884 | green |  0.2074836628885286
  845659 | green | 0.20759016691317878
 1875277 | green |  0.2076671534464677
(24 rows)

The measure of this approximation's accuracy is called 'recall'. The definition from the MongoDB glossary is:

Recall measures the fraction of true nearest neighbors that were returned by an ANN search. This measure reflects how close the algorithm approximates the results of ENN search.

This approximation applies to all index searches, which are faster than full scans but may miss some closer neighbors. Post-filtering reduces recall even further, as some candidates are discarded, leading to the possibility of missing good matches. That's why it is better to use pre-filtering on large databases. Although pre-filtering is not available in PostgreSQL pgvector, we can analyze the data to understand its potential impact.

Post-filtering impact on recall

The problem is that pgvector cannot apply the filter inside the index scan. As a result, the scan stops after 40 candidates, as defined by the hnsw.ef_search parameter, and filtering on other columns, like "color", then reduces the result.
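One partial mitigation (a sketch tuned to this dataset's one-third selectivity) is to raise the candidate count so that enough rows survive the post-filter:

```sql
-- With ~1/3 of rows green, about 3x more candidates are needed on average
-- to keep 15 green rows; 120 leaves some margin (hnsw.ef_search max is 1000).
set hnsw.ef_search = 120;

select id, color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15;
```

This trades speed for recall and must be re-tuned whenever the filter's selectivity changes.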

To explain this, I've run the query without the filter, showing the first 40 candidates of the three colors:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 -- where color='green'
 order by ann_cosine limit 40 -- current_setting('hnsw.ef_search')
;
   id    | color |     ann_cosine
---------+-------+---------------------
 1360308 | red   | 0.20296796169334463
 1732580 | blue  | 0.20459117837053364
 1085082 | red   | 0.20626595570441808
 1875277 | green |  0.2076671534464677
  656874 | red   | 0.20930762441475093
  504242 | blue  | 0.20984078446453025
  247001 | blue  | 0.20995935279258404
  222817 | green | 0.21016644773554916
  705833 | blue  |  0.2102792157006329
 1966310 | blue  | 0.21028852384517327
 1503321 | red   | 0.21044019511424406
  480572 | blue  | 0.21100294080666748
 1375990 | green |  0.2118530398002575
   21538 | green | 0.21207386707694031
 1355350 | green |  0.2121940467579876
  505036 | green | 0.21220934429072225
  106983 | red   | 0.21293893184210688
  238458 | red   | 0.21295064471740388
 1008380 | blue  | 0.21301481665902566
  298931 | blue  | 0.21304336639331967
 1147263 | red   | 0.21342607115241874
  562604 | blue  |  0.2135493812411281
 1570702 | green | 0.21469847813732257
 1997836 | green | 0.21482420378988654
 1291274 | blue  |  0.2159207514557735
 1195270 | green | 0.21613844835346685
 1035932 | blue  | 0.21623180106532514
 1010237 | blue  |  0.2164365008134519
 1256966 | blue  | 0.21652825716033564
 1748231 | blue  | 0.21676377376711842
  634417 | green |  0.2172001587963871
 1685391 | red   | 0.21723542532805584
 1964021 | blue  | 0.21723845625858207
 1056446 | blue  | 0.21757530726298147
  958670 | blue  | 0.21769898462687431
 1558046 | blue  |  0.2177076235462454
  516734 | blue  | 0.21777311307937175
 1995160 | green | 0.21794015870874028
  228096 | red   | 0.21866579506700412
  660161 | blue  |  0.2187276449697918
(40 rows)

This is what the query using the index did as a first step. Keeping only the green rows leaves 11. Had the filter been applied during the index scan, it would have returned 40 green candidates and the query could have returned the Top-15 from them. Post-filtering misses good green candidates because non-matching rows took their places among the 40.

It would have been better to pre-filter during the approximate nearest neighbor search to consider only 'green' neighbors, rather than discarding non-matching ones afterward based on a post-filter on 'color'. However, pgvector does not support such pre-filtering. The consequence is a low recall in a single index scan.
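When the filter values are few and known in advance, like the three colors here, a partial HNSW index can approximate pre-filtering, since the index then contains only matching rows (a sketch; it implies maintaining one index per value):

```sql
-- Index only green rows; an ANN scan of this index yields green candidates only.
create index i1_green on embeddings_table
 using hnsw ( embedding vector_cosine_ops )
 where color = 'green';
```

The planner can choose this index for the demo query because its where color='green' clause matches the index predicate.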

Without filter - ENN and ANN

For queries without a filter, recall is generally better because, although approximate searches may miss some points, all candidates returned by the index are included in the results. Here are the Top 15 most similar points across all colors:

postgres=# set enable_indexscan to off;                                                                                                                                                                      
SET    

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
 from embeddings_table
 --where color='green'
 order by enn_cosine limit 15;

   id    | color |     enn_cosine      
---------+-------+---------------------
 1506704 | blue  |  0.1943345774574703
 1428352 | green | 0.19814620075833056
  905583 | red   |  0.1986930398354949
 1887345 | red   | 0.19958922153843262
 1408551 | red   | 0.20155542317891084
 1761962 | blue  | 0.20168765608150285
   91001 | blue  | 0.20206633541960917
  328933 | green |  0.2024464516951111
  493388 | blue  | 0.20277316748365937
 1360308 | red   | 0.20296796169334463
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
  816615 | red   | 0.20350817237259144
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
(15 rows)

Here is the same using an index:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET    

postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 --where color='green'
 order by ann_cosine limit 15;

   id    | color |     ann_cosine      
---------+-------+---------------------
 1360308 | red   | 0.20296796169334463
 1732580 | blue  | 0.20459117837053364
 1085082 | red   | 0.20626595570441808
 1875277 | green |  0.2076671534464677
  
by Franck Pachot

Open Wins Again II: From Fork to Force

Forks aren’t just a technical event. We can look at them as philosophical ruptures, and the open source world isn’t the first to face this. Crypto communities have lived and breathed this reality for over a decade, often in the most chaotic, transparent, and high-stakes ways possible. Now, as Valkey rises to become the de […]

June 23, 2025

Migrate a self-managed MySQL database to Amazon Aurora MySQL using AWS DMS homogeneous data migrations

In this post, we provide a comprehensive, step-by-step guide for migrating an on-premises self-managed encrypted MySQL database to Amazon Aurora MySQL using AWS DMS homogeneous data migrations over a private network. We show a complete end-to-end example of setting up and executing an AWS DMS homogeneous migration, consolidating all necessary configuration steps and best practices.

June 22, 2025

faster $lookup after $group in MongoDB aggregation pipeline

In the previous post, One million $lookup challenge, I said that $lookup is efficient at the end of an aggregation pipeline, not before the aggregation, referencing another post, Comparison of JOINS 👉🏻 aggregation pipeline and CTEs, where I simply inverted the join ($lookup) and the aggregation ($group) to fix a bad query. What if you have to look up from millions of documents without an aggregation, as in the One million $lookup challenge?

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "dimid" in the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

A many-to-one relationship should embed values directly, eliminating the need for lookups within a document model. However, I diverged from this recommendation to build a demo illustrating that lookups are acceptable when dealing with a limited number of documents, such as after aggregation.

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dim" ,
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, the operation completes in ten seconds. While the Index Nested Loop Join on one million documents with a scalar "ref" takes some time, performance can be improved by avoiding multiple lookups for the same value.

$group , $lookup , $unwind : 3 seconds

I group by the "ref" values prior to the join to minimize lookup operations, ensuring each value is processed only once:

x=db.fact.aggregate([
  { // one group per "ref" with an array of "facts" 
    $group: {
      _id: "$ref",
      facts: { $push: "$$ROOT" },
    }
  },
  { // join from the group
    $lookup: {
      from:          "dim",
      localField:    "_id",
      foreignField:  "_id",
      as:            "dim"
    }
  },
  { // un-group back to one document per fact
    $unwind: "$facts"
  },
  { // project to original fields
    $project: {
      _id:           "$facts._id",
      ref:           "$facts.ref",
      value:         "$facts.value",
      dim:           "$dim",
    }
  }
]).explain("executionStats");
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

This takes three seconds to run. With one lookup per each of the thousand groups instead of one per million documents, the lookup stage becomes much faster. Although grouping and unwinding add some time, if grouping is already present, as is typical in an aggregation pipeline, performing the lookup after grouping is simply more efficient than doing it beforehand.


June 21, 2025

June 20, 2025

Implement a rollback strategy for Amazon Aurora PostgreSQL upgrades using Amazon RDS Blue/Green deployments

Amazon Aurora PostgreSQL-Compatible Edition supports managed blue/green deployments to help reduce downtime and minimize risk during updates. Even with thorough planning and testing in non-production environments, unexpected issues can emerge after a version upgrade. In these cases, having a rollback plan is essential to quickly restore service stability. While the managed Blue/Green deployment feature doesn’t currently include built-in rollback functionality, you can implement alternative solutions for version management. In this post, we show how you can manually set up a rollback cluster using self-managed logical replication to maintain synchronization with the newer version after an Amazon RDS Blue/Green deployment switchover.

MySQL Orchestrator Failover Behavior During Replication Lag

Managing farms of MySQL servers under a replication environment is very efficient with the help of a MySQL orchestrator tool. This ensures a smooth transition happens when there is any ad hoc failover or a planned/graceful switchover comes into action. Several configuration parameters play a crucial role in controlling and influencing failover behavior. In this […]

June 19, 2025

How an AWS customer in the learning services industry migrated and modernized SAP ASE to Amazon Aurora PostgreSQL

In this post, we explore how a leading AWS customer in the learning services industry successfully modernized its legacy SAP ASE environment by migrating to Amazon Aurora PostgreSQL-Compatible Edition. Partnering with AWS, the customer engineered a comprehensive migration strategy to transition from a proprietary system to an open source database while providing high availability, performance optimization, and cost-efficiency.

One million $lookup challenge

If you have read my previous post, $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values from thousands of documents. $lookup is useful at the end of an aggregation pipeline, not before the aggregation (examples in Comparison of JOINS 👉🏻 aggregation pipeline and CTEs) from a million-document collection. However, such a collection should not require a join, as documents are designed to aggregate multiple related objects, unlike relational databases that normalize business data into multiple tables.

One million $lookup challenge

If you have read my previous post, $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values from thousands of documents. $lookup is useful at the end of an aggregation pipeline, after the pipeline has reduced the number of documents, not before the aggregation on a million-document collection (examples in Comparison of JOINS 👉🏻 aggregation pipeline and CTEs). However, such a collection should not require a join at all, as documents are designed to aggregate multiple related objects, unlike relational databases, which normalize business data into multiple tables.

In a many-to-one relationship, it is common in a document model to embed fields even when they are duplicated. Normalization plays a crucial role in relational databases to prevent such duplicates, because RDBMS were designed for interactive users executing ad-hoc SQL statements, where a missed update leads to data integrity issues. Triggers can help maintain duplicated values and prevent anomalies, but they introduce new challenges because they operate behind the update statement.

When updates originate from well-reviewed and tested programs, it is manageable to modify data in multiple locations, particularly when such updates are infrequent. Let's illustrate joins and the absence of joins with a simple test.

To join multiple documents with a small lookup table, you can cache the lookup table in your application. In this post, I tested several methods for retrieving a value from a lookup table: using a collection, a map, and an array. I integrated these methods into an aggregation pipeline, but keep in mind that this can also be accomplished within the application itself.
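As a minimal sketch of that application-side caching (plain JavaScript over in-memory arrays, with hypothetical helper names; in a real application the two arrays would come from driver queries on "dim" and "fact"):

```javascript
// Cache a small lookup table in the application and enrich fact
// documents client-side, instead of running a server-side $lookup.
function buildDimCache(dimDocs) {
    // a Map gives O(1) average-time lookup by _id
    return new Map(dimDocs.map(d => [d._id, d.value]));
}

function enrichFacts(factDocs, dimCache) {
    // unmatched references get null, like an empty $lookup result
    return factDocs.map(f => ({ ...f, dimValue: dimCache.get(f.ref) ?? null }));
}

const dim = [{ _id: 1, value: 0.25 }, { _id: 2, value: 0.75 }];
const facts = [{ ref: 2, value: 0.5 }, { ref: 9, value: 0.1 }];
const enriched = enrichFacts(facts, buildDimCache(dim));
// enriched[0].dimValue is 0.75; the unmatched ref 9 yields null
```

The aggregation-pipeline variants tested below try to approximate this kind of constant-time lookup on the server side.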

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData"
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in ten seconds. The query planner chooses an index nested-loop join because an index exists on the foreign field (the "_id" of "dim"). Without an index, it could use a hash join.

Map to object and $getField: 61 seconds

To avoid the lookup, I read the dimension table into an object with one field per value, the field name being the "_id", and I get the value with $getField:


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )


x=db.fact.aggregate([
    {  
        $addFields: {  
            dimValue: {  
                $getField: {  
                    field: { $toString: "$ref" },  
                    input: dimMap  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in one minute. Accessing a field by name in an aggregation expression is not an optimal operation, it is O(n), so this is a viable solution only for a very small lookup table.
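A rough in-memory model of why this matters (plain JavaScript, my own function names): the measurements suggest $getField over a large map-object behaves linearly, consistent with scanning the object's fields in order, while array access by position is constant time. Native JS object access is hashed, so the linear scan is written out explicitly here to mimic the server-side behavior:

```javascript
// Model of a linear field scan, as when an expression resolves a
// field by name inside a large document
function getFieldLinear(obj, name) {
    for (const [k, v] of Object.entries(obj)) {   // O(n) scan
        if (k === name) return v;
    }
    return null;
}

const dimMap = {};
for (let i = 1; i <= 1000; i++) dimMap[i] = i / 1000;

// Same value, very different costs at scale:
const viaScan  = getFieldLinear(dimMap, "500");          // models $getField
const viaIndex = [null, ...Object.values(dimMap)][500];  // models $arrayElemAt
// both are 0.5, but the scan touches up to 1000 entries
```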

Map to $switch branches: 23 seconds

Instead of using that map, I build a $switch expression to use in the aggregation pipeline.


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )

const switchBranches = Object.entries(dimMap).map(([id, value]) => ({  
    case: { $eq: ["$ref", parseInt(id)] },  
    then: value  
}));  
print( switchBranches )

x=db.fact.aggregate([  
    {  
        $addFields: {  
            dimValue: {  
                $switch: {  
                    branches: switchBranches,  
                    default: null  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in about twenty seconds. Given that it inlines the logic into the query, it is acceptable only for small lookup tables.

Map to array and $arrayElemAt: 1 second

Instead of a map, I use an array where the index is the "_id". As I have no guarantee that the "_id" values are sequential with no gaps, I build a sparse array that I fill with the existing values.


// Get the maximum ID
const maxId = db.dim.aggregate([
 {$group:{_id:null,max:{$max:"$_id"}}}
]).toArray()[0].max;  
// Create a sparse array for all values
const dimValues = new Array(maxId + 1).fill(null);  
// store the values at the right ID
db.dim.find({},{_id:1,value:1}).forEach(
 d => dimValues[d._id] = d.value
);  
print(dimValues)

//
x=db.fact.aggregate([  
    { $addFields: { dimValue: { $arrayElemAt: [dimValues, "$ref"] } } }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

This is fast and runs in one second. However, it works only when you control the lookup identifiers, ideally starting from one in a gap-free sequence.

Embed rather than join (denormalization)

Finally, as recommended for a document model (Model One-to-Many Relationships with Embedded Documents), I duplicate the dimension value into each fact document. I run this update with an aggregation pipeline:


const startTime = new Date();
db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData"
        }
    },
    {
        // extract the single matched value into a scalar "dimValue" field
        $set: { dimValue: { $first: "$dimData.value" } }
    },
    {
        $unset: "dimData"
    },
    {
        $out: "fact"
    }
])

const endTime = new Date();
const executionTime = (endTime - startTime) / 1000;
print(`Update execution time: ${executionTime} seconds`);

This should be executed once; afterwards, only updated dimension values need to be synchronized. This update took 16 seconds on my data.
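A sketch of that synchronization, modeled in memory with plain JavaScript (the calls in the comment are the rough mongosh equivalent; field names are the ones used in this post):

```javascript
// When a dimension value changes, the source of truth and every
// embedded copy must be updated. In mongosh this would be roughly:
//   db.dim.updateOne({ _id: id }, { $set: { value: newValue } });
//   db.fact.updateMany({ ref: id }, { $set: { dimValue: newValue } });
function syncDimValue(dim, facts, id, newValue) {
    const d = dim.find(x => x._id === id);
    if (d) d.value = newValue;                     // update the dimension
    for (const f of facts) {
        if (f.ref === id) f.dimValue = newValue;   // sync embedded copies only
    }
}

const dim = [{ _id: 1, value: 0.2 }];
const facts = [{ ref: 1, dimValue: 0.2 }, { ref: 2, dimValue: 0.9 }];
syncDimValue(dim, facts, 1, 0.5);
// facts[0].dimValue is now 0.5; facts[1] is untouched
```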

To compare, I can simply read the document and project the embedded value:

x=db.fact.aggregate([  
    {  
        $project: {  
            _id: 1,  
            ref: 1,  
            dimValue: 1,  // Simply project the pre-computed field  
            // Add any other fact fields you need  
            someFactField: 1  
        }  
    }  
]).explain("executionStats");  
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

This query takes around 0.5 seconds on my data. Embedding is advisable unless the lookup table is frequently updated. Additionally, in MongoDB, a single compound index can cover all fields within a document. Typically, when filtering on a dimension, lookup, or reference table, the filter applies to a business field rather than the internal "_id".

Conclusion

I have tested my example using various cardinalities for both the fact table and the dimension lookup table. Below is the raw data.

  dim      fact     lookup  $getField   $switch  $arrayElemAt    update  single doc
   10      1000     0.008s     0.002s    0.001s        0.001s     0.08s          0s
  100      1000     0.008s     0.006s    0.005s        0.001s    0.078s          0s
 1000      1000     0.011s     0.062s    0.033s        0.001s    0.082s          0s
10000      1000     0.013s     0.754s    0.067s        0.003s     0.08s          0s
   10     10000     0.075s     0.021s    0.016s        0.012s    0.199s      0.005s
  100     10000     0.078s     0.066s    0.055s        0.013s    0.191s      0.005s
 1000     10000     0.105s      0.62s    0.292s        0.013s    0.229s      0.005s
10000     10000     0.104s      6.94s    0.305s        0.015s    0.237s      0.005s
   10    100000     0.738s     0.215s    0.171s        0.129s    1.306s      0.052s
  100    100000     0.781s     0.673s    0.571s        0.131s    1.359s      0.052s
 1000    100000     1.044s     6.259s     2.71s        0.141s    1.756s      0.054s
10000    100000     1.068s    73.205s    2.702s        0.144s    1.769s      0.059s
   10   1000000     7.583s     2.199s    1.761s        1.332s   12.524s      0.559s
  100   1000000     7.992s     6.634s    5.741s        1.346s    13.03s      0.557s
 1000   1000000    10.551s    62.385s     26.4s        1.398s   16.771s      0.557s
10000   1000000    10.794s   742.086s   26.039s        1.437s   17.008s      0.578s
   10  10000000    76.225s    22.127s   17.795s       13.196s  124.922s      5.789s
  100  10000000    80.828s    67.602s   57.981s       13.695s  131.738s      5.714s
 1000  10000000   106.194s   622.382s  267.555s       14.054s  168.854s      5.778s
10000  10000000   107.211s  7351.675s  265.404s       14.046s   171.13s      5.767s

An array queried with $arrayElemAt gives constant-time access by position, while the other data structures tested here behave as O(n). However, the array is fixed when the query is built, which limits its flexibility compared to a table or collection. You may find more suitable structures, such as hash maps, in your application language; they resemble the hash tables that SQL databases build for hash joins. MongoDB can use a hash join for $lookup when the lookup collection is small, spilling to disk is permissible, and there is no index.

When the lookup table is infrequently updated, applying updates to the embedded values is generally preferable: you pay the price once at write time and get faster reads. MongoDB gives developers control over data access patterns and cardinalities, rather than relying solely on the query planner, which can lead to plan instability and runaway queries. In contrast, SQL databases must perform all such optimizations in the query planner to honor Codd's rules on data independence for relational databases.

A key distinction between MongoDB and SQL databases, including those that expose a MongoDB API on top of an RDBMS, is their physical data model capabilities. RDBMS systems prioritize normalization and use efficient join algorithms for relational data models. In contrast, MongoDB provides flexible schemas for application objects and supports joins where the join key can be an array ($lookup: more than just a SQL join) as part of an aggregation pipeline. While this may be less efficient for simple many-to-one relationships with scalar values, MongoDB's document data model can often eliminate the need for joins altogether. Additionally, caching lookup values in the application is a viable option.

Looking Ahead: A Confident New Chapter for Percona

Percona has always been more than just a company—it’s a mission-driven community built on a strong foundation of innovation, resilience, and open source excellence. For nearly two decades, Percona has been a trusted partner to organizations around the world, remaining steadfast in our mission while continuing to grow, improve, and evolve. Today, we mark the […]

June 17, 2025

$lookup: more than just a SQL join

When transitioning from a SQL background to MongoDB, the $lookup operation in an aggregation pipeline resembles a LEFT OUTER JOIN. For instance, the following MongoDB query:

        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }

is often compared to the following SQL statement

        SELECT *  
        FROM xxa  
        LEFT JOIN xxb
        ON xxa.xxa1 = xxb.xxb1;  

If you assume that these two operations are identical across various databases, you may misinterpret the expected outcomes and their performance:

  1. In MongoDB, any field can be a scalar or an array. The $lookup stage matches on the intersection of two sets, unlike SQL's equality operator (=), which compares two scalar values. It is closer to PostgreSQL's && (overlap) operator for arrays.
  2. The results in MongoDB do not duplicate outer rows for each inner row as in SQL. Instead, an array of matching inner documents is projected to the outer document. If an $unwind stage follows, the $lookup will then unnest this array into multiple documents.
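The two points above can be sketched as a small in-memory model in plain JavaScript (my own function names, not MongoDB's implementation): treat both join fields as sets and keep the inner documents whose key set intersects the outer one.

```javascript
// Scalars are treated as one-element sets, which also reproduces
// the plain equality case
const asSet = v => new Set(Array.isArray(v) ? v : [v]);

function lookup(outer, inner, localField, foreignField, as) {
    return outer.map(o => {
        const keys = asSet(o[localField]);
        // match when the intersection of the two key sets is non-empty
        const matches = inner.filter(i =>
            [...asSet(i[foreignField])].some(k => keys.has(k)));
        return { ...o, [as]: matches };  // empty array when nothing matches
    });
}

const xxa = [{ xxa1: 101 }, { xxa1: 102 }];
const xxb = [{ xxb1: 102 }, { xxb1: 103 }];
const res = lookup(xxa, xxb, "xxa1", "xxb1", "xxb");
// res[0].xxb is [] ; res[1].xxb is [ { xxb1: 102 } ]
```

The same function handles the array cases shown in the following sections, since an array field simply becomes a multi-element key set.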

To illustrate the semantics, consider this simple example: I begin with scalars in two collections ("xxa" and "xxb") and then use arrays in their fields instead of scalars.

Scalar on the outer and inner collections

I define two small collections with some values in common and some which do not match:

db.xxa.drop();
db.xxa.insertMany([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insertMany([
  { xxb1: 102 },
  { xxb1: 103 },
]);

The following query "joins" the two with a $lookup:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  { _id: ObjectId('68504d1df99599b54cd4b118'), xxa1: 101, xxb: [] },
  {
    _id: ObjectId('68504d1df99599b54cd4b119'),
    xxa1: 102,
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  }
]

For each document from the outer collection, an array is created. If there is no matching document in the inner collection, the array remains empty. If there is one matching document, it will be included in the array. This looks like a LEFT OUTER JOIN in SQL that returns a structured one-to-many rather than a tabular result with duplicated values. However, this is a special case where we are joining on scalar fields.

Array on the outer, scalar in the inner

I replace the outer collection with some arrays:

db.xxa.drop();
db.xxa.insertMany([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d43f99599b54cd4b11c'),
    xxa1: [ 100, 101 ],
    xxb: []
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11d'),
    xxa1: [ 101, 102 ],
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11e'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 },
      { _id: ObjectId('68504d26f99599b54cd4b11b'), xxb1: 103 }
    ]
  }
]

The semantics are similar, except that a document from the inner collection matches as soon as its value exists in the outer array. When it is not contained, the added array remains empty. When multiple values from the outer document match documents from the inner collection, all of those documents are added to the array in the result.

The array in the outer collection acts as a one-to-many reference. In SQL, such a reference cannot be stored on the "one" side; it is transformed into a many-to-one reference on the opposite side, and an index is created on the foreign key to allow navigating in the other direction.

Scalar on the outer, array in the inner

I do the opposite:

db.xxa.drop();
db.xxa.insertMany([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insertMany([
  { xxb1: [100,101] },
  { xxb1: [101,102] },
  { xxb1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d5ef99599b54cd4b11f'),
    xxa1: 101,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d5ef99599b54cd4b120'),
    xxa1: 102,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

A document from the inner collection matches as soon as one of its array values equals the scalar value from the outer document. If multiple inner documents match, they all appear in the resulting array.

Arrays on both sides

The general case can handle arrays on both sides:

db.xxa.drop();
db.xxa.insertMany([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d7ef99599b54cd4b124'),
    xxa1: [ 100, 101 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b125'),
    xxa1: [ 101, 102 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b126'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

In this context, 'matching' refers to the intersection between the set of keys from the outer document and the set of keys found in an inner document.

Representing the same in a SQL database requires much more than two tables, as it needs additional junction tables in place of the arrays:

-- Main entities  
CREATE TABLE xxa (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

CREATE TABLE xxb (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

-- Junction tables to represent the arrays  
CREATE TABLE xxa_values (  
    xxa_id INT REFERENCES xxa(id),  
    value INT,  
    PRIMARY KEY (xxa_id, value)  
);  

CREATE TABLE xxb_values (  
    xxb_id INT REFERENCES xxb(id),  
    value INT,  
    PRIMARY KEY (xxb_id, value)  
);  

Joining all tables and deduplicating the results serves as the equivalent of a MongoDB lookup:

SELECT   
    xxa.id as xxa_id,  
    ARRAY_AGG(DISTINCT xxa_vals.value) as xxa1,  
    ARRAY_AGG(  
        DISTINCT jsonb_build_object(  
            'id', xxb.id,  
            'xxb1', ARRAY(  
                SELECT value   
                FROM xxb_values   
                WHERE xxb_id = xxb.id   
                ORDER BY value  
            )  
        )  
    ) FILTER (WHERE xxb.id IS NOT NULL) as xxb  
FROM xxa  
LEFT JOIN xxa_values xxa_vals ON xxa.id = xxa_vals.xxa_id  
LEFT JOIN xxb_values xxb_vals ON xxa_vals.value = xxb_vals.value  
LEFT JOIN xxb ON xxb_vals.xxb_id = xxb.id  
GROUP BY xxa.id  
ORDER BY xxa.id;  

The next time you hear that a lookup in MongoDB is the same as a SQL join and you don't understand the result or the performance, remember that it is a different, higher-level operation on a flexible schema.
A simple left outer join in SQL often disappears in MongoDB, as the interconnected objects belong to a single document, like an aggregate in Domain-Driven Design.

Conclusion

I used abstract names for collections and fields in order to explain the behavior, but many modern applications benefit from a small array, where an additional association table would complicate things unnecessarily. In contemporary apps, users are identified by one or more email addresses, resources by tags, blogs by categories, and videos by interests, and many queries look for common interests, categories, or tags.

$lookup: more than just a SQL join

When transitioning from a SQL background to MongoDB, the $lookup operation in an aggregation pipeline resembles a LEFT OUTER JOIN. For instance, the following MongoDB query:

        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }

is often compared to the following SQL statement

        SELECT *  
        FROM xxa  
        LEFT JOIN xxb
        ON xxa.xxa1 = xxb.xxb1;  

If you assume that these two operations are identical across various databases, you may misinterpret the expected outcomes and their performance:

  1. In MongoDB, any field can be a scalar or an array. The $lookup function operates as an intersection of two sets, unlike SQL's equality operator (=) which operates on two values. It is more similar to PostgreSQL's && operator for arrays.
  2. The results in MongoDB do not duplicate outer rows for each inner row as in SQL. Instead, an array of matching inner documents is projected to the outer document. If an $unwind stage follows, the $lookup will then unnest this array into multiple documents.

To illustrate the semantics, consider this simple example: I will begin with scalars from two collections ("xxa" and "xxb") and then used arrays in their fields instead of scalars.

Scalar on the outer and inner collections

I define two small collections with some values in common and some which do not match:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: 102 },
  { xxb1: 103 },
]);

The following query "joins" the two with a $lookup:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  { _id: ObjectId('68504d1df99599b54cd4b118'), xxa1: 101, xxb: [] },
  {
    _id: ObjectId('68504d1df99599b54cd4b119'),
    xxa1: 102,
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  }
]

For each document from the outer collection, an array is created. If there is no matching document in the inner collection, the array remains empty. If there is one matching document, it will be included in the array. This looks like a LEFT OUTER JOIN in SQL that returns a structured one-to-many rather than a tabular result with duplicated values. However, this is a special case where we are joining on scalar fields.

Array on the outer, scalar in the inner

I replace the outer collection with some arrays:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d43f99599b54cd4b11c'),
    xxa1: [ 100, 101 ],
    xxb: []
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11d'),
    xxa1: [ 101, 102 ],
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11e'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 },
      { _id: ObjectId('68504d26f99599b54cd4b11b'), xxb1: 103 }
    ]
  }
]

The semantics are similar, except that a document from the inner collection matches as soon as its value exists in the outer array. When it is not contained, the added array is empty. When multiple values from the outer collection find a matching document from the inner collection, those multiple documents are added to the array in the result.

The array in the outer collection acts as a one-to-many reference. In SQL, one-to-many relationships cannot exist and are transformed into many-to-one on the opposite side, and an index is created on the foreign key to allow navigating in the other way.

Scalar on the outer, array in the inner

I do the opposite:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: [100,101] },
  { xxb1: [101,102] },
  { xxb1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d5ef99599b54cd4b11f'),
    xxa1: 101,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d5ef99599b54cd4b120'),
    xxa1: 102,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

A document from the inner collection matches as soon as one of its values equals a value from the inner collection. If multiple inner documents match, they will all appear in the resulting document array.

Arrays on both side

The general case can handle arrays on both sides:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d7ef99599b54cd4b124'),
    xxa1: [ 100, 101 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b125'),
    xxa1: [ 101, 102 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b126'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

In this context, a document 'matches' when the intersection between the set of values in the outer document and the set of values in the inner document is not empty.
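The general rule covers all three cases: treat a scalar as a one-element array, then match when the two value sets intersect. A plain-JavaScript sketch of this rule (assuming primitive values, compared with strict equality):

```javascript
// General $lookup equality matching on possibly-array fields:
// coerce scalars to one-element arrays, then match when the
// two value sets have a non-empty intersection.
function lookupMatch(localValue, foreignValue) {
  const a = Array.isArray(localValue) ? localValue : [localValue];
  const b = Array.isArray(foreignValue) ? foreignValue : [foreignValue];
  return a.some(v => b.includes(v));
}

// Arrays on both sides, as in the example above:
const outer = [
  { xxa1: [100, 101] },
  { xxa1: [101, 102] },
  { xxa1: [102, 103] },
];
const inner = [
  { xxb1: [100, 101] },
  { xxb1: [101, 102] },
  { xxb1: [102, 103] },
];
const result = outer.map(o => ({
  ...o,
  xxb: inner.filter(i => lookupMatch(o.xxa1, i.xxb1)),
}));
```

With this data, [101,102] intersects all three inner arrays, while [100,101] and [102,103] each intersect two, which is exactly the $lookup result shown above.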

Representing the same model in a SQL database requires more than two tables, because the arrays must be normalized into additional junction tables:

-- Main entities  
CREATE TABLE xxa (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

CREATE TABLE xxb (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

-- Junction tables to represent the arrays  
CREATE TABLE xxa_values (  
    xxa_id INT REFERENCES xxa(id),  
    value INT,  
    PRIMARY KEY (xxa_id, value)  
);  

CREATE TABLE xxb_values (  
    xxb_id INT REFERENCES xxb(id),  
    value INT,  
    PRIMARY KEY (xxb_id, value)  
);  

Joining all tables and deduplicating the results serves as the equivalent of a MongoDB lookup:

SELECT   
    xxa.id as xxa_id,  
    ARRAY_AGG(DISTINCT xxa_vals.value) as xxa1,  
    ARRAY_AGG(  
        DISTINCT jsonb_build_object(  
            'id', xxb.id,  
            'xxb1', ARRAY(  
                SELECT value   
                FROM xxb_values   
                WHERE xxb_id = xxb.id   
                ORDER BY value  
            )  
        )  
    ) FILTER (WHERE xxb.id IS NOT NULL) as xxb  
FROM xxa  
LEFT JOIN xxa_values xxa_vals ON xxa.id = xxa_vals.xxa_id  
LEFT JOIN xxb_values xxb_vals ON xxa_vals.value = xxb_vals.value  
LEFT JOIN xxb ON xxb_vals.xxb_id = xxb.id  
GROUP BY xxa.id  
ORDER BY xxa.id;  

The next time you hear that a lookup in MongoDB is the same as a SQL join and don't understand the result or the performance, remember that it is a different, higher-level operation on a flexible schema.
A simple left outer join in SQL often disappears entirely in MongoDB, because the interconnected objects belong to a single document, like an aggregate in Domain-Driven Design.

Conclusion

I used abstract names for collections and fields to explain the behavior, but this pattern appears in many modern applications, where a small array fits the data naturally and an additional association table would complicate things unnecessarily. Users are identified by one or more email addresses, resources by tags, blogs by categories, and videos by interests, and many queries look for common interests, categories, or tags.

Percona Software for MongoDB Release Plan Updates

Percona commits to delivering robust, enterprise-grade, and community-driven software for MongoDB. We build our databases and tools to meet evolving data needs. As MongoDB continues to evolve, we’ve updated and clarified our release strategy for the Percona Software for MongoDB suite. This suite includes: Percona Server for MongoDB (PSMDB) Percona Backup for MongoDB (PBM) Percona […]