A curated list of database news from authoritative sources

April 09, 2025

Percona Server for MySQL: Enhanced Encryption UDFs

In Percona Server for MySQL 8.0.41 / 8.4.4, we introduced several improvements to the Encryption User-Defined Functions: support for RSAES-OAEP (OAEP) padding for RSA encrypt/decrypt operations, support for RSASSA-PSS (PSS) padding for RSA sign/verify operations, a new encryption_udf.legacy_padding_scheme component system variable, and normalized character set support for all Encryption UDFs. PKCS1 […]

April 07, 2025

How to safely cancel a database query

Cancelling a query from a UI client is more nuanced than it might seem. Here's how we implemented safe KILL QUERY operations in Tinybird.

April 06, 2025

A case where SQL joins struggle but MongoDB documents shine

Claims such as "Joins are slow" or "Joins don't scale" often prompt me to showcase how efficiently rows can be joined in a SQL database (example). However, the user perception of slowness remains, and it's essential to listen to developers and understand their problems with joins.

Joining tables in relational databases can sometimes lead to suboptimal execution plans when data filtering occurs post-join. This limitation arises because indexes are efficient when selective predicates are on the same table. However, there is no multi-table index, at least for OLTP. Data warehouses in Oracle Databases can use bitmap join indexes, star transformation, and materialized views to overcome this, but they are not suited to OLTP workloads. These suboptimal execution plans may be the reason why developers think that joins are slow, even if it is not the execution of the join itself that is slow.

Denormalization can help, but it undermines the advantages of normalization. In contrast, document databases like MongoDB use embedded documents to answer complex queries with fewer joins, and multi-key composite indexes offer efficient access paths that cover all selective filters.

Here is an example in PostgreSQL and MongoDB.

PostgreSQL relational model

Relational databases normalize one-to-many relationships to separate tables. For instance, consider the relationship between orders and their corresponding order details, where each order can have multiple associated entries in the order details table.

CREATE TABLE orders(
  id BIGSERIAL PRIMARY KEY,
  country_id INT,
  created_at TIMESTAMPTZ DEFAULT clock_timestamp()
);

CREATE TABLE order_details (
  id BIGINT REFERENCES orders ON DELETE CASCADE,
  line INT,
  product_id BIGINT,
  quantity INT,
  PRIMARY KEY(id, line)
);

I insert some data, with a product distribution that shifts over time, so that some products stop appearing in recent orders:

BEGIN TRANSACTION;

INSERT INTO orders(country_id)
 SELECT 10 * random() FROM generate_series(1,1000000);

INSERT INTO order_details (id, line, product_id, quantity)
 SELECT 
  id,
  generate_series(1,10),
  log(2,(1 + id * random())::int),
  100 * random()
  FROM orders;

COMMIT;
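
The product_id expression above, log(2, (1 + id * random())), is what makes some products fade out of recent orders: the higher the order id, the higher the product ids it can contain, so low product ids dominate early orders and become rare later. A quick JavaScript sketch (my illustration, not part of the original setup) shows the effect:

```javascript
// Mirror the SQL expression log(2, (1 + id * random())::int):
// the result falls between 0 and log2(id), so low product ids are
// common in early orders and rare among the latest ones.
function productId(id) {
  return Math.floor(Math.log2(1 + id * Math.random()));
}

// The highest possible product_id grows with the order id:
const earlyMax = Math.floor(Math.log2(1 + 1000));     // up to 9 for order 1000
const lateMax  = Math.floor(Math.log2(1 + 1000000));  // up to 19 for order 1000000
console.log(earlyMax, lateMax);
```

This is why querying for a low product id over the latest orders forces the database to scan far back in time before finding ten matches.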

A relational data model does not depend on specific access patterns; optimizing those patterns requires indexes. The primary keys already define indexes for navigating between orders and order details, but I have another use case: to analyze orders by country, product, and date range, I create the following indexes:


CREATE INDEX ON orders (country_id, created_at DESC, id);

CREATE INDEX ON order_details (product_id, id);

To analyze the latest orders for a specific country and product, I use the following SQL query:


PREPARE query(int,int,int) AS
 SELECT id, created_at, product_id, quantity
 FROM orders
 JOIN order_details d USING(id)
 WHERE country_id=$1 AND product_id=$2
 ORDER BY created_at DESC LIMIT $3
;

I vacuum and analyze to get the best execution plan:

postgres=# VACUUM ANALYZE orders, order_details;
VACUUM

Ideally, such a query should read only the rows for one country and one product, already ordered by date, so that the Top-n ORDER BY LIMIT can stop early instead of reading all rows and sorting them.

postgres=# EXPLAIN (analyze, buffers, costs off)
           EXECUTE query(1, 15, 10)
;
                                                                                                                                                                                                 QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.031..0.110 rows=10 loops=1)
   Buffers: shared hit=132
   ->  Nested Loop (actual time=0.030..0.108 rows=10 loops=1)
         Buffers: shared hit=132
         ->  Index Only Scan using orders_country_id_created_at_id_idx on orders (actual time=0.011..0.023 rows=39 loops=1)
               Index Cond: (country_id = 1)
               Heap Fetches: 0
               Buffers: shared hit=5
         ->  Index Scan using order_details_product_id_id_idx on order_details d (actual time=0.002..0.002 rows=0 loops=39)
               Index Cond: ((product_id = 15) AND (id = orders.id))
               Buffers: shared hit=127
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.272 ms
 Execution Time: 0.127 ms
(15 rows)

The execution plan is effective: it begins with the index on orders, whose (country_id, created_at DESC) prefix eliminates the need for sorting, and uses a nested loop join to preserve that order while pushing the join filter down to the other table.
Since the other filter, on order_details.product_id, can only be applied after the join, it had to read more orders (rows=39) to obtain the ten required rows, with one join loop per order. In this small example the overhead is minimal in time, loops (loops=39) and buffers (shared hit=127), but it highlights the issue: about four rows, and more than a dozen pages, are read for each row returned.

If I run the same query with another product that hasn't been ordered recently, it reads lots of orders before finding ten that include this product:

postgres=# EXPLAIN (analyze, buffers, costs off)
           EXECUTE query(1, 8, 10)
;
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=15.614..16.661 rows=10 loops=1)
   Buffers: shared hit=37582
   ->  Gather Merge (actual time=15.613..16.659 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=37582
         ->  Nested Loop (actual time=1.396..9.112 rows=7 loops=3)
               Buffers: shared hit=37582
               ->  Parallel Index Only Scan using orders_country_id_created_at_id_idx on orders (actual time=0.015..0.546 rows=4165 loops=3)
                     Index Cond: (country_id = 1)
                     Heap Fetches: 0
                     Buffers: shared hit=70
               ->  Index Scan using order_details_product_id_id_idx on order_details d (actual time=0.002..0.002 rows=0 loops=12494)
                     Index Cond: ((product_id = 8) AND (id = orders.id))
                     Buffers: shared hit=37512
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.272 ms
 Execution Time: 16.684 ms
(19 rows)

To get the 10-row result, this execution read 12,495 order rows across 3 parallel workers (rows=4165 loops=3) and 37,582 pages in total before it could find the Top-10 matching all filters.

The problem is that the user doesn't understand why this takes longer: it is the same query and returns the same number of rows. Moreover, reading so many unnecessary pages hurts other queries, as it evicts useful data from the shared buffers.

When the query planner estimates that this is too much, it stops trying to avoid a sort and switches to a hash join.


postgres=# EXPLAIN (analyze, buffers, costs off)
           EXECUTE query(1, 5, 10)
;
                                                                 QUERY PLAN                                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------                                     
 Limit (actual time=30.370..30.373 rows=10 loops=1)
   Buffers: shared hit=1882
   ->  Sort (actual time=30.369..30.371 rows=10 loops=1)
         Sort Key: orders.created_at DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=1882
         ->  Hash Join (actual time=28.466..30.324 rows=236 loops=1)
               Hash Cond: (d.id = orders.id)
               Buffers: shared hit=1882
               ->  Index Scan using order_details_product_id_id_idx on order_details d (actual time=0.013..1.434 rows=2311 loops=1)                                              
                     Index Cond: (product_id = 5)
                     Buffers: shared hit=1387
               ->  Hash (actual time=28.400..28.401 rows=99672 loops=1)
                     Buckets: 131072  Batches: 1  Memory Usage: 5697kB
                     Buffers: shared hit=495
                     ->  Index Only Scan using orders_country_id_created_at_id_idx on orders (actual time=0.010..13.136 rows=99672 loops=1)                                      
                           Index Cond: (country_id = 1)
                           Heap Fetches: 0
                           Buffers: shared hit=495
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.267 ms
 Execution Time: 30.415 ms
(23 rows)

This plan doesn't depend on the size of the result but must read too many rows (rows=2311 and rows=99672) before joining, filtering them (to rows=236), and sorting them. This is where it becomes a scalability problem: the response time depends on the size of the database rather than the result size. A query that is supposed to read orders from a small time window must read the whole history of orders for one country, and the whole history of details for one product.

Note that this example is the best case, where tables were freshly vacuumed, and Index Only Scan is optimal with Heap Fetches: 0. It will be more expensive on an active table.

MongoDB document model

MongoDB's document model allows embedding related data within a single collection, optimizing data locality in memory and disk.

Here is a collection that loads similar data to the previous sample, but with the order details embedded in the orders document, as they would be structured in a business document or an application object:

const bulkOps = [];
for (let i = 0; i < 1000000; i++) {
  const orderDetails = [];
  for (let line = 1; line <= 10; line++) {
    orderDetails.push({
      line: line,
      product_id: Math.floor(Math.log2(1 + i * Math.random())),
      quantity: Math.floor(100 * Math.random()),
    });
  }
  bulkOps.push({
    insertOne: {
      document: {
        country_id: Math.floor(10 * Math.random()),
        created_at: new Date(),
        order_details: orderDetails
      }
    }
  });
}
db.orders.bulkWrite(bulkOps);

One advantage of the document model is the ability to get an order with its details without any join:

test> db.orders.find().sort({created_at: -1}).limit(1);
[
  {
    _id: ObjectId('67f1a477aabaf2dad73f4791'),
    country_id: 3,
    created_at: ISODate('2025-04-05T21:45:21.546Z'),
    order_details: [
      { line: 1, product_id: 19, quantity: 40 },
      { line: 2, product_id: 18, quantity: 10 },
      { line: 3, product_id: 18, quantity: 75 },
      { line: 4, product_id: 18, quantity: 81 },
      { line: 5, product_id: 16, quantity: 66 },
      { line: 6, product_id: 14, quantity: 17 },
      { line: 7, product_id: 19, quantity: 82 },
      { line: 8, product_id: 19, quantity: 81 },
      { line: 9, product_id: 17, quantity: 56 },
      { line: 10, product_id: 19, quantity: 59 }
    ]
  }
]

Having all fields in one document allows creating a single index that covers all filters, and MongoDB's multi-key indexes can index fields inside embedded subdocuments:

db.orders.createIndex(
 { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
);

The query to retrieve the last ten orders for one country and one product is simple and needs no join:

db.orders.find(
  { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } }
).sort({ created_at: -1 }).limit(10);

Let's check the execution plan:

mdb> db.orders.find(
   { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } }
 ).sort({ created_at: -1 }).limit(10).explain(`executionStats`).executionStats
;
{
  executionSuccess: true,
  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 10,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 10,
    executionTimeMillisEstimate: 0,
    works: 11,
    advanced: 10,
    limitAmount: 10,
    inputStage: {
      stage: 'FETCH',
      filter: {
        order_details: { '$elemMatch': { product_id: { '$eq': 15 } } }
      },
      nReturned: 10,
      executionTimeMillisEstimate: 0,
      works: 10,
      advanced: 10,
      docsExamined: 10,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 10,
        executionTimeMillisEstimate: 0,
        works: 10,
        advanced: 10,
        keyPattern: {
          country_id: 1,
          'order_details.product_id': 1,
          created_at: -1
        },
        indexName: 'country_id_1_order_details.product_id_1_created_at_-1',
        isMultiKey: true,
        multiKeyPaths: {
          country_id: [],
          'order_details.product_id': [ 'order_details' ],
          created_at: []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          country_id: [ '[1, 1]' ],
          'order_details.product_id': [ '[15, 15]' ],
          created_at: [ '[MaxKey, MinKey]' ]
        },
        keysExamined: 10,
        seeks: 1,
        dupsTested: 10,
        dupsDropped: 0
      }
    }
  }
}

The plan shows lots of details, but the most important is:

  nReturned: 10,
  totalKeysExamined: 10,
  totalDocsExamined: 10,

To get the 10 documents of the result, MongoDB read only 10 index entries and 10 documents. This is optimal: it reads only what is necessary. The index scan has bounds for all equality filters and returns the documents already ordered, so no additional sort is needed:

        indexBounds: {
          country_id: [ '[1, 1]' ],
          'order_details.product_id': [ '[15, 15]' ],
          created_at: [ '[MaxKey, MinKey]' ]
        },

In addition to being fast, the performance is predictable because this execution plan is always the same. This is visible with allPlansExecution:

mdb> db.orders.find(
   { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } }
 ).sort({ created_at: -1 }).limit(10).explain(`allPlansExecution`).queryPlanner
;
{
  namespace: 'test.orders',
  parsedQuery: {
    '$and': [
      {
        order_details: { '$elemMatch': { product_id: { '$eq': 42 } } }
      },
      { country_id: { '$eq': 1 } }
    ]
  },
  indexFilterSet: false,
  queryHash: '0DAE06A4',
  planCacheShapeHash: '0DAE06A4',
  planCacheKey: 'C3D96884',
  optimizationTimeMillis: 0,
  maxIndexedOrSolutionsReached: false,
  maxIndexedAndSolutionsReached: false,
  maxScansToExplodeReached: false,
  prunedSimilarIndexes: false,
  winningPlan: {
    isCached: false,
    stage: 'LIMIT',
    limitAmount: 10,
    inputStage: {
      stage: 'FETCH',
      filter: {
        order_details: { '$elemMatch': { product_id: { '$eq': 42 } } }
      },
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: {
          country_id: 1,
          'order_details.product_id': 1,
          created_at: -1
        },
        indexName: 'country_id_1_order_details.product_id_1_created_at_-1',
        isMultiKey: true,
        multiKeyPaths: {
          country_id: [],
          'order_details.product_id': [ 'order_details' ],
          created_at<... (truncated)
                                    

by Franck Pachot

April 04, 2025

April 03, 2025

The Tech that Convex replaced for ClarityText

Convex might look like just a database, but for modern app developers, it's a full backend-as-a-service platform that replaces the complexity of stitching together ORMs, caching, rate limiting, realtime sync, and CI/CD. In this post, a startup founder walks through how they used Convex to build ClarityText—an AI-powered chat knowledge base—and explains why Convex delivers faster MVPs, deeper type safety, and a seamless developer experience without third-party services like Prisma, Redis, or Firebase. Perfect for developers evaluating backend choices in 2024 and beyond.

4 Ways to Convert MongoDB BSON to JSON Format

This post was originally published in April, 2023 and was updated in April, 2025. Binary JavaScript Object Notation (BSON) is a binary-encoded serialization of JSON documents. JSON is easier to understand as it is human-readable, but compared to BSON, it supports fewer data types. BSON has been extended to add some optional non-JSON-native data types, […]

Foreign Keys: A must in SQL, but not in a Document Database?

In relational SQL databases, foreign keys are essential for guaranteeing referential integrity. There are two main reasons to declare them, particularly regarding their cascading behavior:

  1. Enforcing Strong Relationships: SQL databases require entities to be stored in separate tables, even when they share the same lifecycle. This separation is mandated by normalization rules, particularly the first normal form for One-to-Many relationships. Foreign keys establish a dependency between these tables and ensure that changes in one table (such as deletions) are either reflected in the child table (using "ON DELETE CASCADE") or prohibited (with "ON DELETE RESTRICT"). For instance, in a blog application, you want to avoid orphaned comments that are not linked to a blog post. If the blog post is deleted, all associated comments should also be deleted, or the deletion should be rejected when comments exist.
    In addition to consistency, the declaration of the foreign key is necessary for the query planner to be aware of the strong relationship between the two tables, so that it can estimate the result of the join more accurately.

  2. Consistency in Join Operations: SQL tables are often joined using inner joins, the default, where unmatched rows are discarded. This can lead to inconsistencies in query results, depending on whether there is a join or not. Foreign keys help mitigate these inconsistencies by ensuring that a valid foreign key reference always matches a corresponding row. For example, an author may need to be removed under the GDPR "Right to Erasure", but not if doing so would make some blog posts invisible. The application must handle this differently, reassigning posts to an "unknown author" entry before the deletion. Another possibility would be to always use an outer join in queries, but this limits the possible join optimizations.

There are other reasons to enforce foreign keys, but they also involve some application logic and cannot be a simple database declaration, except if you accept business logic in triggers. For example, if you rename a blog category, do you want all blog posts to reflect the new name? Should the old name remain for existing posts, and the new category only apply to new posts? You might want to prevent the removal of a category that is still in use or allow it while removing the category from all posts. If it was the main category of a post, you may need to designate another category as the main one, but not pick a random one. These decisions depend on business reasons for renaming: was it a simple typo, or is the category replaced or being split into two?

Such complex scenarios cannot be simply managed by the database through generic ON DELETE or ON UPDATE commands. The business transaction that removes a category would first analyze the consequences and make the necessary changes. Typically, the category is logically deleted, meaning it can no longer be used. However, joins will still show results with the old name until the posts in that category are reassigned. The category can later be physically deleted, when there are no orphaned records.

What About MongoDB?

The primary difference between a relational data model and a document data model is that the former is designed for a central database without knowing the future use cases, while the latter is tailored to specific domains with known application access patterns.
A document model is application-centric, meaning the application itself implements a logic that maintains the integrity, including validating lists of values or existing references before insertion, and safe preparation before deletion or update.

In the two cases mentioned above, when referential integrity does not involve additional application logic but is part of the data model, MongoDB does not require foreign keys for enforcement:

  1. Enforcing Strong Relationships is inherent in the document model, as entities sharing the same lifecycle are typically embedded within a single document. Since MongoDB can store and index sub-documents, the separation of a One-to-Many relationship into different documents is unnecessary. Sub-documents cannot become orphaned because they are owned by their parent document.

  2. Consistency in Join Operations is guaranteed through the $lookup operation, which functions similarly to a left outer join. This ensures that documents from the driving collection are not discarded even if their lookup references do not match. The application is responsible for handling unmatched references in the aggregation pipeline.
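To see why $lookup cannot drop driving documents, here is a minimal sketch of its left-outer-join semantics, simulated with plain arrays so no database is needed (the collection names and documents are made up for illustration):

```javascript
// Hypothetical data: one order references a product that no longer exists.
const orders = [
  { _id: 1, product_id: 10 },
  { _id: 2, product_id: 99 }   // dangling reference: no product 99
];
const products = [{ _id: 10, name: "Widget" }];

// $lookup semantics: every driving document is kept; the joined field
// is an array, empty when nothing matches (like a left outer join).
const joined = orders.map(o => ({
  ...o,
  product: products.filter(p => p._id === o.product_id)
}));
// Order 2 survives with product: [], so the pipeline decides what to do
// with the unmatched reference instead of silently losing the order.
```

The equivalent aggregation stage would be `{ $lookup: { from: "products", localField: "product_id", foreignField: "_id", as: "product" } }`; the empty array plays the role of the NULL-padded row in SQL.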

To illustrate other cases, consider a Many-to-One relationship, such as a product catalog in an order-entry system. A key advantage of normalization is that changing a product name requires only a single row update in the reference table, reflecting across all queries that join the product table. However, this can lead to inconsistencies outside the database. For example, a customer may receive an email at order completion or a printed bill displaying the old product name, which could mislead them if a different product name is displayed on the website.

In operational databases, a product is rarely deleted or renamed in place; instead, a new version is created in the relational database, while previous orders reference the earlier version. In a document model, the product information associated with the order, as it appeared in confirmation emails or on bills, is stored directly in the order document. While this may look like denormalization, the duplication is driven by business logic (not changing what was already sent or printed) rather than solely by performance.
Typically, additional fields are included as an extended reference, rather than embedding the entire product document. A separate collection of products still exists, allowing updates without cascading effects, since the necessary details are copied into the order document itself. Conceptually, this resembles having two product entities because of temporality: the current catalog for new orders, and a snapshot of the catalog at the time of the order, embedded within the order.
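An extended reference can be sketched as a document shape (the field names and values here are invented for illustration): the order keeps a reference into the products collection plus a copy of only the fields that appear on bills and emails.

```javascript
// Hypothetical order document using the extended-reference pattern.
const order = {
  _id: "order-1001",
  created_at: new Date("2025-04-03"),
  lines: [
    {
      product_id: "prod-42",      // reference into the products collection
      name: "Espresso Machine",   // snapshot of the name at order time
      unit_price: 249.99,         // snapshot of the price at order time
      quantity: 1
    }
  ]
};
// Renaming or repricing the product in the catalog later does not alter
// this snapshot, so a re-printed bill matches what the customer was sent.
```

Queries that only display the order need no join at all; a $lookup to the products collection is reserved for cases that genuinely need the current catalog entry.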

I did not cover Many-to-Many relationships, as they are transformed into two One-to-Many relationships in SQL with an additional association table. With document data modeling, this works similarly, as a reference or extended reference, since MongoDB accommodates an array of references. For example, accepting new blog posts with multiple authors requires a complex schema migration in SQL, as the One-to-Many becomes a Many-to-Many, but it fits seamlessly into the same model with the same indexes in MongoDB. Another example can be found in The World Is Too Messy for SQL to Work.

Comparison

In conclusion, MongoDB does not lack foreign keys. It simply manages relationships differently than relational SQL databases. While SQL relies on foreign keys for referential integrity, MongoDB employs an application-centric approach: data embedding that is not limited by the normal forms, and update logic that cascades updates through an aggregation pipeline. This strategy effectively maintains document relationships, ensuring consistency without the need for foreign key constraints.
Although it may require more code, avoiding unnecessary foreign keys facilitates sharding and horizontal scaling with minimal performance impact. Ultimately, with MongoDB, maintaining database integrity during reference table updates is a shared responsibility between the database and the application, in contrast to SQL's in-database declarations, which are use-case agnostic.

Migration

When migrating from PostgreSQL to MongoDB, provided that the relational data modeling was done correctly, consider the following:

  • ON DELETE CASCADE indicates a shared lifecycle and should be transformed into an embedded sub-document.
  • ON DELETE SET NULL/DEFAULT represents an independent lifecycle. Use an extended reference to copy relevant fields to the child at insert time instead of using joins during queries.
  • ON DELETE RESTRICT/NO ACTION requires application logic before deletion. In MongoDB, you can manage consistency and performance with logical deletions prior to physical ones, and utilize an aggregation pipeline to execute them.
  • ON UPDATE actions indicate a natural key was used, and any key exposed to the user might be updated, but it's preferable to utilize an immutable _id for references.
  • Primary Keys that include the Foreign Keys suggest a Many-to-Many association. Convert this into an array of references on the side where the size is predictable. This approach eliminates the need for an additional array on the opposite side because the multi-key index on the array serves for the navigation from the other side.
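The first rule (ON DELETE CASCADE becomes an embedded sub-document) can be sketched as a migration transform, again simulated with plain arrays; the posts/comments tables and their fields are hypothetical:

```javascript
// Hypothetical relational rows: comments cascade-delete with their post,
// i.e. they share the post's lifecycle.
const posts = [{ id: 1, title: "Hello" }];
const comments = [
  { id: 10, post_id: 1, text: "First!" },
  { id: 11, post_id: 1, text: "Nice post" }
];

// Fold each parent row and its child rows into one document.
const documents = posts.map(p => ({
  _id: p.id,
  title: p.title,
  // The foreign key disappears: comments become sub-documents, so
  // deleting the post document removes them atomically, no cascade needed.
  comments: comments
    .filter(c => c.post_id === p.id)
    .map(({ id, text }) => ({ id, text }))
}));
```

Note that the post_id column is dropped during the fold: ownership is now expressed by containment rather than by a reference that must be enforced.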

If you belong to my generation, you likely engaged in relational data modeling using UML class diagrams. This may help in understanding how I categorized the foreign keys by ownership and lifecycle:

  Parent Ownership   Lifecycle   UML                Role      Document Modeling
  Exclusive          Shared      ◆─ Composition     part-of   Embedded
  Shared             Indep.      ◇─ Aggregation     has-a     Reference
  None               Indep.      ── Association     use-a     Reference + Embed

We made an open source LLM Performance Tracker

You can't just build AI features; you have to operate them in production, which means observability. Here's an open source tool to watch your LLMs in real time.

April 02, 2025

Improve PostgreSQL performance using the pgstattuple extension

In this post, we explore the pgstattuple extension in depth; what insights it offers, how to use it to diagnose issues in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL, and best practices for harnessing its capabilities.

Essential MongoDB Backup Best Practices for Data Protection

This blog was originally published in September 2020 and was updated in April 2025. As a MongoDB user, ensuring your data is safe and secure in the event of a disaster or system failure is crucial. That’s why it’s essential to implement effective MongoDB backup best practices and strategies. Regular database backups are the cornerstone […]

Fast Compilation or Fast Execution: Just Have Both!

At CedarDB, we believe in high-performance computing. And one of the easiest tricks to high performance is to do only what you really need to do. So you want to reduce data movement or reduce unnecessary branches. But how can you be sure you are only doing what you need to do? Especially when you want to be flexible and configurable at runtime.

Basically, you want to tune the code based on as much information as you can get beforehand. If you’re a programmer tasked with optimizing a snippet, you’d probably start by gathering as much information as you can about the data set. Then you would look at data alignment and loops to remove any unnecessary checks and pre-calculate or cache as much as possible. Finally, the code is tailored as much as possible to your challenge so that it does only one thing, but does it extremely well. Unfortunately, as developers, we cannot just write code that does one thing because there are users. Users demand interactivity, and in our case, they want to decide how they access the data. So isn’t there a way to have your cake and eat it too?

April 01, 2025

Choosing the Right Storage for PostgreSQL on Kubernetes: A Benchmark Analysis

As more organizations move their PostgreSQL databases onto Kubernetes, a common question arises: Which storage solution best handles its demands? Picking the right option is critical, directly impacting performance, reliability, and scalability. For stateful workloads like PostgreSQL, storage must offer high availability and safeguard data integrity, even under intense, high-volume conditions. To address these concerns, […]

Kubernetes Sidecar Containers Explained: Benefits, Use Cases, and What’s New

Kubernetes is becoming a popular choice for running containerized applications. While the core idea is to have a single container running the application in a Pod, there are many cases where one or more containers need to run alongside the application container, such as containers for capturing logs, metrics, etc. This approach is typically referred […]

Is MongoDB Truly Open Source? A Critical Look at SSPL

This post was originally published in April 2023 and was updated in April 2025. When asked “Is MongoDB open source?”, at Percona, our definitive answer is “no.” Let’s start with this: MongoDB is accurately referred to as source-available software. Whereas many open source software offerings — like the community version of MySQL — use the […]

PostgreSQL Security: A Comprehensive Guide to Hardening Your Database

This blog was published in January of 2021 and was updated in April of 2025. PostgreSQL database security involves safeguarding multiple facets of the database ecosystem. It’s a broad topic because protection must extend beyond the database node itself. The diagram below illustrates key components requiring protection, emphasizing that the database is part of a […]

How to Run MongoDB on Kubernetes: Solutions, Pros and Cons

This blog was originally published in August 2022 and was updated in April 2025. In this blog, we’ll examine the increasingly popular practice of running MongoDB on Kubernetes. We will explore various solutions and approaches to this setup, from direct deployments as a stateful application to utilizing specialized operators and considering cloud-based solutions, guiding you […]