a curated list of database news from authoritative sources

April 12, 2025

Battle of the Mallocators

If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.

I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.

There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details of his setup and I failed to reproduce his result on my setup. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem; it is just an artifact of jemalloc that can be confusing. But RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.

tl;dr

  • For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
  • With respect to peak RSS
    • For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
    • For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
  • For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
    • For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
    • For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc

Prior art

I have several blog posts on using jemalloc with MyRocks.

  • October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
  • April 2017 - Performance for large, concurrent allocations
  • April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
  • August 2023 - RocksDB and glibc malloc
  • September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS) 
  • September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
  • October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0

Builds, configuration and hardware

I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.

The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

For malloc the server uses:
  • glibc
    • version 2.35-0ubuntu3.9
  • tcmalloc
    • provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
  • jemalloc
    • provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf

The configuration files are here for InnoDB and for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM, so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using between 30G and 40G for MyRocks -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.
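
A minimal sketch for confirming the effective cache sizes on a running instance, assuming the usual variable names (rocksdb_block_cache_size is the MyRocks block cache setting in FB MySQL and Percona builds):

-- InnoDB buffer pool size, in GB
SELECT @@innodb_buffer_pool_size / POW(1024,3) AS innodb_buffer_pool_gb;

-- MyRocks (RocksDB) block cache size, in bytes
SHOW GLOBAL VARIABLES LIKE 'rocksdb_block_cache_size';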

Benchmark

I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement.

The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.

Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache) but I don't do that here because I want MyRocks to do IO, as allocations per storage read create much drama for the allocator.

The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256

Peak VSZ and RSS

The tables below show the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.

Peak values for InnoDB with 80G buffer pool
alloc           VSZ     RSS     RSS/80
glibc           88.2    86.5    1.08
tcmalloc        88.1    85.3    1.06
jemalloc        91.5    87.0    1.08

Peak values for MyRocks with 10G buffer pool
alloc           VSZ     RSS     RSS/10
glibc           46.1    36.2    3.62
tcmalloc        15.3    13.1    1.31
jemalloc        45.6    12.2    1.22

Performance: InnoDB

From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.

The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from vmstat metrics here I see that CPU per operation (cpu/o) and context switches per operation (cs/o) are much larger for jemalloc than for tcmalloc.

These results use the relative QPS, defined as (QPS with $allocator) / (QPS with glibc malloc), where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0, QPS is better with that allocator than with glibc malloc.
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1 col-2
0.99 1.02 hot-points_range=100
1.05 1.04 point-query_range=100
0.96 0.99 points-covered-pk_range=100
0.98 0.99 points-covered-si_range=100
0.96 0.99 points-notcovered-pk_range=100
0.97 0.98 points-notcovered-si_range=100
0.97 1.00 random-points_range=1000
0.95 0.99 random-points_range=100
0.99 0.99 random-points_range=10
1.04 1.03 range-covered-pk_range=100
1.05 1.07 range-covered-si_range=100
1.04 1.03 range-notcovered-pk_range=100
0.98 1.00 range-notcovered-si_range=100
1.02 1.02 read-only-count_range=1000
1.05 1.07 read-only-distinct_range=1000
1.07 1.12 read-only-order_range=1000
1.28 1.09 read-only_range=10000
1.03 1.05 read-only_range=100
1.05 1.08 read-only_range=10
1.08 1.07 read-only-simple_range=1000
1.04 1.03 read-only-sum_range=1000
1.02 1.02 scan_range=100
1.01 1.00 delete_range=100
1.03 1.01 insert_range=100
1.02 1.02 read-write_range=100
1.03 1.03 read-write_range=10
1.01 1.02 update-index_range=100
1.15 0.98 update-inlist_range=100
1.06 0.99 update-nonindex_range=100
1.03 1.03 update-one_range=100
1.02 1.01 update-zipf_range=100
1.18 1.05 write-only_range=10000

Performance: MyRocks

From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.

The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from vmstat metrics here I see that CPU per operation (cpu/o) and context switches per operation (cs/o) are both much larger for tcmalloc.

These results use the relative QPS, defined as (QPS with $allocator) / (QPS with glibc malloc), where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0, QPS is better with that allocator than with glibc malloc.
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1 col-2
0.68 1.00 hot-points_range=100
1.04 1.04 point-query_range=100
1.09 1.09 points-covered-pk_range=100
1.00 1.09 points-covered-si_range=100
1.09 1.09 points-notcovered-pk_range=100
1.10 1.12 points-notcovered-si_range=100
1.08 1.08 random-points_range=1000
1.09 1.09 random-points_range=100
1.05 1.10 random-points_range=10
0.99 1.07 range-covered-pk_range=100
1.01 1.03 range-covered-si_range=100
1.05 1.09 range-notcovered-pk_range=100
1.10 1.09 range-notcovered-si_range=100
1.07 1.05 read-only-count_range=1000
1.00 1.00 read-only-distinct_range=1000
0.98 1.04 read-only-order_range=1000
1.03 1.03 read-only_range=10000
0.96 1.03 read-only_range=100
1.02 1.04 read-only_range=10
0.98 1.07 read-only-simple_range=1000
1.07 1.09 read-only-sum_range=1000
1.02 1.02 scan_range=100
1.05 1.03 delete_range=100
1.11 1.07 insert_range=100
0.96 0.97 read-write_range=100
0.94 0.95 read-write_range=10
1.08 1.04 update-index_range=100
1.08 1.07 update-inlist_range=100
1.09 1.04 update-nonindex_range=100
1.04 1.04 update-one_range=100
1.07 1.04 update-zipf_range=100
1.03 1.02 write-only_range=10000

April 11, 2025

Oracle Multi-Value Index and ORDER BY Pagination queries

In the previous post, I highlighted a case where relational SQL databases struggle with optimizing index access with joins. In contrast, a document model using MongoDB's multi-key indexes efficiently retrieves only the necessary data for the results.

SQL databases have implemented document datatypes, storing them in binary JSON columns in SQL tables, but that's not enough to compete as a document database. It's essential to look at the JSON path indexing possibilities.
A JSON document may include nested sub-documents, and you must index paths that may go through arrays. Oracle 21c introduced multi-value indexes for JSON. Let's test it with the same example from the previous post where MongoDB reads only the necessary index entries, filtering upfront thanks to embedding a One-to-Many instead of joining.

Example

Since I demonstrated this on MongoDB, I used the MongoDB-compatible API of Oracle 23c to keep the same syntax, which saves me from migrating to the SQL/JSON syntax:

const bulkOps = [];
for (let i = 0; i < 1000000; i++) {
  const orderDetails = [];
  for (let line = 1; line <= 10; line++) {
    orderDetails.push({
      line: line,
      product_id: Math.floor(Math.log2(1 + i * Math.random())),
      quantity: Math.floor(100 * Math.random()),
    });
  }
  bulkOps.push({
    insertOne: {
      document: {
        country_id: Math.floor(10 * Math.random()),
        created_at: new Date(),
        order_details: orderDetails
      }
    }
  });
}
db.orders.bulkWrite(bulkOps);

db.orders.createIndex(
 { "country_id": 1, "order_details .product_id": 1, "created_at": -1 }
);

The MongoDB emulation on Oracle Database has converted this to the following SQL table with a multi-value index:

CREATE JSON COLLECTION TABLE "ORA"."orders" 
   ( "SYS_NC00005$" RAW(4000) GENERATED ALWAYS AS ( JSON_MKMVI(JSON_TABLE( "DATA", '$' PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH '$."country_id"[*]' COLUMNS( "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH '$."order_details"[*]' COLUMNS( NESTED PATH '$."product_id"[*]' COLUMNS( "K1" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) , NESTED PATH '$."created_at"[*]' COLUMNS( "K2" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH  DESC ) ) )  AS "K0","K1","K2" DESC )) VIRTUAL , 
    "SYS_NC00006$" RAW(4000) GENERATED ALWAYS AS (JSON_QUERY("DATA" FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)) VIRTUAL , 
    "SYS_NC00007$" RAW(6001) GENERATED ALWAYS AS (JSON_QUERY("DATA" FORMAT OSON , '$."created_at"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE DESC )) VIRTUAL 
   )  DEFAULT COLLATION "USING_NLS_COMP" ;

CREATE MULTIVALUE INDEX "ORA"."$ora:orders.country_id_1_order_details.product_id_1_created_at_-1" ON "ORA"."orders" (JSON_MKMVI(JSON_TABLE( "DATA", '$' PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH '$."country_id"[*]' COLUMNS( "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH '$."order_details"[*]' COLUMNS( NESTED PATH '$."product_id"[*]' COLUMNS( "K1" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) , NESTED PATH '$."created_at"[*]' COLUMNS( "K2" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH  DESC ) ) )  AS "K0","K1","K2" DESC ), JSON_QUERY("DATA" FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE), JSON_QUERY("DATA" FORMAT OSON , '$."created_at"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE DESC ) DESC) 
  ;

The execution plan provided by the translation layer shows some SQL, but not the execution statistics:


ora> db.orders.find( { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } } ).sort({ created_at: -1 }).limit(10).explain(`executionStatistics`)

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'ora.orders',
    indexFilterSet: false,
    parsedQuery: {
      '$query': {
        '$and': [
          { country_id: { '$numberOnly': 1 } },
          { 'order_details[*]': { product_id: { '$numberOnly': 15 } } }
        ]
      },
      '$orderby': {
        '$fields': [ { path: 'created_at', order: 'desc', sortByMinMax: true } ],
        '$lax': true
      }
    },
    rewrittenQuery: {
      '$and': [
        {
          '$query': {
            '$and': [
              { country_id: { '$numberOnly': 1 } },
              {
                'order_details[*]': { product_id: { '$numberOnly': 15 } }
              }
            ]
          }
        },
        {
          '$orderby': {
            '$fields': [
              { path: 'created_at', order: 'desc', sortByMinMax: true }
            ],
            '$lax': true
          }
        }
      ]
    },
    winningPlan: {
      stage: 'SELECT STATEMENT',
      inputStage: {
        stage: 'COUNT',
        options: 'STOPKEY',
        columns: '"from$_subquery$_002"."DATA"[JSON,8200], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]',
        filterType: 'filter',
        filter: 'ROWNUM<=10',
        inputStage: {
          stage: 'VIEW',
          columns: '"from$_subquery$_002"."DATA"[JSON,8200], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]',
          inputStage: {
            stage: 'SORT',
            options: 'ORDER BY STOPKEY',
            columns: `(#keys=1) JSON_VALUE(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$.created_at[*].max()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR TYPE(LAX) ) FORMAT OSON , '$' RETURNING ANY ORA_RAWCOMPARE(32767) ERROR ON ERROR TYPE(LAX) )[32767], "DATA" /*+ LOB_BY_VALUE */ [JSON,8200], JSON_VALUE("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$._id' RETURNING ANY ORA_RAWCOMPARE(2000) NO ARRAY ERROR ON ERROR TYPE(LAX) )[RAW,2000], "ETAG"[RAW,16]`,
            filterType: 'filter',
            filter: 'ROWNUM<=10',
            path: "$.created_at[*].max()'",
            inputStage: {
              stage: 'TABLE ACCESS',
              options: 'BY INDEX ROWID BATCHED',
              source: 'orders',
              columns: `"DATA" /*+ LOB_BY_VALUE */ [JSON,8200], JSON_VALUE("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$._id' RETURNING ANY ORA_RAWCOMPARE(2000) NO ARRAY ERROR ON ERROR TYPE(LAX) )[RAW,2000], "ETAG"[RAW,16]`,
              path: "$._id'",
              inputStage: {
                stage: 'HASH',
                options: 'UNIQUE',
                columns: '(#keys=2) "orders".ROWID[ROWID,10], SYSVARCOL[8]',
                inputStage: {
                  stage: 'INDEX',
                  options: 'RANGE SCAN (MULTI VALUE)',
                  source: '$ora:orders.country_id_1_order_details.product_id_1_created_at_-1',
                  columns: `"orders".ROWID[ROWID,10], JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."country_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)[RAW,4000], JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)[RAW,4000], SYSVARCOL[8]`,
                  filterType: 'access',
                  filter: `JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."country_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)=SYS_CONS_ANY_SCALAR(:1, 3) AND JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)=SYS_CONS_ANY_SCALAR(:2, 3)`
                }
              }
            }
          }
        }
      }
    },
    rejectPlans: []
  },
  serverInfo: { host: 'localhost', port: 27017, version: '4.2.14' },
  ok: 1
}
ora> 

However, it's the SQL engine behind it, and I captured the SQL statement from GV$SQL:


SELECT /*+ FIRST_ROWS(10) monitor */
    "DATA",
    RAWTOHEX("RESID"),
    "ETAG"
FROM
    "ORA"."orders"
WHERE
    JSON_EXISTS ( "DATA", '$?( 
        (@.country_id.numberOnly() == $B0)
        &&
        ( exists(@.order_details[*]?( (@.product_id.numberOnly() == $B1) )) ) 
    )'
    PASSING :1 AS "B0",
    :2 AS "B1" TYPE ( STRICT ) )
ORDER BY
    JSON_QUERY("DATA", '$.created_at[*].max()') DESC NULLS LAST
FETCH NEXT 10 ROWS ONLY
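
For reference, a lookup along these lines can find that statement in GV$SQL (a sketch; the LIKE filter is only an assumption to narrow the search to the JSON_EXISTS query above):

SELECT sql_id, executions, elapsed_time, sql_fulltext
FROM gv$sql
WHERE sql_text LIKE '%JSON_EXISTS%'
  AND sql_text NOT LIKE '%gv$sql%';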

I have added the /*+ monitor */ hint to get the SQL execution plan with execution statistics:
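
The monitored report can be retrieved with something like the following (a sketch; :sql_id is a placeholder for the value found in GV$SQL):

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id       => :sql_id,
         type         => 'TEXT',
         report_level => 'ALL') AS report
FROM dual;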

Unlike MongoDB, which reads only ten documents for such a .find().sort().limit(10) query, Oracle's multi-value index had to read all matching index entries (53K rows) and documents, and sort them, before returning the Top-10. The index was used for filtering on the WHERE clause, but not for ordering on the ORDER BY with FETCH NEXT 10 ROWS ONLY, despite having "created_at" in the key next to the equality filter columns.

Possible explanation

I tried to understand whether it is a limitation of Oracle multi-value indexes or of the query planner. The internals of Oracle multi-value indexes are not documented, but there's a patent that describes the idea: https://patents.google.com/patent/US11640380B2

My understanding from it is that the following document:

{
  "_id": ObjectId('67f1a477aabaf2dad73f4791'),
  "country_id": 1,
  "created_at": ISODate('2025-04-05T21:45:21.546Z'),
  "order_details": [
    { "line": 1, "product_id": 19, "quantity": 40 },
    { "line": 2, "product_id": 15, "quantity": 10 },
    { "line": 3, "product_id": 18, "quantity": 75 },
    { "line": 4, "product_id": 15, "quantity": 50 }
  ]
}

with the following index:

db.orders.createIndex(
 { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
);

results in the following index entries where # is the ordinal position in the "order_details" array:

+------------+--------------------+---------------------------+
| country_id |  order_details     | created_at                |
|            +---------------+----+                           |
|            | .product_id   | #  |                           |
+------------+---------------+----+---------------------------+
| 1          |  19           | 0  | 2025-04-05T21:45:21.546Z  |
| 1          |  15           | 1  | 2025-04-05T21:45:21.546Z  |
| 1          |  18           | 2  | 2025-04-05T21:45:21.546Z  |
| 1          |  15           | 3  | 2025-04-05T21:45:21.546Z  |
+------------+---------------+----+---------------------------+

By applying an equality predicate on "country_id" and "product_id," and considering that "created_at" is at the same level and not nested in "order_details," the optimizer could potentially recognize that the entries are sorted by "created_at". Note that it is possible that the ordinal position is not present as I see no FOR ORDINALITY clause in the CREATE MULTIVALUE INDEX statement.

The index entries are ordered, and a TABLE ACCESS BY INDEX ROWID (when it is not BATCHED to reduce scattered reads) preserves that order, so a blocking sort operation could be avoided. This could enable the query to read only the necessary documents, but that is not what I observe in the execution plan: a HASH UNIQUE operation breaks the ordering in between. It is true that multiple index entries may be returned for the same document and we want each document only once, but I would have expected a SORT UNIQUE NOSORT to preserve the ordering.

Note that all of this is guesswork; I ran this on the autonomous managed service, where I do not have access to the datafiles to look at the internals of the multi-value index.

Conclusion

In exploring multi-value indexes in Oracle for sorting and filtering nested JSON data, a key performance difference emerges compared to MongoDB. Oracle's indexes manage complex JSON structures but struggle with key ordering, requiring an additional blocking sort, which is problematic in pagination queries. This was tested on Oracle 23ai (23.7.0.25.03), which reports a MongoDB compatibility version of 4.2.14, and may change in the future since the multi-value indexes apparently store the entries in order.

Thus, when evaluating a SQL database to be used as a document database, it's vital to consider both efficient JSON storage and indexing options. A comment in the previous post shows that it's the same with PostgreSQL, as GIN indexes, required for JSON paths through arrays, do not return entries in order.
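
To illustrate that last point, here is a minimal PostgreSQL sketch (assuming the orders are stored in a jsonb column named data): a GIN index can filter on a path through the array, but it returns matches in no particular order, so the ORDER BY still needs an explicit sort before the LIMIT.

-- GIN index on the whole document, supporting containment (@>) searches
CREATE INDEX ON orders USING gin (data jsonb_path_ops);

-- Filters on a path through the order_details array, but the index provides
-- no ordering, so the Top-10 requires sorting all matching rows
SELECT data
FROM orders
WHERE data @> '{"country_id": 1, "order_details": [{"product_id": 15}]}'
ORDER BY data->>'created_at' DESC
LIMIT 10;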

April 10, 2025

How Heroku migrated hundreds of thousands of self-managed PostgreSQL databases to Amazon Aurora

In this post, we discuss how Heroku migrated their multi-tenant PostgreSQL database fleet from self-managed PostgreSQL on Amazon Elastic Compute Cloud (Amazon EC2) to Amazon Aurora PostgreSQL-Compatible Edition. Heroku completed this migration with no customer impact, increasing platform reliability while simultaneously reducing operational burden. We dive into Heroku and their previous self-managed architecture, the new architecture, how the migration of hundreds of thousands of databases was performed, and the enhancements to the customer experience since its completion.

What I'd do as a College Freshman in 2025

Do Computer Science

Absolutely. Still would.

Many are spooked by LLMs. Some, like Jensen Huang, argue that "nobody has to learn how to program."  

I argue the opposite. And I double down. 

Being supported by AI tools is not a substitute for mastery. You can’t borrow skills. You have to earn them.  

Computer science builds vital skills: hacking, debugging, abstract thinking, and quick adaptation. These don’t go out of style.  

Do STEM. It’s LLM-resistant. LLMs can retrieve and remix information, but do you know what to do with them? Like the dog chasing the car, what now? STEM teaches you that. It teaches you to think, to reason, to act. It gets you from information to wisdom. But only after you've mastered the foundations.

We're heading into the age of π-shaped people: depth in two areas, and generalist across. Building depth first, and then ranging wide, is a good strategy.  

So yes, I would learn the foundations of both CS and AI. And then do AI + X, where X is systems, databases, or PL. These combos are powerful.


Build Soft Skills

You can't skip these. Soft skills are indispensable, classic Lindy.

Clear communication matters more than ever:

  • for working with others (especially remotely)
  • for building in the open
  • for working with LLMs.

Management skills matter too. Not just for leading teams, but also for leading yourself. Self-help books get mocked for their vacuous/repetitive advice, but they are useful for the querying/investigating they ignite in you. Know yourself. Then fool yourself into greatness.

You don’t need a title to lead teams. Influence scales from the bottom. You have more leverage and autonomy than you realize, especially early in your career.


Stay in the U.S. Stay in College

Despite the noise, the U.S. is still the best launchpad for a tech career. It has the resources, companies, and a decent (but imperfect) merit system.

Despite losing face, colleges still provide value. Credentialism is fading, but community isn't. College is still the best place to meet smart people, get inspired, and build with peers. Don't waste it competing and just following lectures. Learn from each other. Take initiative. Start things.


Be a Jeep or a Ferrari. Not a Corolla.

Be entrepreneurial. Be effectual. Effectual thinking is messy yet powerful. It starts with what you have. You act, learn, improvise. You don't wait for perfect conditions; you shape them.

My friend Mahesh Balakrishnan put it best: on his dream team, he wants Jeeps or Ferraris. Jeeps go anywhere. No roads, no map. Just point them at a challenge. Ferraris go fast --but only with a good road. What he doesn't want is Corollas. They are slow and they still need roads.

So here's the corollary (yes, pun intended):

  1. Be entrepreneurial. Be a Jeep.  
  2. If you must follow plans, be a Ferrari. Specialized, fast, precise.
  3. Don't be a Corolla. In the age of AI, that gets automated.


Play the Long Game

Tech rewards leverage, not shortcuts. Optimize for momentum and stack useful skills, relationships, and systems. Compounding is the strongest force in your career.

Don’t chase trends. Understand them. Ride the ones that match your strengths. Learn to take a step back, and aim for depth, clarity, and direction.


Here is more unsolicited advice from me.

April 09, 2025

Percona Server for MySQL: Enhanced Encryption UDFs

In Percona Server for MySQL 8.0.41 / 8.4.4, we introduced several improvements in Encryption User-Defined Functions: added support for RSAES-OAEP (OAEP) padding for RSA encrypt / decrypt operations, added support for RSASSA-PSS (PSS) padding for RSA sign / verify operations, added a new encryption_udf.legacy_padding_scheme component system variable, and normalized character set support for all Encryption UDFs. PKCS1 […]

April 07, 2025

How to safely cancel a database query

Cancelling a query from a UI client is more nuanced than it might seem. Here's how we implemented safe KILL QUERY operations in Tinybird.

April 06, 2025

A case where SQL joins struggle but MongoDB documents shine

Claims such as "Joins are slow" or "Joins don't scale" often prompt me to showcase how efficiently rows can be joined in a SQL database (example). However, the user perception of slowness remains, and it's essential to listen to developers and understand their problems with joins.

Joining tables in relational databases can sometimes lead to suboptimal execution plans when data filtering occurs post-join. This limitation arises because indexes are efficient when selective predicates are on the same table. However, there is no multi-table index, at least for OLTP. Data warehouses in Oracle Databases can use bitmap join indexes, star transformation, and materialized views to overcome this, but they are not suited to OLTP workloads. Dynamic bitmap scans can combine multiple indexes, but at the price of more work and fewer possibilities. These suboptimal execution plans may be the reason why developers think that joins are slow, even if it is not the execution of the join itself that is slow.

Denormalization can help, but it undermines the advantages of normalization. In contrast, document databases like MongoDB utilize embedded documents to optimize complex queries with fewer joins, and multi-key composite indexes offer efficient access paths that cover all selective filters.

Here is an example in PostgreSQL and MongoDB.

PostgreSQL relational model

Relational databases normalize one-to-many relationships to separate tables. For instance, consider the relationship between orders and their corresponding order details, where each order can have multiple associated entries in the order details table.

CREATE TABLE orders(
  id BIGSERIAL PRIMARY KEY,
  country_id INT,
  created_at TIMESTAMPTZ DEFAULT clock_timestamp()
);

CREATE TABLE order_details (
  id BIGINT REFERENCES orders ON DELETE CASCADE,
  line INT,
  product_id BIGINT,
  quantity INT,
  PRIMARY KEY(id, line)
);

I insert some data, with a product distribution in which some products stop appearing in recent orders:

BEGIN TRANSACTION;

INSERT INTO orders(country_id)
 SELECT 10 * random() FROM generate_series(1,1000000);

INSERT INTO order_details (id, line, product_id, quantity)
 SELECT 
  id,
  generate_series(1,10),
  log(2,(1 + id * random())::int),
  100 * random()
  FROM orders;

COMMIT;

A relational data model does not depend on specific access patterns. Optimizing these patterns requires indexes. My primary keys defined indexes for navigating between orders and order details, but I have another use case.
To analyze orders by country, product, and date range, I create the following indexes:


CREATE INDEX ON orders (country_id, created_at DESC, id);

CREATE INDEX ON order_details (product_id, id);

To analyze the latest orders for a specific country and product, I use the following SQL query:


PREPARE query(int,int,int) AS
 SELECT id, created_at, product_id, quantity
 FROM orders
 JOIN order_details d USING(id)
 WHERE country_id=$1 AND product_id=$2
 ORDER BY created_at DESC LIMIT $3
;

I vacuum and analyze to get the best execution plan:

postgres=# VACUUM ANALYZE orders, order_details;
VACUUM

Ideally, such a query should read only the rows for one country and one product, and get them ordered by date to apply the Top-n ORDER BY LIMIT without having to read all rows and sort them.

postgres=# EXPLAIN (analyze, buffers, costs off)
           EXECUTE query(1, 15, 10)
;
                                                                                                                                                                                                 QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.031..0.110 rows=10 loops=1)
   Buffers: shared hit=132
   ->  Nested Loop (actual time=0.030..0.108 rows=10 loops=1)
         Buffers: shared hit=132
         ->  Index Only Scan using orders_country_id_created_at_id_idx on orders (actual time=0.011..0.023 rows=39 loops=1)
               Index Cond: (country_id = 1)
               Heap Fetches: 0
               Buffers: shared hit=5
         ->  Index Scan using order_details_product_id_id_idx on order_details d (actual time=0.002..0.002 rows=0 loops=39)
               Index Cond: ((product_id = 15) AND (id = orders.id))
               Buffers: shared hit=127
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.272 ms
 Execution Time: 0.127 ms
(15 rows)

The execution plan is effective, beginning with the index on orders.created_at to eliminate the need for sorting. To preserve the order and push down the join filter, it uses a nested loop join to retrieve the rows from the other table.
Since there is another filter on order_details.product_id, after the join it had to read more rows (rows=39) to obtain the final required rows (rows=10), and therefore run more loops. As my example is small, the consequence is minimal in terms of time, nested loops (loops=39), and buffers (shared hit=127), but it highlights the issue: it reads about four rows and ten pages for each row returned.

If I run the same query with another product that hasn't been ordered recently, it reads lots of orders before finding ten that include this product:

postgres=# EXPLAIN (analyze, buffers, costs off)
           EXECUTE query(1, 8, 10)
;
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=15.614..16.661 rows=10 loops=1)
   Buffers: shared hit=37582
   ->  Gather Merge (actual time=15.613..16.659 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=37582
         ->  Nested Loop (actual time=1.396..9.112 rows=7 loops=3)
               Buffers: shared hit=37582
               ->  Parallel Index Only Scan using orders_country_id_created_at_id_idx on orders (actual time=0.015..0.546 rows=4165 loops=3)
                     Index Cond: (country_id = 1)
                     Heap Fetches: 0
                     Buffers: shared hit=70
               ->  Index Scan using order_details_product_id_id_idx on order_details d (actual time=0.002..0.002 rows=0 loops=12494)
                     Index Cond: ((product_id = 8) AND (id = orders.id))
                     Buffers: shared hit=37512
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.272 ms
 Execution Time: 16.684 ms
(19 rows)

To get the 10-row result, this execution read 12495 rows across 3 parallel processes (rows=4165 loops=3), and 37582 pages in total, before it was able to find the Top-10 that satisfy all filters.

The problem is that the user doesn't understand why it can take longer, as it is the same query and returns the same number of rows. Moreover, reading so many unnecessary pages impacts other queries because it occupies space in the shared buffers.
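
That shared-buffers impact can be observed directly with the pg_buffercache extension; a minimal sketch, assuming the extension is available:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Which relations currently occupy shared buffers, counted in 8kB pages
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;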

When the query planner estimates that this is too much work, it stops trying to avoid a sort and switches to a hash join.


postgres=# EXPLAIN (analyze, buffers, costs off)
           EXECUTE query(1, 5, 10)
;
                                                                 QUERY PLAN                                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------                                     
 Limit (actual time=30.370..30.373 rows=10 loops=1)
   Buffers: shared hit=1882
   ->  Sort (actual time=30.369..30.371 rows=10 loops=1)
         Sort Key: orders.created_at DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=1882
         ->  Hash Join (actual time=28.466..30.324 rows=236 loops=1)
               Hash Cond: (d.id = orders.id)
               Buffers: shared hit=1882
               ->  Index Scan using order_details_product_id_id_idx on order_details d (actual time=0.013..1.434 rows=2311 loops=1)                                              
                     Index Cond: (product_id = 5)
                     Buffers: shared hit=1387
               ->  Hash (actual time=28.400..28.401 rows=99672 loops=1)
                     Buckets: 131072  Batches: 1  Memory Usage: 5697kB
                     Buffers: shared hit=495
                     ->  Index Only Scan using orders_country_id_created_at_id_idx on orders (actual time=0.010..13.136 rows=99672 loops=1)                                      
                           Index Cond: (country_id = 1)
                           Heap Fetches: 0
                           Buffers: shared hit=495
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.267 ms
 Execution Time: 30.415 ms
(23 rows)

This plan doesn't depend on the size of the result but must read too many rows (rows=2311 and rows=99672) before joining, filtering them (to rows=236), and sorting them. This is where it becomes a scalability problem: the response time depends on the size of the database rather than the result size. A query that is supposed to read orders from a small time window must read the whole history of orders for one country, and the whole history of details for one product.

Note that this example is the best case, where tables were freshly vacuumed, and Index Only Scan is optimal with Heap Fetches: 0. It will be more expensive on an active table.
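
A quick way to check how fresh that state is (a sketch; as the tables are modified and the visibility map becomes stale, heap fetches creep back in):

-- Dead tuples and last vacuum time for the two tables used above
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'order_details');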

MongoDB document model

MongoDB's document model allows embedding related data within a single collection, optimizing data locality in memory and disk.

Here is a collection that loads similar data to the previous sample, but with order details embedded in the orders document, like it is structured in a business document or an application object:

const bulkOps = [];
for (let i = 0; i < 1000000; i++) {
  const orderDetails = [];
  for (let line = 1; line <= 10; line++) {
    orderDetails.push({
      line: line,
      product_id: Math.floor(Math.log2(1 + i * Math.random())),
      quantity: Math.floor(100 * Math.random()),
    });
  }
  bulkOps.push({
    insertOne: {
      document: {
        country_id: Math.floor(10 * Math.random()),
        created_at: new Date(),
        order_details: orderDetails
      }
    }
  });
}
db.orders.bulkWrite(bulkOps);

One advantage of the document model is the ability to get an order with its details without any join:

test> db.orders.find().sort({created_at: -1}).limit(1);
[
  {
    _id: ObjectId('67f1a477aabaf2dad73f4791'),
    country_id: 3,
    created_at: ISODate('2025-04-05T21:45:21.546Z'),
    order_details: [
      { line: 1, product_id: 19, quantity: 40 },
      { line: 2, product_id: 18, quantity: 10 },
      { line: 3, product_id: 18, quantity: 75 },
      { line: 4, product_id: 18, quantity: 81 },
      { line: 5, product_id: 16, quantity: 66 },
      { line: 6, product_id: 14, quantity: 17 },
      { line: 7, product_id: 19, quantity: 82 },
      { line: 8, product_id: 19, quantity: 81 },
      { line: 9, product_id: 17, quantity: 56 },
      { line: 10, product_id: 19, quantity: 59 }
    ]
  }
]

Having all fields in one document allows creating a single index that covers all filters, and MongoDB supports multi-key indexes, which enables indexing fields in embedded subdocuments:

db.orders.createIndex(
 { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
);

The query to retrieve the last ten orders for one country and one product is simple, with no join:

db.orders.find(
  { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } }
).sort({ created_at: -1 }).limit(10);

Let's check the execution plan:

mdb> db.orders.find(
   { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } }
 ).sort({ created_at: -1 }).limit(10).explain(`executionStats`).executionStats
;
{
  executionSuccess: true,
  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 10,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 10,
    executionTimeMillisEstimate: 0,
    works: 11,
    advanced: 10,
    limitAmount: 10,
    inputStage: {
      stage: 'FETCH',
      filter: {
        order_details: { '$elemMatch': { product_id: { '$eq': 15 } } }
      },
      nReturned: 10,
      executionTimeMillisEstimate: 0,
      works: 10,
      advanced: 10,
      docsExamined: 10,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 10,
        executionTimeMillisEstimate: 0,
        works: 10,
        advanced: 10,
        keyPattern: {
          country_id: 1,
          'order_details.product_id': 1,
          created_at: -1
        },
        indexName: 'country_id_1_order_details.product_id_1_created_at_-1',
        isMultiKey: true,
        multiKeyPaths: {
          country_id: [],
          'order_details.product_id': [ 'order_details' ],
          created_at: []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          country_id: [ '[1, 1]' ],
          'order_details.product_id': [ '[15, 15]' ],
          created_at: [ '[MaxKey, MinKey]' ]
        },
        keysExamined: 10,
        seeks: 1,
        dupsTested: 10,
        dupsDropped: 0
      }
    }
  }
}

The plan shows lots of details, but the most important is:

  nReturned: 10,
  totalKeysExamined: 10,
  totalDocsExamined: 10,

To get the 10 rows for the result, MongoDB has read only 10 index entries and 10 documents. This is optimal, reading only what is necessary. The index scan is optimal as it contains the bounds for all equality filters and gets the rows ordered without the need for an additional sort:

        indexBounds: {
          country_id: [ '[1, 1]' ],
          'order_details.product_id': [ '[15, 15]' ],
          created_at: [ '[MaxKey, MinKey]' ]
        },

In addition to being fast, the performance is predictable because this execution plan will always be the same. This is visible with allPlansExecution:

mdb> db.orders.find(
   { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } }
 ).sort({ created_at: -1 }).limit(10).explain(`allPlansExecution`).queryPlanner
;
{
  namespace: 'test.orders',
  parsedQuery: {
    '$and': [
      {
        order_details: { '$elemMatch': { product_id: { '$eq': 42 } } }
      },
      { country_id: { '$eq': 1 } }
    ]
  },
  indexFilterSet: false,
  queryHash: '0DAE06A4',
  planCacheShapeHash: '0DAE06A4',
  planCacheKey: 'C3D96884',
  optimizationTimeMillis: 0,
  maxIndexedOrSolutionsReached: false,
  maxIndexedAndSolutionsReached: false,
  maxScansToExplodeReached: false,
  prunedSimilarIndexes: false,
  winningPlan: {
    isCached: false,
    stage: 'LIMIT',
    limitAmount: 10,
    inputStage: {
      stage: 'FETCH',
      filter: {
        order_details: { '$elemMatch': { product_id: { '$eq': 42 } } }
      },
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: {
          country_id: 1,
          'order_details.product_id': 1,
          created_at: -1
        },
        indexName: 'country_id_1_order_details.product_id_1_created_at_-1',
        isMultiKey: true,
        multiKeyPaths: {
          country_id: [],
          'order_details.product_id': [ 'order_details' ],
          created_at<... (truncated)
                                    


April 04, 2025