A curated list of database news from authoritative sources

April 17, 2025

April 16, 2025

How to Extend Percona Monitoring and Management to Add Logging Functionality

Evolution is one of the inherent traits of modern software. Many people reach out to product teams daily, asking to add more functionality to the software products they use and love. This is understandable: there will always be ways to make a product better by adding more features to the users’ delight so they can […]

April 15, 2025

PostgreSQL JSONB Indexing Limitations with B-Tree and GIN

In a previous post, I highlighted a significant limitation of relational databases: a One-to-Many relationship requires two tables, yet no index can optimize access with selective predicates on both of them, resulting in unnecessary rows that must be joined before elimination. A comment from @dogacel suggested continuing with PostgreSQL's JSON approach, so here it is.

SQL databases have evolved into general-purpose or converged databases capable of storing documents, like JSONB in PostgreSQL. However, they face challenges with indexing compared to document databases like MongoDB, which offer multi-key indexes for optimizing equality, sort, and range filtering. Let's explore JSONB documents and their indexing options.

Instead of an Order Details table, I embed the details into the Orders table:

create table orders(
 primary key(id)
 , id bigserial 
 , country_id int
 , created_at timestamptz default clock_timestamp()
 , details jsonb
);

I insert one million orders similar to the previous post:

insert into orders (country_id, details)
select
 (10 * random())::int as country_id,  -- generate a random country_id
 (
  select jsonb_agg(
   jsonb_build_object(
    'line', gs,
    'product_id', ((log(2, (1 + o.id * random())::numeric)::int)),
    'quantity', (100 * random())::int
   )
  )
  from generate_series(1, 10) as gs  -- create 10 lines per order
 ) as details
from generate_series(1, 1000000) as o(id);  -- generate 1,000,000 orders

Here is an example of an order with its details:

postgres=# select * from orders limit 1;
-[ RECORD 1 ]--------------------------------------------------------
id         | 1
country_id | 8
created_at | 2025-04-15 19:12:27.525731+00
details    | [
                {"line": 1,  "quantity": 6,   "product_id": 0},
                {"line": 2,  "quantity": 91,  "product_id": 1},
                {"line": 3,  "quantity": 66,  "product_id": 1},
                {"line": 4,  "quantity": 100, "product_id": 1},
                {"line": 5,  "quantity": 15,  "product_id": 0},
                {"line": 6,  "quantity": 80,  "product_id": 0},
                {"line": 7,  "quantity": 96,  "product_id": 1},
                {"line": 8,  "quantity": 9,   "product_id": 0},
                {"line": 9,  "quantity": 89,  "product_id": 0},
                {"line": 10, "quantity": 14,  "product_id": 0}
               ]

I want to run a simple query to retrieve the last ten orders from a specific country that contain a specific product:

SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;

Note the filters at different levels: on the order for the country and date, and on the details for the product. With a normalized model and two tables, it was impossible to have one index covering all those filters. With MongoDB, however, it was straightforward to create an index on "country_id", "order_details.product_id", and "created_at", so that only ten documents were read.
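For reference, here is a minimal sketch of such a multi-key index in MongoDB, assuming an orders collection where the line items are embedded in an "order_details" array (the collection and field names are assumptions based on the description above):

// compound index: equality on country_id, multi-key entries for the embedded array, then the sort key
db.orders.createIndex(
  { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
)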

Expression index on scalar values

The normalized columns in the row can easily be indexed with a B-Tree:

CREATE INDEX orders1 ON orders ("country_id", "created_at" desc)
;

A similar B-Tree index can be created on JSONB attributes as long as they are top-level, or in sub-documents without repeating groups (no JSON array). However, it is impossible to create such an index when the JSON path goes through an array, because in PostgreSQL a B-Tree index cannot hold multiple index keys for the same row.
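To illustrate, here is a sketch of such an index on a hypothetical top-level scalar attribute "status" (it does not exist in the table created above; it only shows the shape of a B-Tree expression index that would work):

-- B-Tree expression index on a hypothetical top-level scalar attribute of the JSONB column
CREATE INDEX orders_status ON orders ( (("details"->>'status')) )
;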

Of course, it is possible to create any index, as there's no schema declared for the JSONB content:

CREATE INDEX orders0 ON orders ( (("details"->>'product_id')::int) )
;

However, such an index will never find these documents, because "product_id" sits inside an array of line items while the index is built on a scalar expression.

A query on the content of the array cannot use the index:

postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
                              QUERY PLAN                               
---------------------------------------------------------------------
 Seq Scan on orders (actual time=38.631..1293.388 rows=446355 loops=1)
   Filter: (details @> '[{"product_id": 15}]'::jsonb)
   Rows Removed by Filter: 553645
   Buffers: shared hit=16128 read=94984

Generalized Inverted Index (GIN) for arrays

To index through an array, we need an inverted index, and PostgreSQL has GIN (Generalized Inverted Index):

CREATE INDEX orders2 ON orders using GIN ( details )
;

This index is used for the predicate on the array:

postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on orders (actual time=95.272..1196.876 rows=446355 loops=1)
   Recheck Cond: (details @> '[{"product_id": 15}]'::jsonb)
   Rows Removed by Index Recheck: 303886
   Heap Blocks: exact=42718 lossy=65930
   Buffers: shared hit=371 read=108648 written=3415
   ->  Bitmap Index Scan on orders2 (actual time=63.730..63.730 rows=499460 loops=1)
         Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
         Buffers: shared hit=371

I disabled Seq Scan because this predicate alone is not selective enough, returning half of the rows, but it proves that the GIN index can be used.
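As a side note, the GIN index could also be declared with the jsonb_path_ops operator class, which usually gives a smaller index that supports fewer operators, containment (@>) being one of them. A sketch, not used in the plans below:

-- alternative GIN operator class specialized for containment queries
CREATE INDEX orders2b ON orders USING GIN ( details jsonb_path_ops )
;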

However, when I run the full query with additional WHERE and ORDER BY LIMIT filters, which are covered by the B-Tree index, the GIN index is no longer used:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit (actual time=0.036..0.117 rows=10 loops=1)
   Buffers: shared hit=32
   ->  Index Scan using orders1 on orders (actual time=0.035..0.115 rows=10 loops=1)
         Index Cond: (country_id = 1)
         Filter: (details @> '[{"product_id": 15}]'::jsonb)
         Rows Removed by Filter: 38
         Buffers: shared hit=32

To understand the reason, I remove the LIMIT clause:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 -- LIMIT    10
;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Gather Merge (actual time=237.895..257.523 rows=44864 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=158 read=61738, temp read=4584 written=4592
   ->  Sort (actual time=222.683..226.142 rows=14955 loops=3)
         Sort Key: created_at DESC
         Sort Method: external merge  Disk: 12688kB
         Buffers: shared hit=158 read=61738, temp read=4584 written=4592
         Worker 0:  Sort Method: external merge  Disk: 11616kB
         Worker 1:  Sort Method: external merge  Disk: 12368kB
         ->  Parallel Bitmap Heap Scan on orders (actual time=74.822..211.433 rows=14955 loops=3)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 96641
               Heap Blocks: exact=9701 lossy=11292
               Buffers: shared hit=145 read=61733
               ->  BitmapAnd (actual time=78.189..78.190 rows=0 loops=1)
                     Buffers: shared hit=145 read=613
                     ->  Bitmap Index Scan on orders1 (actual time=11.496..11.497 rows=100362 loops=1)
                           Index Cond: (country_id = 1)
                           Buffers: shared read=387
                     ->  Bitmap Index Scan on orders2 (actual time=64.445..64.445 rows=499460 loops=1)
                           Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
                           Buffers: shared hit=145 read=226

PostgreSQL can utilize both indexes, but it requires a Bitmap Scan to combine them, which does not preserve the B-Tree index order. Consequently, it must perform a Sort operation that reads all rows before returning the first one, making it unsuitable for the LIMIT clause due to excessive row reading.

Composite B-Tree + GIN with extension

With the btree_gin extension available in contrib, I can create a single index that is supposed to combine the advantages of B-Tree and GIN indexes:

postgres=# CREATE EXTENSION BTREE_GIN;
CREATE EXTENSION

postgres=# CREATE INDEX orders3 ON orders
           using GIN (country_id , details, created_at);
CREATE INDEX

Running it without LIMIT shows that it doesn't solve the problem:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 -- LIMIT    10
;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Gather Merge (actual time=123.675..143.726 rows=44864 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=237 read=40117, temp read=4585 written=4594
   ->  Sort (actual time=109.979..113.574 rows=14955 loops=3)
         Sort Key: created_at DESC
         Sort Method: external merge  Disk: 12456kB
         Buffers: shared hit=237 read=40117, temp read=4585 written=4594
         Worker 0:  Sort Method: external merge  Disk: 11720kB
         Worker 1:  Sort Method: external merge  Disk: 12504kB
         ->  Parallel Bitmap Heap Scan on orders (actual time=18.096..98.799 rows=14955 loops=3)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 1760
               Heap Blocks: exact=13486
               Buffers: shared hit=226 read=40110
               ->  Bitmap Index Scan on orders3 (actual time=25.197..25.197 rows=50144 loops=1)
                     Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
                     Buffers: shared hit=226 read=251

Only one index was scanned, with all conditions in its Index Cond, but it is still a Bitmap Scan, which doesn't preserve the index key order, because that's how PostgreSQL inverted indexes work. A GIN index, even with the BTREE_GIN extension, is not equivalent to a MongoDB multi-key index, which keeps its index entries ordered.

Even if my new index has all columns, it is not used by the pagination query:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit (actual time=0.034..0.117 rows=10 loops=1)
   Buffers: shared hit=32
   ->  Index Scan using orders1 on orders (actual time=0.034..0.115 rows=10 loops=1)
         Index Cond: (country_id = 1)
         Filter: (details @> '[{"product_id": 15}]'::jsonb)
         Rows Removed by Filter: 38
         Buffers: shared hit=32


If I drop the B-Tree index, the btree_gin index is used, but it is not efficient, as it has to read and sort all rows before returning the ten rows of the result:

postgres=# drop index orders1;
DROP INDEX
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Limit (actual time=215.701..215.704 rows=10 loops=1)
   Buffers: shared hit=229 read=40107 written=5
   ->  Sort (actual time=215.700..215.701 rows=10 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 46kB
         Buffers: shared hit=229 read=40107 written=5
         ->  Bitmap Heap Scan on orders (actual time=31.466..204.149 rows=44864 loops=1)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 5280
               Heap Blocks: exact=39859
               Buffers: shared hit=229 read=40107 written=5
               ->  Bitmap Index Scan on orders3 (actual time=24.544..24.544 rows=50144 loops=1)
                     Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
                     Buffers: shared hit=229 read=248

Seeing all conditions in the Index Cond is not sufficient, especially with GIN indexes, because they return false positives that must be rechecked later. Even without an ORDER BY ... LIMIT, the scan itself reads more rows than necessary, as reported by Rows Removed by Index Recheck.
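Part of that overhead in the earlier plans comes from lossy heap blocks (Heap Blocks: ... lossy=...), which appear when the bitmap no longer fits in work_mem. Raising it for such queries keeps the bitmap exact (the value below is an arbitrary example, not a recommendation), but it does not remove the recheck that GIN containment queries need anyway:

-- per-session setting; a larger work_mem avoids lossy bitmap heap blocks
SET work_mem = '256MB';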

Conclusion

PostgreSQL's JSONB with GIN indexing allows for schema-on-read capabilities within a schema-on-write relational model. While it enables document storage, it does not convert PostgreSQL into a document database.
Although GIN indexes efficiently handle equality predicates on array items, they are less effective for range or sorting operations compared to MongoDB's multi-key indexes, which support equality, sorting, and range queries transparently. You may use JSON datatypes to provide more flexibility to the relational model, but SQL databases should be used as relational databases because their engine is optimized for it. Always check the execution plan to avoid surprises.

Percona Server for MySQL Data Masking Enhanced with Dictionary Term Cache

In Percona Server for MySQL 8.0.41 / 8.4.4, we significantly re-designed the Data Masking Component. In particular, we made the following changes: Changed the user on behalf of whom we execute internal queries for dictionary operations. Introduced an in-memory dictionary term cache that allows significant speed-up of dictionary operations. Introduced masking_dictionaries_flush() User Defined Function. Introduced […]

Common PostgreSQL Upgrade Errors and How to Avoid Them

This post was originally published in January 2024 and was updated in April 2025. Everyone usually welcomes an upgrade, whether it’s a new phone OS, a flight seat, or a home renovation. Upgrades typically mean improvement, which holds true for PostgreSQL. Each new version brings updates vital for optimal application performance, stability, and security. However, […]

April 14, 2025

MongoDB Pricing Explained: What You’re Really Paying For

If you’re using MongoDB, whether it’s the Community Edition, Enterprise Advanced, or Atlas, you’ve probably asked yourself at some point: what is this actually costing us? And the answer isn’t always easy to pin down. The reality is that there’s no one-size-fits-all price tag. Your infrastructure, usage patterns, and team setup all factor in. That’s […]

April 13, 2025

MongoDB with Search Indexes Queried as Kimball's Star Schema with Facts and Dimensions

  • Export From SQL to CSV
  • Import to MongoDB
  • Indexing the Star Schema
  • Querying the Star Schema - Star Transformation
  • MongoDB queries on a Star Schema
  • Conclusion


In document databases like MongoDB, data modeling is typically optimized for predefined access patterns, where each domain or service owns its database. This contrasts with the relational model, which often serves as a centralized, fully normalized database designed independently of any specific application.
However, most applications also require the ability to perform ad-hoc queries or real-time analytics, where the query patterns are not predetermined.
Here, it is the opposite: SQL often requires another database schema, whereas MongoDB offers the same dimensional model approach on the operational database.

In relational databases, a common practice is to create a separate analytics database, often in a dimensional model like the star schema popularized by Ralph Kimball. Here, normalized fact tables are surrounded by denormalized dimension tables. This setup is separate from the operational database and not updated in real-time by transactional workloads.

MongoDB allows for a different strategy. Thanks to features like MongoDB Atlas Search Indexes, real-time analytics can be performed directly on the operational database without the need for replication to a separate analytical database. Here is a simple example using the Oracle Sales History Sample Schema, which I used a lot in the past as an example of a Star Schema designed correctly with Bitmap Indexes on Foreign Keys to allow Star Transformation (discussed later).

Export From SQL to CSV

The simplest way to get the Sales History Sample Schema (SH) populated with data is to have an Autonomous Database on the Oracle Cloud Free Tier, where it is created by default.

I used SQLcl UNLOAD to get data into CSV files:

for i in CHANNELS COSTS COUNTRIES CUSTOMERS PRODUCTS PROMOTIONS SALES SUPPLEMENTARY_DEMOGRAPHICS TIMES 
do 
 echo "set loadformat csv"
 echo "alter session set nls_date_format = 'YYYY-MM-DD';"
 echo "unload SH.$i"
done > unload.sql

sqlcl @ unload.sql

This generated the following CSV files in the current directory:


$ wc -l *.csv | sort -n                                                                                                                                                              
       6 CHANNELS_DATA_TABLE.csv
      24 COUNTRIES_DATA_TABLE.csv
      73 PRODUCTS_DATA_TABLE.csv
     504 PROMOTIONS_DATA_TABLE.csv
    1827 TIMES_DATA_TABLE.csv
    4501 SUPPLEMENTARY_DEMOGRAPHICS_DATA_TABLE.csv
   55501 CUSTOMERS_DATA_TABLE.csv
   82113 COSTS_DATA_TABLE.csv
  918844 SALES_DATA_TABLE.csv

The CSV files contain a header with the column names used in the star schema:

Image from: https://docs.oracle.com/cd/B19306_01/server.102/b14198/graphics/comsc007.gif

Import to MongoDB

I started a local MongoDB Atlas:

atlas deployments setup  atlas --type local --port 27017 --force

I imported the CSV files as-is:

mongoimport -j 8 --type=csv --headerline --drop --file=CHANNELS_DATA_TABLE.csv --db=sh --collection=channels $@ &&                                                               
mongoimport -j 8 --type=csv --headerline --drop --file=COSTS_DATA_TABLE.csv --db=sh --collection=costs &&                                                                     
mongoimport -j 8 --type=csv --headerline --drop --file=COUNTRIES_DATA_TABLE.csv --db=sh --collection=countries &&                                                             
mongoimport -j 8 --type=csv --headerline --drop --file=CUSTOMERS_DATA_TABLE.csv --db=sh --collection=customers &&                                                             
mongoimport -j 8 --type=csv --headerline --drop --file=PRODUCTS_DATA_TABLE.csv --db=sh --collection=products &&                                                               
mongoimport -j 8 --type=csv --headerline --drop --file=PROMOTIONS_DATA_TABLE.csv --db=sh --collection=promotions &&                                                           
mongoimport -j 8 --type=csv --headerline --drop --file=SALES_DATA_TABLE.csv --db=sh --collection=sales &&
mongosh sh --eval "show collections"                         

It imports nine hundred thousand facts and the related dimensions in ten seconds.

I connect to the "sh" database and list an example of the "sales" documents:

mongosh
use sh;
db.sales.find().limit(2)

Indexing the Star Schema

In SQL databases, a single optimal index cannot be created for a Data Mart queried on multiple dimension combinations. The SH schema in Oracle Database utilizes one bitmap index for each foreign key in the fact table.
Queries read all the necessary indexes and combine their results with bitmap operations, which is ideal when individual predicates lack selectivity but their combination narrows the result to a small set of rows to read from the fact table.
Bitmap indexes are effective for data warehouses but not for operational databases, as they can lead to index fragmentation and locking issues in OLTP systems. For real-time analytics, Oracle offers an In-Memory Column Store, which serves as an analytic-optimized cache for transactional databases.
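For context, this is the kind of DDL the SH schema relies on, sketched here as an assumption (the actual schema declares one such index per foreign key of the fact table; the index names below are illustrative):

-- Oracle bitmap indexes on the fact table foreign keys (illustrative names)
CREATE BITMAP INDEX sales_prod_bix ON sales (prod_id);
CREATE BITMAP INDEX sales_cust_bix ON sales (cust_id);
CREATE BITMAP INDEX sales_channel_bix ON sales (channel_id);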

MongoDB has a different solution: it can serve ad-hoc queries from a single Search Index per fact table. Atlas Search indexes run within the operational database and are asynchronously updated from the change stream, allowing near real-time analytics on operational databases.

I declared the following Search Index on the dimension key fields in the fact table:

db.sales.createSearchIndex(
   "SalesDimension",
   {
     "mappings": {
       "fields": {
         "PROD_ID":    { "type": "number" },
         "CUST_ID":    { "type": "number" },
         "TIME_ID":    { "type": "token" },
         "CHANNEL_ID": { "type": "number" },
         "PROMO_ID":   { "type": "number" },
       }
     }
   }
)

I imported data directly from the CSV without defining datatypes. The TIME_ID is a character string in the YYYY-MM-DD format, so I declared it as a token. The other fields are numbers. All fields serve as foreign keys referencing dimension collections. Their values are less important than having a natural order, which may facilitate range queries.

This example can be run on MongoDB Atlas free tier (M0), which allows three Search Indexes.

The total size of the SH schema, without the search index, is 60MB:

Logical Data Size: 172.12 MB
Storage Size: 31.49 MB
Index Size: 30.32 MB
Total Collections: 9

which is mostly the sales collection:

Storage Size: 20.83MB
Logical Data Size: 124.4MB
Total Documents: 918843
Indexes Total Size: 26.01MB

The Search Index takes approximately the same size.

This demo uses a small database, but it illustrates the value of the approach when applied to larger databases. It is recommended to compare the size of the search index with the instance memory.

Querying the Star Schema - Star Transformation

The star schema's primary advantage lies in its fact table, which contains numerous rows with limited size, holding only dimension foreign keys and measures. This large row count is offset by the smaller row size. Ideally, there is one fact table to minimize joins across many rows.
Dimension tables are smaller yet can accommodate larger rows with multiple attributes for user-defined filtering and displaying long descriptions. In SQL, queries join the fact table to dimensions in the FROM clause, while predicates are specified in the WHERE clause. The RDBMS optimizer may transform the query to leverage the star schema efficiently, as I detailed in an old article for the SOUG newsletter:

Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option (PDF on slideshare.net)

The WHERE clause predicates are applied to small dimensions to obtain the list of dimension keys. This list is then used to access the fact table through bitmap indexes, retrieving only the necessary rows. Finally, this result is joined back with dimension tables for additional projections. This transformation, available in commercial RDBMS like Oracle Database Enterprise Edition, optimizes filtering to minimize unnecessary joins. It is not available in Oracle Standard Edition or PostgreSQL.

Whether or not this transformation is implemented, an SQL query uses joins between the fact table and the dimensions. For example, the following query retrieves the sales transactions from 2001 that are associated with high-cost promotions, involve customers from California, and occur through the Direct Sales channel, specifically focusing on Saturdays:

SELECT
  s.PROMO_ID, s.CUST_ID, s.CHANNEL_ID, s.TIME_ID, s.PROD_ID,
  p.PROD_NAME, p.PROD_LIST_PRICE,
  s.quantity_sold, s.amount_sold
FROM  sales s
 JOIN products p ON s.PROD_ID = p.PROD_ID
 JOIN promotions pr ON s.PROMO_ID = pr.PROMO_ID
 JOIN customers c ON s.CUST_ID = c.CUST_ID
 JOIN channels ch ON s.CHANNEL_ID = ch.CHANNEL_ID
 JOIN times t ON s.TIME_ID = t.TIME_ID
WHERE pr.PROMO_COST > 10000
  AND c.CUST_STATE_PROVINCE = 'CA'
  AND ch.CHANNEL_DESC = 'Direct Sales'
  AND t.FISCAL_YEAR = 2001
  AND t.DAY_NAME = 'Saturday'
  AND s.TIME_ID > date '2001-01-01'
;

SQL operates independently of the physical data model. The query planner must estimate the optimal order for joining tables and selecting indexes, relying on complex combinations of predicates to assess cardinalities.

MongoDB queries on a Star Schema

MongoDB data models in operational databases resemble a dimensional schema, with a large collection that contains all business transaction information, like a fact table, referencing smaller lookup collections similar to dimension tables.
MongoDB, designed for developers, offers more control over data access and allows the Star Transformation approach to be integrated into application code:

  1. get an array of references from each dimension where the filters are applied on many attributes
  2. query the fact table with an aggregation pipeline using the search index to find the documents having those references
  3. lookup to the dimensions from which more fields are needed for calculations, aggregations or projections

The bitmap step is unnecessary because the search indexes, which are based on Apache Lucene (with boolean queries), can effectively combine multiple filters with the compound operator. The dimension keys gathered as an array are simple to use with the in operator.

Here is an example to retrieve sales transactions from 2001 (TIME_ID: { $gt: "2001-01-01" }) that are associated with high-cost promotions (PROMO_COST: { $gt: 10000 }), involve customers from California ({ CUST_STATE_PROVINCE: 'CA' }), and occur through the Direct Sales channel ({ CHANNEL_DESC: 'Direct Sales' }), specifically focusing on Saturdays ({ FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' }). One criterion alone is not highly selective, but their combination is.

For the time range, as the dimension key holds all information with its YYYY-MM-DD format, I don't need to read from the dimension. For the others, I query the dimension collections to get an array of keys that verify my conditions on their attributes:

const promoIds = db.promotions.find( 
 {PROMO_COST: { $gt: 10000 }} 
).toArray().map(doc => doc.PROMO_ID);

const custIds  = db.customers.find( 
  { CUST_STATE_PROVINCE: 'CA' }      
).toArray().map(doc => doc.CUST_ID);

const channelIds  = db.channels.find( 
  { CHANNEL_DESC: 'Direct Sales' } 
).toArray().map(doc => doc.CHANNEL_ID);

const timeIds  = db.times.find( 
 { FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' } 
).toArray().map(doc => doc.TIME_ID);

For the dimensions that are small, a COLLSCAN is fast, and larger dimensions can be indexed. Those collections are rarely updated and can have many indexes.
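For example, a regular index that would serve the customer filter above could be sketched as follows (whether it is worth creating depends on the dimension size; the field name comes from the CSV header):

// plain index on the dimension attribute used for filtering
db.customers.createIndex({ "CUST_STATE_PROVINCE": 1 })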

Don't worry about long lists; the search indexes can handle that.

The next step is an aggregation pipeline that uses those lists, and adds the time range, in a compound filter:

db.sales.aggregate([
  {
    "$search": {
      "index": "SalesDimension",
      "compound": {
        "must": [
          { "in": { "path": "PROMO_ID". , "value": promoIds   } },
          { "in": { "path": "CUST_ID".  , "value": custIds    } },
          { "in": { "path": "CHANNEL_ID", "value": channelIds } },
          { "in": { "path": "TIME_ID".  , "value": timeIds    } },
          { "range": { "path": "TIME_ID", "gt": "2001-01-01"  } }
        ]
      }
    }
  },
  { "$sort": { "TIME_ID": -1 } },
  { "$limit": 3 },
  {
    "$lookup": {
      "from":         "products",
      "localField":   "PROD_ID",
      "foreignField": "PROD_ID",
      "as":           "product_info"
    }
  },
  {
    "$unwind": "$product_info"
  },
  {
    "$project": {
      "PROMO_ID": 1,
      "CUST_ID": 1,
      "CHANNEL_ID": 1,
      "TIME_ID": 1,
      "PROD_ID": 1,
      "quantity_sold": 1,
      "amount_sold": 1,
      "product_info.PROD_NAME": 1,
      "product_info.PROD_LIST_PRICE": 1
    }
  }
]);

If you are new to MongoDB: an aggregation pipeline is an array of operations applied to a collection of documents. You use the same approach in SQL when a query with many subqueries becomes hard to read, evolve, and debug: you refactor it with a WITH clause and split the logical steps into Common Table Expressions (CTEs). An aggregation pipeline follows the same principle.

The search operation is the first step of the aggregation pipeline, to efficiently retrieve only the minimal set of documents required by the query; further processing can then be added.
I've added a sort and a limit here to display only the last three sales, plus a lookup on the product dimension and an unwind to project only the product name and list price.

The result is a collection of documents ready to return to the user, delivered in 8 milliseconds because all filtering was resolved before fetching the documents.

This post aims to demonstrate the search component of the analytic query by reading only the necessary documents from the fact collection. Further analytics can be appended to the aggregation pipeline, as sketched below. The data model can be improved (using better datatypes, keys as "_id", and index clustering), but the goal was to show how we can run efficient analytic queries without changing the schema.
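As a sketch of what "further analytics" could look like (reusing the key arrays and compound filter from above; this is not part of the original post), the sort/limit/lookup stages could be replaced by a $group that aggregates the filtered facts per channel:

db.sales.aggregate([
  {
    "$search": {
      "index": "SalesDimension",
      "compound": {
        "must": [
          { "in": { "path": "PROMO_ID",   "value": promoIds   } },
          { "in": { "path": "CUST_ID",    "value": custIds    } },
          { "in": { "path": "CHANNEL_ID", "value": channelIds } },
          { "in": { "path": "TIME_ID",    "value": timeIds    } },
          { "range": { "path": "TIME_ID", "gt": "2001-01-01"  } }
        ]
      }
    }
  },
  // aggregate only the documents selected by the search index
  { "$group": {
      "_id": "$CHANNEL_ID",
      "total_amount": { "$sum": "$amount_sold" },
      "sales_count":  { "$sum": 1 }
  } },
  { "$sort": { "total_amount": -1 } }
]);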

Conclusion

Unlike SQL star schemas, which necessitate Extract-Load-Transform (ELT) processes to another database, MongoDB’s document model allows business transactions, including embedded or referenced data, to be stored in a single collection in the operational database, effectively addressing multi-granularity facts without complex joins.

To avoid embedding excessive information from dimension attributes into large collections, the Star Transformation approach can be used to generate a list of keys from lookup collections with applicable filters, similar to dimension tables. Atlas Search Indexes combine these keys to retrieve necessary documents from the facts collection for further analysis in the aggregation pipeline.

With MongoDB, if you modeled your data properly for domain use cases and later encounter new query requirements needing scattered document aggregation, consider whether a search index can address these needs before changing the model or streaming to another database.

April 12, 2025

Battle of the Mallocators

If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.

I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.

There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details of his setup and I failed to reproduce his result on mine. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem, just an artifact of jemalloc that can be confusing. The RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.

tl;dr

  • For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
  • With respect to peak RSS
    • For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
    • For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc, and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
  • For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
    • For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
    • For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc

Prior art

I have several blog posts on using jemalloc with MyRocks.

  • October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
  • April 2017 - Performance for large, concurrent allocations
  • April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
  • August 2023 - RocksDB and glibc malloc
  • September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS) 
  • September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
  • October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0

Builds, configuration and hardware

I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.

The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

For malloc the server uses:
  • glibc
    • version 2.35-0ubuntu3.9
  • tcmalloc
    • provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
  • jemalloc
    • provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf

The configuration files are here for InnoDB and for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM, so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using between 30G and 40G for the buffer pool -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement.

The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.

Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache), but I don't do that here because I want MyRocks to do IO, as allocations per storage read create much drama for the allocator.

The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256

Peak VSZ and RSS

The tables below show the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.

Peak values for InnoDB with 80G buffer pool
alloc           VSZ (GB)   RSS (GB)   RSS/80G
glibc           88.2       86.5       1.08
tcmalloc        88.1       85.3       1.06
jemalloc        91.5       87.0       1.08

Peak values for MyRocks with 10G buffer pool
alloc           VSZ (GB)   RSS (GB)   RSS/10G
glibc           46.1       36.2       3.62
tcmalloc        15.3       13.1       1.31
jemalloc        45.6       12.2       1.22

Performance: InnoDB

From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.

The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches /operation (cs/o) are much larger for jemalloc than for tcmalloc.

These results use the relative QPS, defined below, where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0, QPS is better with that allocator than with glibc malloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1 col-2
0.99 1.02 hot-points_range=100
1.05 1.04 point-query_range=100
0.96 0.99 points-covered-pk_range=100
0.98 0.99 points-covered-si_range=100
0.96 0.99 points-notcovered-pk_range=100
0.97 0.98 points-notcovered-si_range=100
0.97 1.00 random-points_range=1000
0.95 0.99 random-points_range=100
0.99 0.99 random-points_range=10
1.04 1.03 range-covered-pk_range=100
1.05 1.07 range-covered-si_range=100
1.04 1.03 range-notcovered-pk_range=100
0.98 1.00 range-notcovered-si_range=100
1.02 1.02 read-only-count_range=1000
1.05 1.07 read-only-distinct_range=1000
1.07 1.12 read-only-order_range=1000
1.28 1.09 read-only_range=10000
1.03 1.05 read-only_range=100
1.05 1.08 read-only_range=10
1.08 1.07 read-only-simple_range=1000
1.04 1.03 read-only-sum_range=1000
1.02 1.02 scan_range=100
1.01 1.00 delete_range=100
1.03 1.01 insert_range=100
1.02 1.02 read-write_range=100
1.03 1.03 read-write_range=10
1.01 1.02 update-index_range=100
1.15 0.98 update-inlist_range=100
1.06 0.99 update-nonindex_range=100
1.03 1.03 update-one_range=100
1.02 1.01 update-zipf_range=100
1.18 1.05 write-only_range=10000

Performance: MyRocks

From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.

The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches /operation (cs/o) are both much larger for tcmalloc.

These results use the relative QPS, defined below, where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0, QPS is better with that allocator than with glibc malloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1 col-2
0.68 1.00 hot-points_range=100
1.04 1.04 point-query_range=100
1.09 1.09 points-covered-pk_range=100
1.00 1.09 points-covered-si_range=100
1.09 1.09 points-notcovered-pk_range=100
1.10 1.12 points-notcovered-si_range=100
1.08 1.08 random-points_range=1000
1.09 1.09 random-points_range=100
1.05 1.10 random-points_range=10
0.99 1.07 range-covered-pk_range=100
1.01 1.03 range-covered-si_range=100
1.05 1.09 range-notcovered-pk_range=100
1.10 1.09 range-notcovered-si_range=100
1.07 1.05 read-only-count_range=1000
1.00 1.00 read-only-distinct_range=1000
0.98 1.04 read-only-order_range=1000
1.03 1.03 read-only_range=10000
0.96 1.03 read-only_range=100
1.02 1.04 read-only_range=10
0.98 1.07 read-only-simple_range=1000
1.07 1.09 read-only-sum_range=1000
1.02 1.02 scan_range=100
1.05 1.03 delete_range=100
1.11 1.07 insert_range=100
0.96 0.97 read-write_range=100
0.94 0.95 read-write_range=10
1.08 1.04 update-index_range=100
1.08 1.07 update-inlist_range=100
1.09 1.04 update-nonindex_range=100
1.04 1.04 update-one_range=100
1.07 1.04 update-zipf_range=100
1.03 1.02 write-only_range=10000

April 11, 2025

Oracle Multi-Value Index and ORDER BY Pagination queries

In the previous post, I highlighted a case where relational SQL databases struggle with optimizing index access with joins. In contrast, a document model using MongoDB's multi-key indexes efficiently retrieves only the necessary data for the results.

SQL databases have implemented document datatypes, storing them in binary JSON columns in SQL tables, but that's not enough to compete as a document database. It's essential to look at the JSON path indexing possibilities.
A JSON document may include nested sub-documents, and you must index paths that may go through arrays. Oracle 21c introduced multi-value indexes for JSON. Let's test it with the same example from the previous post where MongoDB reads only the necessary index entries, filtering upfront thanks to embedding a One-to-Many instead of joining.

Example

As I demonstrated this on MongoDB, I used the MongoDB-compatible API of Oracle 23c to keep the same syntax, which saves me from migrating to the SQL/JSON syntax:

const bulkOps = [];
for (let i = 0; i < 1000000; i++) {
  const orderDetails = [];
  for (let line = 1; line <= 10; line++) {
    orderDetails.push({
      line: line,
      product_id: Math.floor(Math.log2(1 + i * Math.random())),
      quantity: Math.floor(100 * Math.random()),
    });
  }
  bulkOps.push({
    insertOne: {
      document: {
        country_id: Math.floor(10 * Math.random()),
        created_at: new Date(),
        order_details: orderDetails
      }
    }
  });
}
db.orders.bulkWrite(bulkOps);

db.orders.createIndex(
 { "country_id": 1, "order_details .product_id": 1, "created_at": -1 }
);

The MongoDB emulation on Oracle Database has converted this to the following SQL table with a multi-value index:

CREATE JSON COLLECTION TABLE "ORA"."orders" 
   ( "SYS_NC00005$" RAW(4000) GENERATED ALWAYS AS ( JSON_MKMVI(JSON_TABLE( "DATA", '$' PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH '$."country_id"[*]' COLUMNS( "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH '$."order_details"[*]' COLUMNS( NESTED PATH '$."product_id"[*]' COLUMNS( "K1" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) , NESTED PATH '$."created_at"[*]' COLUMNS( "K2" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH  DESC ) ) )  AS "K0","K1","K2" DESC )) VIRTUAL , 
    "SYS_NC00006$" RAW(4000) GENERATED ALWAYS AS (JSON_QUERY("DATA" FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)) VIRTUAL , 
    "SYS_NC00007$" RAW(6001) GENERATED ALWAYS AS (JSON_QUERY("DATA" FORMAT OSON , '$."created_at"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE DESC )) VIRTUAL 
   )  DEFAULT COLLATION "USING_NLS_COMP" ;

CREATE MULTIVALUE INDEX "ORA"."$ora:orders.country_id_1_order_details.product_id_1_created_at_-1" ON "ORA"."orders" (JSON_MKMVI(JSON_TABLE( "DATA", '$' PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH '$."country_id"[*]' COLUMNS( "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH '$."order_details"[*]' COLUMNS( NESTED PATH '$."product_id"[*]' COLUMNS( "K1" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) , NESTED PATH '$."created_at"[*]' COLUMNS( "K2" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH  DESC ) ) )  AS "K0","K1","K2" DESC ), JSON_QUERY("DATA" FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE), JSON_QUERY("DATA" FORMAT OSON , '$."created_at"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE DESC ) DESC) 
  ;

The execution plan provided by the translation layer shows some SQL, but not the execution statistics:


ora> db.orders.find( { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } } ).sort({ created_at: -1 }).limit(10).explain(`executionStatistics`)

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'ora.orders',
    indexFilterSet: false,
    parsedQuery: {
      '$query': {
        '$and': [
          { country_id: { '$numberOnly': 1 } },
          { 'order_details[*]': { product_id: { '$numberOnly': 15 } } }
        ]
      },
      '$orderby': {
        '$fields': [ { path: 'created_at', order: 'desc', sortByMinMax: true } ],
        '$lax': true
      }
    },
    rewrittenQuery: {
      '$and': [
        {
          '$query': {
            '$and': [
              { country_id: { '$numberOnly': 1 } },
              {
                'order_details[*]': { product_id: { '$numberOnly': 15 } }
              }
            ]
          }
        },
        {
          '$orderby': {
            '$fields': [
              { path: 'created_at', order: 'desc', sortByMinMax: true }
            ],
            '$lax': true
          }
        }
      ]
    },
    winningPlan: {
      stage: 'SELECT STATEMENT',
      inputStage: {
        stage: 'COUNT',
        options: 'STOPKEY',
        columns: '"from$_subquery$_002"."DATA"[JSON,8200], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]',
        filterType: 'filter',
        filter: 'ROWNUM<=10',
        inputStage: {
          stage: 'VIEW',
          columns: '"from$_subquery$_002"."DATA"[JSON,8200], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]',
          inputStage: {
            stage: 'SORT',
            options: 'ORDER BY STOPKEY',
            columns: `(#keys=1) JSON_VALUE(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$.created_at[*].max()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR TYPE(LAX) ) FORMAT OSON , '$' RETURNING ANY ORA_RAWCOMPARE(32767) ERROR ON ERROR TYPE(LAX) )[32767], "DATA" /*+ LOB_BY_VALUE */ [JSON,8200], JSON_VALUE("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$._id' RETURNING ANY ORA_RAWCOMPARE(2000) NO ARRAY ERROR ON ERROR TYPE(LAX) )[RAW,2000], "ETAG"[RAW,16]`,
            filterType: 'filter',
            filter: 'ROWNUM<=10',
            path: "$.created_at[*].max()'",
            inputStage: {
              stage: 'TABLE ACCESS',
              options: 'BY INDEX ROWID BATCHED',
              source: 'orders',
              columns: `"DATA" /*+ LOB_BY_VALUE */ [JSON,8200], JSON_VALUE("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$._id' RETURNING ANY ORA_RAWCOMPARE(2000) NO ARRAY ERROR ON ERROR TYPE(LAX) )[RAW,2000], "ETAG"[RAW,16]`,
              path: "$._id'",
              inputStage: {
                stage: 'HASH',
                options: 'UNIQUE',
                columns: '(#keys=2) "orders".ROWID[ROWID,10], SYSVARCOL[8]',
                inputStage: {
                  stage: 'INDEX',
                  options: 'RANGE SCAN (MULTI VALUE)',
                  source: '$ora:orders.country_id_1_order_details.product_id_1_created_at_-1',
                  columns: `"orders".ROWID[ROWID,10], JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."country_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)[RAW,4000], JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)[RAW,4000], SYSVARCOL[8]`,
                  filterType: 'access',
                  filter: `JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."country_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)=SYS_CONS_ANY_SCALAR(:1, 3) AND JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE)=SYS_CONS_ANY_SCALAR(:2, 3)`
                }
              }
            }
          }
        }
      }
    },
    rejectPlans: []
  },
  serverInfo: { host: 'localhost', port: 27017, version: '4.2.14' },
  ok: 1
}
ora> 

However, it's the SQL engine behind it, and I captured the SQL statement from GV$SQL:


SELECT /*+ FIRST_ROWS(10) monitor */
    "DATA",
    RAWTOHEX("RESID"),
    "ETAG"
FROM
    "ORA"."orders"
WHERE
    JSON_EXISTS ( "DATA", '$?( 
        (@.country_id.numberOnly() == $B0)
        &&
        ( exists(@.order_details[*]?( (@.product_id.numberOnly() == $B1) )) ) 
    )'
    PASSING :1 AS "B0",
    :2 AS "B1" TYPE ( STRICT ) )
ORDER BY
    JSON_QUERY("DATA", '$.created_at[*].max()') DESC NULLS LAST
FETCH NEXT 10 ROWS ONLY

I have added the monitor hint to get the SQL execution plan with execution statistics.

Unlike MongoDB, which reads only ten documents for such a .find().sort().limit(10) query, Oracle's multi-value index had to read all index entries (53K rows) and documents, and sort them, before returning the Top-10. The index was used for filtering on the WHERE clause, but not for ordering on the ORDER BY ... FETCH NEXT 10 ROWS ONLY, despite having "created_at" in the key next to the equality filter columns.

Possible explanation

I tried to understand if it is a limitation of Oracle multi-value indexes, or the query planner. The internals of Oracle multi-value indexes are not documented, but there's the patent that describes the idea: https://patents.google.com/patent/US11640380B2

My understanding from it is that the following document:

{
  "_id": ObjectId('67f1a477aabaf2dad73f4791'),
  "country_id": 1,
  "created_at": ISODate('2025-04-05T21:45:21.546Z'),
  "order_details": [
    { "line": 1, "product_id": 19, "quantity": 40 },
    { "line": 2, "product_id": 15, "quantity": 10 },
    { "line": 3, "product_id": 18, "quantity": 75 },
    { "line": 4, "product_id": 15, "quantity": 50 }
  ]
}

with the following index:

db.orders.createIndex(
 { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
);

results in the following index entries where # is the ordinal position in the "order_details" array:

+------------+--------------------+---------------------------+
| country_id | order_details      | created_at                |
|            +---------------+----+                           |
|            | .product_id   | #  |                           |
+------------+---------------+----+---------------------------+
| 1          | 19            | 0  | 2025-04-05T21:45:21.546Z  |
| 1          | 15            | 1  | 2025-04-05T21:45:21.546Z  |
| 1          | 18            | 2  | 2025-04-05T21:45:21.546Z  |
| 1          | 15            | 3  | 2025-04-05T21:45:21.546Z  |
+------------+---------------+----+---------------------------+

By applying an equality predicate on "country_id" and "product_id," and considering that "created_at" is at the same level and not nested in "order_details," the optimizer could potentially recognize that the entries are sorted by "created_at". Note that it is possible that the ordinal position is not present as I see no FOR ORDINALITY clause in the CREATE MULTIVALUE INDEX statement.

The index entries are ordered, and a TABLE ACCESS BY INDEX ROWID that is not BATCHED (to reduce scattered reads) should preserve that order, so a blocking sort operation could be avoided. This could enable the query to read only the necessary documents, but it is not what I observe in the execution plan. A HASH UNIQUE operation breaks the ordering in between. It is true that multiple index entries may be returned for the same document and we want each document only once, but I would have expected a SORT UNIQUE NOSORT to preserve the ordering.

Note that all of this is guesswork: I ran this on the autonomous managed service, where I do not have access to the datafiles to inspect the internals of the multi-value index.

Conclusion

In exploring Oracle's multi-value indexes for filtering and sorting nested JSON data, a key performance difference emerges compared to MongoDB. Oracle's indexes handle complex JSON structures but struggle with key ordering, requiring an additional blocking sort, which is problematic for pagination queries. This was tested on Oracle 23ai (23.7.0.25.03), which reports a MongoDB version of 4.2.14, and may change in the future since the multi-value indexes apparently store their entries in order.

Thus, when evaluating a SQL database to be used as a document database, it's vital to consider both efficient JSON storage and indexing options. A comment on the previous post shows that it's the same with PostgreSQL: GIN indexes, required for JSON paths through arrays, do not return entries in order.