a curated list of database news from authoritative sources

April 19, 2025

Battle of the Mallocators: part 2

This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:

  • MALLOC_ARENA_MAX with glibc malloc
    • see here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (8 X number_of_CPU_cores) so I tested it with 1, 8, 48 and 96 arenas.
  • compiling RocksDB and MyRocks with jemalloc specific code enabled
    • In my previous results I just set malloc-lib in my.cnf which uses LD_LIBRARY_PATH to link with your favorite malloc library implementation.
tl;dr: jemalloc
  • For mysqld with jemalloc enabled via malloc-lib (LD_LIBRARY_PATH) versus mysqld with jemalloc specific code enabled
    • performance, VSZ and RSS were similar
  • With rocksdb_cache_dump=0 set for the binary that has jemalloc-specific code enabled
    • performance is slightly better (excluding the outlier, the benefit is up to 3%)
    • peak VSZ is cut in half
    • peak RSS is reduced by ~9%
tl;dr: glibc malloc on a 48-core server
  • With 1 arena performance is lousy but the RSS bloat is mostly solved
  • With 8, 48 or 96 arenas the RSS bloat is still there
  • With 48 arenas there are still significant (5% to 10%) performance drops
  • With 96 arenas the performance drop was mostly ~2%
Building MyRocks with jemalloc support

This was harder than I expected. The first step was easy -- I added these to the CMake command line; the first is for MyRocks and the second is for RocksDB. When the first is set, HAVE_JEMALLOC is defined in config.h. When the second is set, ROCKSDB_JEMALLOC is defined on the compiler command line.

  -DHAVE_JEMALLOC=1
  -DWITH_JEMALLOC=1
The hard part is that there were linker errors for unresolved symbols -- the open-source build was broken. The fix that worked for me is here. I removed libunwind.so and added libjemalloc.so in its place.
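For context, here is a minimal sketch of how the flags might be passed on a full CMake invocation -- the source path, build type and all other options are assumptions; only the two jemalloc flags come from this post:

  cmake /path/to/fb-mysql-8.0.32 \
    -DCMAKE_BUILD_TYPE=Release \
    -DHAVE_JEMALLOC=1 \
    -DWITH_JEMALLOC=1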

Running mysqld with MALLOC_ARENA_MAX

I wasn't sure if it was sufficient for me to set an environment variable when invoking mysqld_safe, so I just edited the mysqld_safe script to do that for me:

182a183,184
>   cmd="MALLOC_ARENA_MAX=1 $cmd"
>   echo Run :: $cmd
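An alternative I did not verify (hence the script edit above) is to export the variable in the shell that starts mysqld_safe, since exported variables are normally inherited by the server process; on systemd-managed installs the equivalent would be an Environment= line in an override unit. A hedged sketch:

  # hypothetical alternative -- untested here, the script edit above is what I used
  export MALLOC_ARENA_MAX=8
  bin/mysqld_safe --defaults-file=/etc/my.cnf &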

Results: jemalloc

The jemalloc specific code in MyRocks and RocksDB is useful but most of it is not there to boost performance. The jemalloc specific code most likely to boost performance is here in MyRocks and is enabled when rocksdb_cache_dump=0 is added to my.cnf.
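For reference, a minimal my.cnf sketch combining the two settings discussed in this post -- the malloc-lib line is the method used for the base case, rocksdb_cache_dump=0 is what col-2 below adds, and the jemalloc path is an assumption to adjust for your system:

  [mysqld_safe]
  malloc-lib = /usr/lib64/libjemalloc.so.2

  [mysqld]
  rocksdb_cache_dump = 0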

Results are here for 3 setups:
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
    • This is the base case in the table below
    • this is what I used in my previous post and jemalloc is enabled via setting malloc-lib in my.cnf which uses LD_LIBRARY_PATH
  • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    • This is col-1 in the table below
    • MySQL with jemalloc specific code enabled at compile time
  • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
    • This is col-2 in the table below
    • MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf
These results use the relative QPS, which here is:
(QPS for the binary in the column) / (QPS for the base case)
When this value is larger than 1.0 then QPS is better than the base case.
From the results below:
  • results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
  • results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
col-2 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128

col-1   col-2
0.92    1.40    hot-points_range=100
1.00    1.01    point-query_range=100
1.01    1.02    points-covered-pk_range=100
0.94    1.03    points-covered-si_range=100
1.01    1.02    points-notcovered-pk_range=100
0.98    1.02    points-notcovered-si_range=100
1.01    1.03    random-points_range=1000
1.01    1.02    random-points_range=100
0.99    1.00    random-points_range=10
0.98    1.00    range-covered-pk_range=100
0.96    0.97    range-covered-si_range=100
0.98    0.98    range-notcovered-pk_range=100
1.00    1.02    range-notcovered-si_range=100
0.98    1.00    read-only-count_range=1000
1.01    1.01    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
0.99    0.99    read-only_range=100
0.99    1.00    read-only_range=10
0.98    0.99    read-only-simple_range=1000
0.99    0.99    read-only-sum_range=1000
0.98    0.98    scan_range=100
1.01    1.02    delete_range=100
1.01    1.03    insert_range=100
0.99    1.01    read-write_range=100
1.00    1.01    read-write_range=10
1.00    1.02    update-index_range=100
1.02    1.02    update-inlist_range=100
1.01    1.03    update-nonindex_range=100
0.99    1.01    update-one_range=100
1.01    1.03    update-zipf_range=100
1.00    1.01    write-only_range=10000

The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.
  • jemalloc.1
    • base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
  • jemalloc.2
    • col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    • This has little impact on VSZ and RSS
  • jemalloc.3
    • col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
    • This cuts peak VSZ in half and reduces peak RSS by 9%
Peak values for MyRocks with 10G buffer pool
alloc           VSZ     RSS     RSS/10
jemalloc.1      45.6    12.2    1.22
jemalloc.2      46.0    12.5    1.25
jemalloc.3      20.2    11.6    1.16

Results: MALLOC_ARENA_MAX

The binaries tested are:
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
    • base case in the table below
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
    • col-1 in the table below
    • uses MALLOC_ARENA_MAX=1
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
    • col-2 in the table below
    • uses MALLOC_ARENA_MAX=8
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
    • col-3 in the table below
    • uses MALLOC_ARENA_MAX=48
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
    • col-4 in the table below
    • uses MALLOC_ARENA_MAX=96
These results use the relative QPS, which here is:
(QPS with MALLOC_ARENA_MAX set) / (QPS with default glibc malloc)
When this value is larger than 1.0 then QPS is better than the default.
From the results below:
  • performance with 1 or 8 arenas is lousy
  • performance drops some (often 5% to 10%) with 48 arenas
  • performance drops ~2% with 96 arenas
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
col-2 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
col-3 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
col-4 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128

col-1   col-2   col-3   col-4
0.89    0.78    0.72    0.78    hot-points_range=100
0.23    0.61    0.96    0.98    point-query_range=100
0.31    0.86    0.96    1.01    points-covered-pk_range=100
0.24    0.87    0.95    1.01    points-covered-si_range=100
0.31    0.86    0.97    1.01    points-notcovered-pk_range=100
0.20    0.86    0.97    1.00    points-notcovered-si_range=100
0.35    0.79    0.96    1.01    random-points_range=1000
0.30    0.87    0.96    1.01    random-points_range=100
0.23    0.67    0.96    0.99    random-points_range=10
0.06    0.48    0.92    0.96    range-covered-pk_range=100
0.14    0.52    0.97    0.99    range-covered-si_range=100
0.13    0.46    0.91    0.97    range-notcovered-pk_range=100
0.23    0.87    0.96    1.01    range-notcovered-si_range=100
0.23    0.76    0.97    0.99    read-only-count_range=1000
0.56    1.00    0.96    0.97    read-only-distinct_range=1000
0.20    0.47    0.90    0.94    read-only-order_range=1000
0.68    1.04    1.00    1.00    read-only_range=10000
0.21    0.76    0.98    0.99    read-only_range=100
0.19    0.70    0.97    0.99    read-only_range=10
0.21    0.58    0.94    0.98    read-only-simple_range=1000
0.19    0.57    0.95    1.00    read-only-sum_range=1000
0.53    0.98    1.00    1.01    scan_range=100
0.30    0.81    0.98    1.00    delete_range=100
0.50    0.92    1.00    1.00    insert_range=100
0.23    0.72    0.97    0.98    read-write_range=100
0.20    0.67    0.96    0.98    read-write_range=10
0.33    0.88    0.99    1.00    update-index_range=100
0.36    0.76    0.94    0.98    update-inlist_range=100
0.30    0.85    0.98    0.99    update-nonindex_range=100
0.86    0.98    1.00    1.01    update-one_range=100
0.32    0.86    0.98    0.98    update-zipf_range=100
0.27    0.80    0.97    0.98    write-only_range=10000

The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.

Using 1 arena prevents RSS bloat but comes at a huge cost in performance. If I had more time I would have tested for 2, 4 and 6 arenas but I don't think glibc malloc + RocksDB are meant to be.

Peak values for MyRocks with 10G buffer pool
alloc           VSZ     RSS     RSS/10
default         46.1    36.2    3.62
arena = 1       15.9    14.1    1.41
arena = 8       32.6    27.7    2.77
arena = 48      35.2    29.2    2.92
arena = 96      39.3    32.5    3.25


April 18, 2025

Run PostgreSQL on Kubernetes: A Practical Guide with Benchmarks & Best Practices

Remember when running databases in Kubernetes felt like a gamble? When unpredictable storage performance and complex state management made stability a constant battle? Thankfully, those days are largely behind us. With better tools, smarter operators, and field-tested strategies, you can now confidently deploy PostgreSQL on Kubernetes, especially when you need scale, automation, and platform consistency. […]

April 17, 2025

What’s New in MySQL 8.4 LTS: Key Enhancements Explained

MySQL has been evolving rapidly, and with the recent release of MySQL 8.4, there’s a lot to unpack. While MySQL 8.4 is officially dubbed a Long-Term Support (LTS) release, many features introduced between 8.0 and 8.4 have now matured into defaults or become deprecated/removed. This blog provides a comprehensive, side-by-side comparison of the key differences […]

April 16, 2025

How to Extend Percona Monitoring and Management to Add Logging Functionality

Evolution is one of the inherent traits of modern software. Many people reach out to product teams daily, asking to add more functionality to the software products they use and love. This is understandable: there will always be ways to make a product better by adding more features to the users’ delight so they can […]

April 15, 2025

PostgreSQL JSONB Indexing Limitations with B-Tree and GIN

In a previous post, I highlighted a significant limitation of relational databases: a One-to-Many relationship requires two tables, yet no index can optimize access with selective predicates on both of them, resulting in unnecessary rows that must be joined before elimination. A comment from @dogacel suggested continuing with PostgreSQL's JSON approach, so here it is.

SQL databases have evolved into general-purpose or converged databases capable of storing documents, like JSONB in PostgreSQL. However, they face challenges with indexing compared to document databases like MongoDB, which offer multi-key indexes for optimizing equality, sort, and range filtering. Let's explore JSONB documents and their indexing options.

Instead of an Order Details table, I embed the details into the Orders table:

create table orders(
 primary key(id)
 , id bigserial 
 , country_id int
 , created_at timestamptz default clock_timestamp()
 , details jsonb
);

I insert one million orders similar to the previous post:

insert into orders (country_id, details)
select
 (10 * random())::int as country_id,  -- generate a random country_id
 (
  select jsonb_agg(
   jsonb_build_object(
    'line', gs,
    'product_id', ((log(2, (1 + o.id * random())::numeric)::int)),
    'quantity', (100 * random())::int
   )
  )
  from generate_series(1, 10) as gs  -- create 10 lines per order
 ) as details
from generate_series(1, 1000000) as o(id);  -- generate 1,000,000 orders

Here is an example of Order with its details:

postgres=# select * from orders limit 1;
-[ RECORD 1 ]--------------------------------------------------------
id         | 1
country_id | 8
created_at | 2025-04-15 19:12:27.525731+00
details    | [
                {"line": 1,  "quantity": 6,   "product_id": 0},
                {"line": 2,  "quantity": 91,  "product_id": 1},
                {"line": 3,  "quantity": 66,  "product_id": 1},
                {"line": 4,  "quantity": 100, "product_id": 1},
                {"line": 5,  "quantity": 15,  "product_id": 0},
                {"line": 6,  "quantity": 80,  "product_id": 0},
                {"line": 7,  "quantity": 96,  "product_id": 1},
                {"line": 8,  "quantity": 9,   "product_id": 0},
                {"line": 9,  "quantity": 89,  "product_id": 0},
                {"line": 10, "quantity": 14,  "product_id": 0}
               ]

I want to run a simple query to retrieve the last ten orders from a specific country that contain a specific product:

SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;

Be aware of filters at various levels: on the order for country and date, and on the details for the product. With a normalized model and two tables, it was impossible to have one index covering all those filters. However, using MongoDB made it straightforward to create an index on "country_id", "order_details.product_id", and "created_at", resulting in only ten documents being read.
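For comparison, here is a hedged sketch of that MongoDB multi-key index -- the collection and field names are assumed to follow the previous post, and the key directions are an assumption:

db.orders.createIndex(
  { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
)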

Expression index on scalar values

The normalized columns in the row can easily be indexed with a B-Tree:

CREATE INDEX orders1 ON orders ("country_id", "created_at" desc)
;

A similar B-Tree index can be created on JSONB attributes as long as they are top-level, or sub-documents without repeating groups (without a JSON array). However, it is impossible to create such an index when the JSON path goes through an array, because in PostgreSQL a B-Tree index cannot have multiple index keys for the same row.

Of course, it is possible to create any index, as there's no schema declared for the JSONB content:

CREATE INDEX orders0 ON orders ( (("details"->>'product_id')::int) )
;

However, such an index will never find a document where "product_id" sits inside an array, as it indexes a scalar expression.

A query on the content of the array cannot use the index:

postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
                              QUERY PLAN                               
---------------------------------------------------------------------
 Seq Scan on orders (actual time=38.631..1293.388 rows=446355 loops=1)
   Filter: (details @> '[{"product_id": 15}]'::jsonb)
   Rows Removed by Filter: 553645
   Buffers: shared hit=16128 read=94984

Generalized Inverted Index (GIN) for arrays

In order to index through an array, we need an inverted index, and PostgreSQL has GIN:

CREATE INDEX orders2 ON orders using GIN ( details )
;

This index is used for the predicate on the array:

postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on orders (actual time=95.272..1196.876 rows=446355 loops=1)
   Recheck Cond: (details @> '[{"product_id": 15}]'::jsonb)
   Rows Removed by Index Recheck: 303886
   Heap Blocks: exact=42718 lossy=65930
   Buffers: shared hit=371 read=108648 written=3415
   ->  Bitmap Index Scan on orders2 (actual time=63.730..63.730 rows=499460 loops=1)
         Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
         Buffers: shared hit=371

I disabled Seq Scan because this predicate alone is not selective enough, returning half of the rows, but it proves that the GIN index can be used.

However, when I run the full query with additional WHERE and ORDER BY LIMIT filters, which are covered by the B-Tree index, the GIN index is no longer used:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit (actual time=0.036..0.117 rows=10 loops=1)
   Buffers: shared hit=32
   ->  Index Scan using orders1 on orders (actual time=0.035..0.115 rows=10 loops=1)
         Index Cond: (country_id = 1)
         Filter: (details @> '[{"product_id": 15}]'::jsonb)
         Rows Removed by Filter: 38
         Buffers: shared hit=32

To understand the reason, I remove the LIMIT clause:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 -- LIMIT    10
;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Gather Merge (actual time=237.895..257.523 rows=44864 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=158 read=61738, temp read=4584 written=4592
   ->  Sort (actual time=222.683..226.142 rows=14955 loops=3)
         Sort Key: created_at DESC
         Sort Method: external merge  Disk: 12688kB
         Buffers: shared hit=158 read=61738, temp read=4584 written=4592
         Worker 0:  Sort Method: external merge  Disk: 11616kB
         Worker 1:  Sort Method: external merge  Disk: 12368kB
         ->  Parallel Bitmap Heap Scan on orders (actual time=74.822..211.433 rows=14955 loops=3)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 96641
               Heap Blocks: exact=9701 lossy=11292
               Buffers: shared hit=145 read=61733
               ->  BitmapAnd (actual time=78.189..78.190 rows=0 loops=1)
                     Buffers: shared hit=145 read=613
                     ->  Bitmap Index Scan on orders1 (actual time=11.496..11.497 rows=100362 loops=1)
                           Index Cond: (country_id = 1)
                           Buffers: shared read=387
                     ->  Bitmap Index Scan on orders2 (actual time=64.445..64.445 rows=499460 loops=1)
                           Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
                           Buffers: shared hit=145 read=226

PostgreSQL can utilize both indexes, but it requires a Bitmap Scan to combine them, which does not preserve the B-Tree index order. Consequently, it must perform a Sort operation that reads all rows before returning the first one, making it unsuitable for the LIMIT clause due to excessive row reading.

Composite B-Tree + GIN with extension

With the btree_gin extension available in contrib, I can create a single index that is supposed to combine the advantages of B-Tree and GIN indexes:

postgres=# CREATE EXTENSION BTREE_GIN;
CREATE EXTENSION

postgres=# CREATE INDEX orders3 ON orders
           using GIN (country_id , details, created_at);
CREATE INDEX

Running it without LIMIT shows that it doesn't solve the problem:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 -- LIMIT    10
;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Gather Merge (actual time=123.675..143.726 rows=44864 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=237 read=40117, temp read=4585 written=4594
   ->  Sort (actual time=109.979..113.574 rows=14955 loops=3)
         Sort Key: created_at DESC
         Sort Method: external merge  Disk: 12456kB
         Buffers: shared hit=237 read=40117, temp read=4585 written=4594
         Worker 0:  Sort Method: external merge  Disk: 11720kB
         Worker 1:  Sort Method: external merge  Disk: 12504kB
         ->  Parallel Bitmap Heap Scan on orders (actual time=18.096..98.799 rows=14955 loops=3)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 1760
               Heap Blocks: exact=13486
               Buffers: shared hit=226 read=40110
               ->  Bitmap Index Scan on orders3 (actual time=25.197..25.197 rows=50144 loops=1)
                     Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
                     Buffers: shared hit=226 read=251

Only one index was scanned, with all conditions in its Index Cond, but it is still a Bitmap Scan which doesn't preserve the index key order, because that's how PostgreSQL's inverted indexes work. A GIN index, even with the BTREE_GIN extension, is not equivalent to MongoDB's multi-key indexes, which keep their index entries ordered.

Even if my new index has all columns, it is not used by the pagination query:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit (actual time=0.034..0.117 rows=10 loops=1)
   Buffers: shared hit=32
   ->  Index Scan using orders1 on orders (actual time=0.034..0.115 rows=10 loops=1)
         Index Cond: (country_id = 1)
         Filter: (details @> '[{"product_id": 15}]'::jsonb)
         Rows Removed by Filter: 38
         Buffers: shared hit=32


If I drop the B-Tree index, the btree_gin index is used, but it is not efficient, as it has to read and sort all rows before returning the ten rows of the result:

postgres=# drop index orders1;
DROP INDEX
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Limit (actual time=215.701..215.704 rows=10 loops=1)
   Buffers: shared hit=229 read=40107 written=5
   ->  Sort (actual time=215.700..215.701 rows=10 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 46kB
         Buffers: shared hit=229 read=40107 written=5
         ->  Bitmap Heap Scan on orders (actual time=31.466..204.149 rows=44864 loops=1)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 5280
               Heap Blocks: exact=39859
               Buffers: shared hit=229 read=40107 written=5
               ->  Bitmap Index Scan on orders3 (actual time=24.544..24.544 rows=50144 loops=1)
                     Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
                     Buffers: shared hit=229 read=248

Seeing all conditions in the Index Cond is not sufficient, especially with GIN indexes because they return false positives that must be rechecked later. Even if you don't have an ORDER BY LIMIT, the scan itself reads more rows than necessary, as reported by Rows Removed by Index Recheck.

Conclusion

PostgreSQL's JSONB with GIN indexing allows for schema-on-read capabilities within a schema-on-write relational model. While it enables document storage, it does not convert PostgreSQL into a document database.
Although GIN indexes efficiently handle equality predicates on array items, they are less effective for range or sorting operations compared to MongoDB's multi-key indexes, which support equality, sorting, and range queries transparently. You may use JSON datatypes to provide more flexibility to the relational model, but SQL databases should be used as relational databases because their engine is optimized for it. Always check the execution plan to avoid surprises.

Percona Server for MySQL Data Masking Enhanced with Dictionary Term Cache

In Percona Server for MySQL 8.0.41 / 8.4.4, we significantly re-designed the Data Masking Component. In particular, we made the following changes: Changed the user on behalf of whom we execute internal queries for dictionary operations. Introduced an in-memory dictionary term cache that allows significant speed-up of dictionary operations. Introduced masking_dictionaries_flush() User Defined Function. Introduced […]

Common PostgreSQL Upgrade Errors and How to Avoid Them

This post was originally published in January 2024 and was updated in April 2025. Everyone usually welcomes an upgrade, whether it’s a new phone OS, a flight seat, or a home renovation. Upgrades typically mean improvement, which holds true for PostgreSQL. Each new version brings updates vital for optimal application performance, stability, and security. However, […]

April 14, 2025

MongoDB Pricing Explained: What You’re Really Paying For

If you’re using MongoDB, whether it’s the Community Edition, Enterprise Advanced, or Atlas, you’ve probably asked yourself at some point: what is this actually costing us? And the answer isn’t always easy to pin down. The reality is that there’s no one-size-fits-all price tag. Your infrastructure, usage patterns, and team setup all factor in. That’s […]

April 13, 2025

MongoDB with Search Indexes Queried as Kimball's Star Schema with Facts and Dimensions

  • Export From SQL to CSV
  • Import to MongoDB
  • Indexing the Star Schema
  • Querying the Star Schema - Star Transformation
  • MongoDB queries on a Star Schema
  • Conclusion


In document databases like MongoDB, data modeling is typically optimized for predefined access patterns, where each domain or service owns its database. This contrasts with the relational model, which often serves as a centralized, fully normalized database designed independently of any specific application.
However, most applications also require the ability to perform ad-hoc queries or real-time analytics, where the query patterns are not predetermined.
Here, the situation is reversed: SQL often requires another database schema, whereas MongoDB offers the same dimensional model approach on the operational database.

In relational databases, a common practice is to create a separate analytics database, often in a dimensional model like the star schema popularized by Ralph Kimball. Here, normalized fact tables are surrounded by denormalized dimension tables. This setup is separate from the operational database and not updated in real-time by transactional workloads.

MongoDB allows for a different strategy. Thanks to features like MongoDB Atlas Search Indexes, real-time analytics can be performed directly on the operational database without the need for replication to a separate analytical database. Here is a simple example using the Oracle Sales History Sample Schema, which I used a lot in the past as an example of a Star Schema designed correctly with Bitmap Indexes on Foreign Keys to allow Star Transformation (discussed later).

Export From SQL to CSV

The simplest way to get the Sales History Sample Schema (SH) populated with data is to have an Autonomous Database on the Oracle Cloud Free Tier, as it is created there by default.

I used SQLcl UNLOAD to get data into CSV files:

for i in CHANNELS COSTS COUNTRIES CUSTOMERS PRODUCTS PROMOTIONS SALES SUPPLEMENTARY_DEMOGRAPHICS TIMES 
do 
 echo "set loadformat csv"
 echo "alter session set nls_date_format = 'YYYY-MM-DD';"
 echo "unload SH.$i"
done > unload.sql

sqlcl @ unload.sql

This generated the following CSV files, in the current directory:


$ wc -l *.csv | sort -n                                                                                                                                                              
       6 CHANNELS_DATA_TABLE.csv
      24 COUNTRIES_DATA_TABLE.csv
      73 PRODUCTS_DATA_TABLE.csv
     504 PROMOTIONS_DATA_TABLE.csv
    1827 TIMES_DATA_TABLE.csv
    4501 SUPPLEMENTARY_DEMOGRAPHICS_DATA_TABLE.csv
   55501 CUSTOMERS_DATA_TABLE.csv
   82113 COSTS_DATA_TABLE.csv
  918844 SALES_DATA_TABLE.csv

The CSV files contain a header with the column names used in the star schema:

Image from: https://docs.oracle.com/cd/B19306_01/server.102/b14198/graphics/comsc007.gif

Import to MongoDB

I started a local MongoDB Atlas:

atlas deployments setup  atlas --type local --port 27017 --force

I imported the CSV files as-is:

mongoimport -j 8 --type=csv --headerline --drop --file=CHANNELS_DATA_TABLE.csv --db=sh --collection=channels $@ &&                                                               
mongoimport -j 8 --type=csv --headerline --drop --file=COSTS_DATA_TABLE.csv --db=sh --collection=costs &&                                                                     
mongoimport -j 8 --type=csv --headerline --drop --file=COUNTRIES_DATA_TABLE.csv --db=sh --collection=countries &&                                                             
mongoimport -j 8 --type=csv --headerline --drop --file=CUSTOMERS_DATA_TABLE.csv --db=sh --collection=customers &&                                                             
mongoimport -j 8 --type=csv --headerline --drop --file=PRODUCTS_DATA_TABLE.csv --db=sh --collection=products &&                                                               
mongoimport -j 8 --type=csv --headerline --drop --file=PROMOTIONS_DATA_TABLE.csv --db=sh --collection=promotions &&                                                           
mongoimport -j 8 --type=csv --headerline --drop --file=SALES_DATA_TABLE.csv --db=sh --collection=sales &&
mongosh sh --eval "show collections"                         

It imports nine hundred thousand facts and the related dimensions in ten seconds.

I connect to the "sh" database and list an example of the "sales" documents:

mongosh
use sh;
db.sales.find().limit(2)

Indexing the Star Schema

In SQL databases, a single optimal index cannot be created for a Data Mart queried on multiple dimension combinations. The SH schema in Oracle Database utilizes one bitmap index for each foreign key in the fact table.
Queries read all necessary indexes and combine the results with bitmap operations, which is ideal when individual predicates lack selectivity but their combination narrows the result to a small set of rows to read from the fact table.
Bitmap indexes are effective for data warehouses but not for operational databases, as they can lead to index fragmentation and locking issues in OLTP systems. For real-time analytics, Oracle offers an In-Memory Column Store, which serves as an analytic-optimized cache for transactional databases.

MongoDB has a different solution: it can serve ad-hoc queries from a single Search Index per fact table. Atlas Search indexes run within the operational database and are asynchronously updated from the change stream, allowing near real-time analytics on operational databases.

I declared the following Search Index on the dimension key fields in the fact table:

db.sales.createSearchIndex(
   "SalesDimension",
   {
     "mappings": {
       "fields": {
         "PROD_ID":    { "type": "number" },
         "CUST_ID":    { "type": "number" },
         "TIME_ID":    { "type": "token" },
         "CHANNEL_ID": { "type": "number" },
         "PROMO_ID":   { "type": "number" },
       }
     }
   }
)

I imported data directly from the CSV without defining datatypes. The TIME_ID is a character string in the YYYY-MM-DD format, so I declared it as a token. The other fields are numbers. All fields serve as foreign keys referencing dimension collections. Their values are less important than having a natural order, which may facilitate range queries.

This example can be run on the MongoDB Atlas free tier (M0), which allows three Search Indexes.

The total size of the SH schema, without the search index, is 60MB:

Logical Data Size: 172.12 MB
Storage Size: 31.49 MB
Index Size: 30.32 MB
Total Collections: 9

which is mostly the sales collection:

Storage Size: 20.83MB
Logical Data Size: 124.4MB
Total Documents: 918843
Indexes Total Size: 26.01MB

The Search Index is approximately the same size.

This demo showcases a small database, illustrating the value of this approach when applied to larger databases. It is recommended to compare the size of the search index with the instance memory.

Querying the Star Schema - Star Transformation

The star schema's primary advantage lies in its fact table, which contains numerous rows with limited size, holding only dimension foreign keys and measures. This large row count is offset by the smaller row size. Ideally, there is one fact table to minimize joins across many rows.
Dimension tables are smaller yet can accommodate larger rows with multiple attributes for user-defined filtering and displaying long descriptions. In SQL, queries join the fact table to dimensions in the FROM clause, while predicates are specified in the WHERE clause. The RDBMS optimizer may transform the query to leverage the star schema efficiently, as I detailed in an old article for the SOUG newsletter:

Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option (slideshare.net)

The WHERE clause predicates are applied to small dimensions to obtain the list of dimension keys. This list is then used to access the fact table through bitmap indexes, retrieving only the necessary rows. Finally, this result is joined back with dimension tables for additional projections. This transformation, available in commercial RDBMS like Oracle Database Enterprise Edition, optimizes filtering to minimize unnecessary joins. It is not available in Oracle Standard Edition or PostgreSQL.
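Conceptually, the transformed query filters the fact table with key lists built from the dimensions, along these lines (a hedged sketch, not the literal rewrite the optimizer produces):

SELECT s.*
FROM   sales s
WHERE  s.PROMO_ID   IN (SELECT PROMO_ID   FROM promotions WHERE PROMO_COST > 10000)
  AND  s.CUST_ID    IN (SELECT CUST_ID    FROM customers  WHERE CUST_STATE_PROVINCE = 'CA')
  AND  s.CHANNEL_ID IN (SELECT CHANNEL_ID FROM channels   WHERE CHANNEL_DESC = 'Direct Sales')
  AND  s.TIME_ID    IN (SELECT TIME_ID    FROM times      WHERE FISCAL_YEAR = 2001 AND DAY_NAME = 'Saturday')
;
-- each IN-list is resolved through the bitmap index on the corresponding foreign key,
-- and the result is joined back to the dimensions for the projected columns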

Whether this transformation is implemented or not, an SQL query uses joins between the fact table and dimensions. For example, the following query retrieves the sales transactions from 2001 that are associated with high-cost promotions, involve customers from California, and occur through the Direct Sales channel, specifically focusing on Saturdays:

SELECT
  s.PROMO_ID, s.CUST_ID, s.CHANNEL_ID, s.TIME_ID, s.PROD_ID,
  p.PROD_NAME, p.PROD_LIST_PRICE,
  s.quantity_sold, s.amount_sold
FROM  sales s
 JOIN products p ON s.PROD_ID = p.PROD_ID
 JOIN promotions pr ON s.PROMO_ID = pr.PROMO_ID
 JOIN customers c ON s.CUST_ID = c.CUST_ID
 JOIN channels ch ON s.CHANNEL_ID = ch.CHANNEL_ID
 JOIN times t ON s.TIME_ID = t.TIME_ID
WHERE pr.PROMO_COST > 10000
  AND c.CUST_STATE_PROVINCE = 'CA'
  AND ch.CHANNEL_DESC = 'Direct Sales'
  AND t.FISCAL_YEAR = 2001
  AND t.DAY_NAME = 'Saturday'
  AND s.TIME_ID > date '2001-01-01'
;

SQL operates independently of the physical data model. The query planner must estimate the optimal order for joining tables and selecting indexes, relying on complex combinations of predicates to assess cardinalities.

MongoDB queries on a Star Schema

MongoDB gives more control to the developer on how data is accessed, and the same optimal approach can be used:

  1. get an array of references from each dimension where the filters are applied
  2. query the fact table with an aggregation pipeline using the search index to find those references
  3. lookup to the dimensions from which more fields are needed

The bitmap step is unnecessary because the search indexes, which are based on Apache Lucene, can effectively combine multiple filters with the compound operator. The dimension keys, gathered as arrays, are simple to use with the in operator.

Here is an example to retrieve sales transactions from 2001 (TIME_ID: { $gt: "2001-01-01" }) that are associated with high-cost promotions (PROMO_COST: { $gt: 10000 }), involve customers from California ({ CUST_STATE_PROVINCE: 'CA' }), and occur through the Direct Sales channel ({ CHANNEL_DESC: 'Direct Sales' }), specifically focusing on Saturdays ({ FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' }).

For the time range, as the dimension key holds all information with its YYYY-MM-DD format, I don't need to read from the dimension. For the others, I query the dimension collections to get an array of keys that verify my condition:

const promoIds = db.promotions.find( 
 {PROMO_COST: { $gt: 10000 }} 
).toArray().map(doc => doc.PROMO_ID);

const custIds  = db.customers.find( 
  { CUST_STATE_PROVINCE: 'CA' }      
).toArray().map(doc => doc.CUST_ID);

const channelIds  = db.channels.find( 
  { CHANNEL_DESC: 'Direct Sales' } 
).toArray().map(doc => doc.CHANNEL_ID);

const timeIds  = db.times.find( 
 { FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' } 
).toArray().map(doc => doc.TIME_ID);

Don't worry about long lists; the search indexes can handle them.

The next step is an aggregation pipeline that uses those lists and adds the time range in a compound filter:

db.sales.aggregate([
  {
    "$search": {
      "index": "SalesDimension",
      "compound": {
        "must": [
          { "in": { "path": "PROMO_ID". , "value": promoIds   } },
          { "in": { "path": "CUST_ID".  , "value": custIds    } },
          { "in": { "path": "CHANNEL_ID", "value": channelIds } },
          { "in": { "path": "TIME_ID".  , "value": timeIds    } },
          { "range": { "path": "TIME_ID", "gt": "2001-01-01"  } }
        ]
      }
    }
  },
  { "$sort": { "TIME_ID": -1 } },
  { "$limit": 3 },
  {
    "$lookup": {
      "from":         "products",
      "localField":   "PROD_ID",
      "foreignField": "PROD_ID",
      "as":           "product_info"
    }
  },
  {
    "$unwind": "$product_info"
  },
  {
    "$project": {
      "PROMO_ID": 1,
      "CUST_ID": 1,
      "CHANNEL_ID": 1,
      "TIME_ID": 1,
      "PROD_ID": 1,
      "quantity_sold": 1,
      "amount_sold": 1,
      "product_info.PROD_NAME": 1,
      "product_info.PROD_LIST_PRICE": 1
    }
  }
]);

The search operation is the first step of the aggregation pipeline, to efficiently read only the minimal set of documents required by the query, and further processing can be added.
I've added a sort and a limit here to display only the last three sales, plus a lookup on the product dimension and an unwind to project only the product name and list price.

The result is a collection of documents ready to return to the user.

This post aims to demonstrate the search component of the analytic query by reading only the necessary documents from the fact collection. Further analytics can be appended to the aggregation pipeline.

Conclusion

Unlike SQL star schemas, which necessitate Extract-Load-Transform (ELT) processes to another database, MongoDB’s document model allows transactions, including embedded or referenced data, to be stored in a single collection, effectively addressing multi-granularity facts without complex joins.

The Star Transformation approach can be used to generate a list of keys from lookup collections with applicable filters, similar to dimension tables. Atlas Search indexes combine these keys to retrieve necessary documents from the facts collection for further analysis in the aggregation pipeline.

With MongoDB, if you modeled your data properly for domain use cases and later encounter new query requirements needing scattered document aggregation, consider whether a search index can address these needs before changing the model or streaming to another database.

MongoDB with Search Indexes Queried as Kimball's Star Schema with Facts and Dimensions

  • Export From SQL to CSV
  • Import to MongoDB
  • Indexing the Star Schema
  • Querying the Star Schema - Star Transformation
  • MongoDB queries on a Star Schema
  • Conclusion


In document databases like MongoDB, data modeling is typically optimized for predefined access patterns, where each domain or service owns its database. This contrasts with the relational model, which often serves as a centralized, fully normalized database designed independently of any specific application.
However, most applications also require the ability to perform ad-hoc queries or real-time analytics, where the query patterns are not predetermined.
Here, it is the opposite: SQL often requires another database schema, whereas MongoDB offers the same dimensional model approach on the operational database.

In relational databases, a common practice is to create a separate analytics database, often in a dimensional model like the star schema popularized by Ralph Kimball. Here, normalized fact tables are surrounded by denormalized dimension tables. This setup is separate from the operational database and not updated in real-time by transactional workloads.

MongoDB, allows for a different strategy. Thanks to features like MongoDB Atlas Search Indexes, real-time analytics can be performed directly on the operational database without the need for replication to a separate analytical database. Here is a simple example using the Oracle Sales History Sample Schema which I used a lot in the past as an example of a Star Schema designed correctly with Bitmap Indexes on Foreign Keys to allow Star Transformation (discussed later).

Export From SQL to CSV

The simplest way to get the Sales History Sample Schema (SH) populated with data is to use an Autonomous Database on the Oracle Cloud Free Tier, where it is created by default.

I used SQLcl UNLOAD to get data into CSV files:

for i in CHANNELS COSTS COUNTRIES CUSTOMERS PRODUCTS PROMOTIONS SALES SUPPLEMENTARY_DEMOGRAPHICS TIMES 
do 
 echo "set loadformat csv"
 echo "alter session set nls_date_format = 'YYYY-MM-DD';"
 echo "unload SH.$i"
done > unload.sql

sqlcl @ unload.sql

This generated the following CSV files, in the current directory:


$ wc -l *.csv | sort -n                                                                                                                                                              
       6 CHANNELS_DATA_TABLE.csv
      24 COUNTRIES_DATA_TABLE.csv
      73 PRODUCTS_DATA_TABLE.csv
     504 PROMOTIONS_DATA_TABLE.csv
    1827 TIMES_DATA_TABLE.csv
    4501 SUPPLEMENTARY_DEMOGRAPHICS_DATA_TABLE.csv
   55501 CUSTOMERS_DATA_TABLE.csv
   82113 COSTS_DATA_TABLE.csv
  918844 SALES_DATA_TABLE.csv

The CSV files contain a header with the column names used in the star schema:

Image from: https://docs.oracle.com/cd/B19306_01/server.102/b14198/graphics/comsc007.gif

Import to MongoDB

I started a local MongoDB Atlas deployment:

atlas deployments setup  atlas --type local --port 27017 --force

I imported the CSV files as-is:

mongoimport -j 8 --type=csv --headerline --drop --file=CHANNELS_DATA_TABLE.csv --db=sh --collection=channels &&
mongoimport -j 8 --type=csv --headerline --drop --file=COSTS_DATA_TABLE.csv --db=sh --collection=costs &&
mongoimport -j 8 --type=csv --headerline --drop --file=COUNTRIES_DATA_TABLE.csv --db=sh --collection=countries &&
mongoimport -j 8 --type=csv --headerline --drop --file=CUSTOMERS_DATA_TABLE.csv --db=sh --collection=customers &&
mongoimport -j 8 --type=csv --headerline --drop --file=PRODUCTS_DATA_TABLE.csv --db=sh --collection=products &&
mongoimport -j 8 --type=csv --headerline --drop --file=PROMOTIONS_DATA_TABLE.csv --db=sh --collection=promotions &&
mongoimport -j 8 --type=csv --headerline --drop --file=SUPPLEMENTARY_DEMOGRAPHICS_DATA_TABLE.csv --db=sh --collection=supplementary_demographics &&
mongoimport -j 8 --type=csv --headerline --drop --file=TIMES_DATA_TABLE.csv --db=sh --collection=times &&
mongoimport -j 8 --type=csv --headerline --drop --file=SALES_DATA_TABLE.csv --db=sh --collection=sales &&
mongosh sh --eval "show collections"

It imports nine hundred thousand facts and the related dimensions in ten seconds:

I connect to the "sh" database and list an example of the "sales" documents:

mongosh
use sh;
db.sales.find().limit(2)
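
A quick count is enough to verify the load; it should match the line count of SALES_DATA_TABLE.csv minus the header:

// count the imported fact documents
db.sales.countDocuments()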

Indexing the Star Schema

In SQL databases, a single optimal index cannot be created for a Data Mart queried on multiple dimension combinations. The SH schema in Oracle Database utilizes one bitmap index for each foreign key in the fact table.
Queries read all the relevant indexes and combine the results with bitmap operations, which is ideal when individual predicates lack selectivity but their combination narrows the result to a small set of rows to read from the fact table.
Bitmap indexes are effective for data warehouses but not for operational databases, as they can lead to index fragmentation and locking issues in OLTP systems. For real-time analytics, Oracle offers an In-Memory Column Store, which serves as an analytic-optimized cache for transactional databases.

MongoDB has a different solution: it can serve ad-hoc queries from a single Search Index per fact collection. Atlas Search indexes run within the operational database and are asynchronously updated from the change stream, allowing near real-time analytics on the operational data.

I declared the following Search Index on the dimension key fields in the fact table:

db.sales.createSearchIndex(
   "SalesDimension",
   {
     "mappings": {
       "fields": {
         "PROD_ID":    { "type": "number" },
         "CUST_ID":    { "type": "number" },
         "TIME_ID":    { "type": "token" },
         "CHANNEL_ID": { "type": "number" },
         "PROMO_ID":   { "type": "number" },
       }
     }
   }
)

I imported data directly from the CSV without defining datatypes. The TIME_ID is a character string in the YYYY-MM-DD format, so I declared it as a token. The other fields are numbers. All fields serve as foreign keys referencing dimension collections. Their values are less important than having a natural order, which may facilitate range queries.
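
Before querying, you can check that the index build has completed. A minimal check, assuming a mongosh version recent enough to ship the search index helpers used above:

// the index is ready when the output reports it as queryable
db.sales.getSearchIndexes("SalesDimension")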

This example can be run on MongoDB Atlas free tier (M0) which allows three Search Indexes:

The total size of the SH schema, without the search index, is 60MB:

Logical Data Size: 172.12 MB
Storage Size: 31.49 MB
Index Size: 30.32 MB
Total Collections: 9

which is mostly the sales collection:

Storage Size: 20.83MB
Logical Data Size: 124.4MB
Total Documents: 918843
Indexes Total Size: 26.01MB

The Search Index takes approximately the same size:

This demo uses a small database, but it illustrates an approach whose value grows with larger databases. It is recommended to compare the size of the search index with the instance memory.
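
These sizes can also be checked from mongosh with a $collStats aggregation on the fact collection (note that the search index size itself is not included there; it is reported by Atlas Search):

// storage statistics for the sales collection, in bytes
db.sales.aggregate([
  { "$collStats": { "storageStats": {} } },
  { "$project": {
      "storageStats.size": 1,            // logical data size
      "storageStats.storageSize": 1,     // compressed storage size
      "storageStats.totalIndexSize": 1   // regular (non-search) indexes
  } }
])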

Querying the Star Schema - Star Transformation

The star schema's primary advantage lies in its fact table, which contains numerous rows with limited size, holding only dimension foreign keys and measures. This large row count is offset by the smaller row size. Ideally, there is one fact table to minimize joins across many rows.
Dimension tables are smaller yet can accommodate larger rows with multiple attributes for user-defined filtering and displaying long descriptions. In SQL, queries join the fact table to dimensions in the FROM clause, while predicates are specified in the WHERE clause. The RDBMS optimizer may transform the query to leverage the star schema efficiently, as I detailed in an old article for the SOUG newsletter:

Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option (slides on slideshare.net)

The WHERE clause predicates are applied to small dimensions to obtain the list of dimension keys. This list is then used to access the fact table through bitmap indexes, retrieving only the necessary rows. Finally, this result is joined back with dimension tables for additional projections. This transformation, available in commercial RDBMS like Oracle Database Enterprise Edition, optimizes filtering to minimize unnecessary joins. It is not available in Oracle Standard Edition or PostgreSQL.

Whether this transformation is implemented or not, an SQL query uses joins between the fact table and the dimensions. For example, the following query retrieves the sales transactions from 2001 that are associated with high-cost promotions, involve customers from California, and occur through the Direct Sales channel, specifically focusing on Saturdays:

SELECT
  s.PROMO_ID, s.CUST_ID, s.CHANNEL_ID, s.TIME_ID, s.PROD_ID,
  p.PROD_NAME, p.PROD_LIST_PRICE,
  s.quantity_sold, s.amount_sold
FROM  sales s
 JOIN products p ON s.PROD_ID = p.PROD_ID
 JOIN promotions pr ON s.PROMO_ID = pr.PROMO_ID
 JOIN customers c ON s.CUST_ID = c.CUST_ID
 JOIN channels ch ON s.CHANNEL_ID = ch.CHANNEL_ID
 JOIN times t ON s.TIME_ID = t.TIME_ID
WHERE pr.PROMO_COST > 10000
  AND c.CUST_STATE_PROVINCE = 'CA'
  AND ch.CHANNEL_DESC = 'Direct Sales'
  AND t.FISCAL_YEAR = 2001
  AND t.DAY_NAME = 'Saturday'
  AND s.TIME_ID > date '2001-01-01'
;

SQL operates independently of the physical data model. The query planner must estimate the optimal order for joining tables and selecting indexes, relying on complex combinations of predicates to assess cardinalities.

MongoDB queries on a Star Schema

MongoDB data models in operational databases resemble a dimensional schema, with a large collection that contains all business transaction information, like a fact table, and referencing smaller lookup tables similar to dimension tables.
MongoDB, designed for developers, offers more control over data access and allows the Star Transformation approach to be integrated into application code:

  1. get an array of references from each dimension where the filters are applied on many attributes
  2. query the fact table with an aggregation pipeline using the search index to find the documents having those references
  3. lookup to the dimensions from which more fields are needed for calculations, aggregations or projections

The bitmap step is unnecessary because the search indexes, which are based on Apache Lucene (with boolean queries), can effectively combine multiple filters with the compound operator. The dimension keys gathered as arrays are simple to use with the in operator.

Here is an example to retrieve sales transactions from 2001 (TIME_ID: { $gt: "2001-01-01" }) that are associated with high-cost promotions (PROMO_COST: { $gt: 10000 }), involve customers from California ({ CUST_STATE_PROVINCE: 'CA' }), and occur through the Direct Sales channel ({ CHANNEL_DESC: 'Direct Sales' }), specifically focusing on Saturdays ({ FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' }). One criterion alone is not highly selective, but their combination is.

For the time range, as the dimension key holds all information with its YYYY-MM-DD format, I don't need to read from the dimension. For the others, I query the dimension collections to get an array of keys that verify my conditions on their attributes:

const promoIds = db.promotions.find( 
 {PROMO_COST: { $gt: 10000 }} 
).toArray().map(doc => doc.PROMO_ID);

const custIds  = db.customers.find( 
  { CUST_STATE_PROVINCE: 'CA' }      
).toArray().map(doc => doc.CUST_ID);

const channelIds  = db.channels.find( 
  { CHANNEL_DESC: 'Direct Sales' } 
).toArray().map(doc => doc.CHANNEL_ID);

const timeIds  = db.times.find( 
 { FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' } 
).toArray().map(doc => doc.TIME_ID);

For the dimensions that are small, a COLLSCAN is fast, and larger dimensions can be indexed. Those collections are rarely updated and can have many indexes.
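
For example, if the dimensions grew, a couple of ordinary secondary indexes on the filtered attributes would be enough (optional here; the attribute names come from the SH schema):

// optional secondary indexes for larger dimension collections
db.customers.createIndex({ CUST_STATE_PROVINCE: 1 })
db.times.createIndex({ FISCAL_YEAR: 1, DAY_NAME: 1 })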

Don't worry about long lists; the search index can handle them.
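
As a quick sanity check, the size of each key list can be printed before building the pipeline:

// number of keys collected from each dimension filter
console.log({
  promos:    promoIds.length,
  customers: custIds.length,
  channels:  channelIds.length,
  times:     timeIds.length
})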

The next step is an aggregation pipeline that uses those lists, and adds the time range, in a compound filter:

db.sales.aggregate([
  {
    "$search": {
      "index": "SalesDimension",
      "compound": {
        "must": [
          { "in": { "path": "PROMO_ID". , "value": promoIds   } },
          { "in": { "path": "CUST_ID".  , "value": custIds    } },
          { "in": { "path": "CHANNEL_ID", "value": channelIds } },
          { "in": { "path": "TIME_ID".  , "value": timeIds    } },
          { "range": { "path": "TIME_ID", "gt": "2001-01-01"  } }
        ]
      }
    }
  },
  { "$sort": { "TIME_ID": -1 } },
  { "$limit": 3 },
  {
    "$lookup": {
      "from":         "products",
      "localField":   "PROD_ID",
      "foreignField": "PROD_ID",
      "as":           "product_info"
    }
  },
  {
    "$unwind": "$product_info"
  },
  {
    "$project": {
      "PROMO_ID": 1,
      "CUST_ID": 1,
      "CHANNEL_ID": 1,
      "TIME_ID": 1,
      "PROD_ID": 1,
      "quantity_sold": 1,
      "amount_sold": 1,
      "product_info.PROD_NAME": 1,
      "product_info.PROD_LIST_PRICE": 1
    }
  }
]);

If you are new to MongoDB, an aggregation pipeline is an array of operations applied to a collection of documents. You use the same approach in SQL when a query with many nested subqueries becomes hard to read, evolve and debug: you rewrite it with a WITH clause and split the logical steps into Common Table Expressions (CTEs). An aggregation pipeline follows the same principle.
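
To push the analogy, the same pipeline can be written with each logical step in its own variable and then composed, which reads much like CTEs. Here is a sketch reusing the key arrays built above (the final $project stage is left out for brevity):

// name each logical step, then compose them, like CTEs in SQL
const searchSales = { "$search": {
  "index": "SalesDimension",
  "compound": { "must": [
    { "in": { "path": "PROMO_ID",   "value": promoIds   } },
    { "in": { "path": "CUST_ID",    "value": custIds    } },
    { "in": { "path": "CHANNEL_ID", "value": channelIds } },
    { "in": { "path": "TIME_ID",    "value": timeIds    } },
    { "range": { "path": "TIME_ID", "gt": "2001-01-01"  } }
  ] }
} };
const lastThreeSales  = [ { "$sort": { "TIME_ID": -1 } }, { "$limit": 3 } ];
const withProductInfo = [
  { "$lookup": { "from": "products", "localField": "PROD_ID",
                 "foreignField": "PROD_ID", "as": "product_info" } },
  { "$unwind": "$product_info" }
];
db.sales.aggregate([ searchSales, ...lastThreeSales, ...withProductInfo ]);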

The search operation is the first step of the aggregation pipeline; it efficiently retrieves only the minimal set of documents required by the query, and further processing can be added.
I've added a sort and a limit to display only the last three sales, plus a lookup on the product dimension, with an unwind, to project only the product name and list price.

The result is a collection of documents ready to return to the user. It comes back in 8 milliseconds because all the filtering is resolved before the documents are fetched.
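
To confirm that the filtering happens in the $search stage rather than later in the pipeline, the same aggregation can be wrapped in an explain, which reports how the search stage was executed (reusing the named stages from the sketch above):

// explain the pipeline; the $search stage shows how the compound filter was resolved
db.sales.explain("executionStats").aggregate([ searchSales, ...lastThreeSales, ...withProductInfo ])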

This post aims to demonstrate the search component of the analytic query by reading only the necessary documents from the fact collection. Further analytics can be appended to the aggregation pipeline. The data model can be improved (using better datatypes, keys as "_id", and index clustering), but the goal was to show how we can run efficient analytic queries without changing the schema.
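
As an illustration of one such refinement, the TIME_ID string could be materialized as a BSON date with an update pipeline (the TIME_DATE field name is just for this example):

// illustrative only: derive a real date field from the YYYY-MM-DD string key
db.sales.updateMany(
  {},
  [ { "$set": { "TIME_DATE": { "$dateFromString": { "dateString": "$TIME_ID" } } } } ]
)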

Conclusion

Unlike SQL star schemas, which necessitate Extract-Load-Transform (ELT) processes to another database, MongoDB’s document model allows business transactions, including embedded or referenced data, to be stored in a single collection in the operational database, effectively addressing multi-granularity facts without complex joins.

To avoid embedding excessive information from dimension attributes into large collections, the Star Transformation approach can be used to generate a list of keys from lookup collections with applicable filters, similar to dimension tables. Atlas Search Indexes combine these keys to retrieve necessary documents from the facts collection for further analysis in the aggregation pipeline.

With MongoDB, if you modeled your data properly for domain use cases and later encounter new query requirements needing scattered document aggregation, consider whether a search index can address these needs before changing the model or streaming to another database.