April 17, 2025
Decomposing Transactional Systems
April 16, 2025
Lessons learned from 5 years operating huge ClickHouse® clusters: Part II
How to Extend Percona Monitoring and Management to Add Logging Functionality
April 15, 2025
PostgreSQL JSONB Indexing Limitations with B-Tree and GIN
In a previous post, I highlighted a significant limitation of relational databases: a One-to-Many relationship requires two tables, yet no index can optimize access with selective predicates on both of them, resulting in unnecessary rows that must be joined before elimination. A comment from @dogacel suggested continuing with PostgreSQL's JSON approach, so here it is.
SQL databases have evolved into general-purpose or converged databases capable of storing documents, like JSONB in PostgreSQL. However, they face challenges with indexing compared to document databases like MongoDB, which offer multi-key indexes for optimizing equality, sort, and range filtering. Let's explore JSONB documents and their indexing options.
Instead of an Order Details table, I embed the details into the Orders table:
create table orders(
primary key(id)
, id bigserial
, country_id int
, created_at timestamptz default clock_timestamp()
, details jsonb
);
I insert one million orders similar to the previous post:
insert into orders (country_id, details)
select
(10 * random())::int as country_id, -- generate a random country_id
(
select jsonb_agg(
jsonb_build_object(
'line', gs,
'product_id', ((log(2, (1 + o.id * random())::numeric)::int)),
'quantity', (100 * random())::int
)
)
from generate_series(1, 10) as gs -- create 10 lines per order
) as details
from generate_series(1, 1000000) as o(id); -- generate 1,000,000 orders
Here is an example of Order with its details:
postgres=# select * from orders limit 1;
-[ RECORD 1 ]--------------------------------------------------------
id | 1
country_id | 8
created_at | 2025-04-15 19:12:27.525731+00
details | [
{"line": 1, "quantity": 6, "product_id": 0},
{"line": 2, "quantity": 91, "product_id": 1},
{"line": 3, "quantity": 66, "product_id": 1},
{"line": 4, "quantity": 100, "product_id": 1},
{"line": 5, "quantity": 15, "product_id": 0},
{"line": 6, "quantity": 80, "product_id": 0},
{"line": 7, "quantity": 96, "product_id": 1},
{"line": 8, "quantity": 9, "product_id": 0},
{"line": 9, "quantity": 89, "product_id": 0},
{"line": 10, "quantity": 14, "product_id": 0}
]
I want to run a simple query to retrieve the last ten orders from a specific country that contain a specific product:
SELECT country_id, details
FROM orders
WHERE country_id = 1
AND details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
LIMIT 10
;
Note the filters at two levels: on the order for the country and date, and on the order details for the product. With a normalized model and two tables, it was impossible to have one index covering all those filters. However, in MongoDB it was straightforward to create an index on "country_id", "order_details.product_id", and "created_at", resulting in only ten documents being read.
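For reference, here is a minimal mongosh sketch of that multi-key index, using the collection and field names from the previous post's example:
// MongoDB multi-key index covering both levels of filters plus the sort (sketch)
db.orders.createIndex(
  { country_id: 1, "order_details.product_id": 1, created_at: -1 }
)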
Expression index on scalar values
The normalized columns in the row can easily be indexed with a B-Tree:
CREATE INDEX orders1 ON orders ("country_id", "created_at" desc)
;
A similar B-Tree index can be created on JSONB attributes as long as they are top-level fields, or in sub-documents without repeating groups (no JSON array on the path). However, such an index cannot be created when the JSON path goes through an array, because in PostgreSQL a B-Tree index cannot hold multiple index keys for the same row.
Of course, it is possible to create any index, as there's no schema declared for the JSONB content:
CREATE INDEX orders0 ON orders ( (("details"->>'product_id')::int) )
;
However, such an index will never find a document where "product_id" sits inside an array, as it indexes a scalar expression.
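For illustration, the only query shape this expression index can serve is an equality on the scalar path, which never matches the array-shaped documents generated above (a sketch):
SELECT *
FROM orders
WHERE (details->>'product_id')::int = 15; -- can use orders0, but details->>'product_id' is NULL
                                          -- when details is a JSON array, so no rows match here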
A query on the content of the array cannot use the index:
postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on orders (actual time=38.631..1293.388 rows=446355 loops=1)
Filter: (details @> '[{"product_id": 15}]'::jsonb)
Rows Removed by Filter: 553645
Buffers: shared hit=16128 read=94984
Generalized Inverted Index (GIN) for arrays
In order to index through an array, we need an inverted index, and PostgreSQL has GIN:
CREATE INDEX orders2 ON orders using GIN ( details )
;
This index is used for the predicate on the array:
postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on orders (actual time=95.272..1196.876 rows=446355 loops=1)
Recheck Cond: (details @> '[{"product_id": 15}]'::jsonb)
Rows Removed by Index Recheck: 303886
Heap Blocks: exact=42718 lossy=65930
Buffers: shared hit=371 read=108648 written=3415
-> Bitmap Index Scan on orders2 (actual time=63.730..63.730 rows=499460 loops=1)
Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
Buffers: shared hit=371
I disabled Seq Scan because this predicate alone is not selective enough, returning half of the rows, but it proves that the GIN index can be used.
However, when I run the full query with additional WHERE and ORDER BY LIMIT filters, which are covered by the B-Tree index, the GIN index is no longer used:
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT country_id, details
FROM orders
WHERE country_id = 1
AND details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
LIMIT 10
;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (actual time=0.036..0.117 rows=10 loops=1)
Buffers: shared hit=32
-> Index Scan using orders1 on orders (actual time=0.035..0.115 rows=10 loops=1)
Index Cond: (country_id = 1)
Filter: (details @> '[{"product_id": 15}]'::jsonb)
Rows Removed by Filter: 38
Buffers: shared hit=32
To understand the reason, I remove the LIMIT clause:
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT country_id, details
FROM orders
WHERE country_id = 1
AND details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
-- LIMIT 10
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Gather Merge (actual time=237.895..257.523 rows=44864 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=158 read=61738, temp read=4584 written=4592
-> Sort (actual time=222.683..226.142 rows=14955 loops=3)
Sort Key: created_at DESC
Sort Method: external merge Disk: 12688kB
Buffers: shared hit=158 read=61738, temp read=4584 written=4592
Worker 0: Sort Method: external merge Disk: 11616kB
Worker 1: Sort Method: external merge Disk: 12368kB
-> Parallel Bitmap Heap Scan on orders (actual time=74.822..211.433 rows=14955 loops=3)
Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
Rows Removed by Index Recheck: 96641
Heap Blocks: exact=9701 lossy=11292
Buffers: shared hit=145 read=61733
-> BitmapAnd (actual time=78.189..78.190 rows=0 loops=1)
Buffers: shared hit=145 read=613
-> Bitmap Index Scan on orders1 (actual time=11.496..11.497 rows=100362 loops=1)
Index Cond: (country_id = 1)
Buffers: shared read=387
-> Bitmap Index Scan on orders2 (actual time=64.445..64.445 rows=499460 loops=1)
Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
Buffers: shared hit=145 read=226
PostgreSQL can utilize both indexes, but it requires a Bitmap Scan to combine them, which does not preserve the B-Tree index order. Consequently, it must perform a Sort operation that reads all rows before returning the first one, making it unsuitable for the LIMIT clause due to excessive row reading.
Composite B-Tree + GIN with extension
With the btree_gin extension, available in contrib, I can create a single index that is supposed to combine the advantages of B-Tree and GIN indexes:
postgres=# CREATE EXTENSION BTREE_GIN;
CREATE EXTENSION
postgres=# CREATE INDEX orders3 ON orders
using GIN (country_id , details, created_at);
CREATE INDEX
Running it without LIMIT shows that it doesn't solve the problem:
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT country_id, details
FROM orders
WHERE country_id = 1
AND details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
-- LIMIT 10
;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Gather Merge (actual time=123.675..143.726 rows=44864 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=237 read=40117, temp read=4585 written=4594
-> Sort (actual time=109.979..113.574 rows=14955 loops=3)
Sort Key: created_at DESC
Sort Method: external merge Disk: 12456kB
Buffers: shared hit=237 read=40117, temp read=4585 written=4594
Worker 0: Sort Method: external merge Disk: 11720kB
Worker 1: Sort Method: external merge Disk: 12504kB
-> Parallel Bitmap Heap Scan on orders (actual time=18.096..98.799 rows=14955 loops=3)
Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
Rows Removed by Index Recheck: 1760
Heap Blocks: exact=13486
Buffers: shared hit=226 read=40110
-> Bitmap Index Scan on orders3 (actual time=25.197..25.197 rows=50144 loops=1)
Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
Buffers: shared hit=226 read=251
Only one index was scanned, with all conditions in its Index Cond, but it is still a Bitmap Scan, which doesn't preserve the index key order, because that's how PostgreSQL inverted indexes work. A GIN index, even with the BTREE_GIN extension, is not equivalent to MongoDB multi-key indexes, whose index entries are ordered.
Even though my new index contains all the columns, it is not used by the pagination query:
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT country_id, details
FROM orders
WHERE country_id = 1
AND details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
LIMIT 10
;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (actual time=0.034..0.117 rows=10 loops=1)
Buffers: shared hit=32
-> Index Scan using orders1 on orders (actual time=0.034..0.115 rows=10 loops=1)
Index Cond: (country_id = 1)
Filter: (details @> '[{"product_id": 15}]'::jsonb)
Rows Removed by Filter: 38
Buffers: shared hit=32
If I drop the B-Tree index, the btree_gin index is used, but it is not efficient, as it has to read and sort all matching rows before returning the ten rows of the result:
postgres=# drop index orders1;
DROP INDEX
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT country_id, details
FROM orders
WHERE country_id = 1
AND details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
LIMIT 10
;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (actual time=215.701..215.704 rows=10 loops=1)
Buffers: shared hit=229 read=40107 written=5
-> Sort (actual time=215.700..215.701 rows=10 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 46kB
Buffers: shared hit=229 read=40107 written=5
-> Bitmap Heap Scan on orders (actual time=31.466..204.149 rows=44864 loops=1)
Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
Rows Removed by Index Recheck: 5280
Heap Blocks: exact=39859
Buffers: shared hit=229 read=40107 written=5
-> Bitmap Index Scan on orders3 (actual time=24.544..24.544 rows=50144 loops=1)
Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
Buffers: shared hit=229 read=248
Seeing all conditions in the Index Cond is not sufficient, especially with GIN indexes because they return false positives that must be rechecked later. Even if you don't have an ORDER BY LIMIT, the scan itself reads more rows than necessary, as reported by Rows Removed by Index Recheck.
Conclusion
PostgreSQL's JSONB with GIN indexing allows for schema-on-read capabilities within a schema-on-write relational model. While it enables document storage, it does not convert PostgreSQL into a document database.
Although GIN indexes efficiently handle equality predicates on array items, they are less effective for range or sorting operations compared to MongoDB's multi-key indexes, which support equality, sorting, and range queries transparently. You may use JSON datatypes to provide more flexibility to the relational model, but SQL databases should be used as relational databases because their engine is optimized for it. Always check the execution plan to avoid surprises.
Percona Server for MySQL Data Masking Enhanced with Dictionary Term Cache
Common PostgreSQL Upgrade Errors and How to Avoid Them
April 14, 2025
MongoDB Pricing Explained: What You’re Really Paying For
April 13, 2025
MongoDB with Search Indexes Queried as Kimball's Star Schema with Facts and Dimensions
- Export From SQL to CSV
- Import to MongoDB
- Indexing the Star Schema
- Querying the Star Schema - Star Transformation
- MongoDB queries on a Star Schema
- Conclusion
In document databases like MongoDB, data modeling is typically optimized for predefined access patterns, where each domain or service owns its database. This contrasts with the relational model, which often serves as a centralized, fully normalized database designed independently of any specific application.
However, most applications also require the ability to perform ad-hoc queries or real-time analytics, where the query patterns are not predetermined.
Here, it is the opposite: SQL often requires another database schema, whereas MongoDB offers the same dimensional model approach on the operational database.
In relational databases, a common practice is to create a separate analytics database, often in a dimensional model like the star schema popularized by Ralph Kimball. Here, normalized fact tables are surrounded by denormalized dimension tables. This setup is separate from the operational database and not updated in real-time by transactional workloads.
MongoDB allows for a different strategy. Thanks to features like MongoDB Atlas Search Indexes, real-time analytics can be performed directly on the operational database without the need for replication to a separate analytical database. Here is a simple example using the Oracle Sales History Sample Schema, which I used a lot in the past as an example of a Star Schema designed correctly with Bitmap Indexes on Foreign Keys to allow Star Transformation (discussed later).
Export From SQL to CSV
The simplest way to get the Sales History Sample Schema (SH) populated with data is an Autonomous Database on the Oracle Cloud Free Tier, where it is created by default.
I used SQLcl UNLOAD to get data into CSV files:
for i in CHANNELS COSTS COUNTRIES CUSTOMERS PRODUCTS PROMOTIONS SALES SUPPLEMENTARY_DEMOGRAPHICS TIMES
do
echo "set loadformat csv"
echo "alter session set nls_date_format = 'YYYY-MM-DD';"
echo "unload SH.$i"
done > unload.sql
sqlcl @ unload.sql
This generated the following CSV files, in the current directory:
$ wc -l *.csv | sort -n
6 CHANNELS_DATA_TABLE.csv
24 COUNTRIES_DATA_TABLE.csv
73 PRODUCTS_DATA_TABLE.csv
504 PROMOTIONS_DATA_TABLE.csv
1827 TIMES_DATA_TABLE.csv
4501 SUPPLEMENTARY_DEMOGRAPHICS_DATA_TABLE.csv
55501 CUSTOMERS_DATA_TABLE.csv
82113 COSTS_DATA_TABLE.csv
918844 SALES_DATA_TABLE.csv
The CSV files contain a header with the column names used in the star schema:
Image from: https://docs.oracle.com/cd/B19306_01/server.102/b14198/graphics/comsc007.gif
Import to MongoDB
I started a local MongoDB Atlas:
atlas deployments setup atlas --type local --port 27017 --force
I imported the CSV files as-is:
mongoimport -j 8 --type=csv --headerline --drop --file=CHANNELS_DATA_TABLE.csv --db=sh --collection=channels $@ &&
mongoimport -j 8 --type=csv --headerline --drop --file=COSTS_DATA_TABLE.csv --db=sh --collection=costs &&
mongoimport -j 8 --type=csv --headerline --drop --file=COUNTRIES_DATA_TABLE.csv --db=sh --collection=countries &&
mongoimport -j 8 --type=csv --headerline --drop --file=CUSTOMERS_DATA_TABLE.csv --db=sh --collection=customers &&
mongoimport -j 8 --type=csv --headerline --drop --file=PRODUCTS_DATA_TABLE.csv --db=sh --collection=products &&
mongoimport -j 8 --type=csv --headerline --drop --file=PROMOTIONS_DATA_TABLE.csv --db=sh --collection=promotions &&
mongoimport -j 8 --type=csv --headerline --drop --file=SALES_DATA_TABLE.csv --db=sh --collection=sales &&
mongosh sh --eval "show collections"
It imports nine hundred thousand facts and the related dimensions in ten seconds:
I connect to the "sh" database and list an example of the "sales" documents:
mongosh
use sh;
db.sales.find().limit(2)
Indexing the Star Schema
In SQL databases, a single optimal index cannot be created for a Data Mart queried on multiple dimension combinations. The SH schema in Oracle Database utilizes one bitmap index for each foreign key in the fact table.
Queries read all necessary indexes and combine the results with bitmap operations, which is ideal when individual predicates lack selectivity but their combination reduces the result to a small set of rows to read from the fact table.
Bitmap indexes are effective for data warehouses but not for operational databases, as they can lead to index fragmentation and locking issues in OLTP systems. For real-time analytics, Oracle offers an In-Memory Column Store, which serves as an analytic-optimized cache for transactional databases.
MongoDB has a different solution: it can serve ad-hoc queries from a single Search Index per fact table. Atlas Search indexes run within the operational database and are asynchronously updated from the change stream, allowing near real-time analytics on operational databases.
I declared the following Search Index on the dimension key fields in the fact table:
db.sales.createSearchIndex(
"SalesDimension",
{
"mappings": {
"fields": {
"PROD_ID": { "type": "number" },
"CUST_ID": { "type": "number" },
"TIME_ID": { "type": "token" },
"CHANNEL_ID": { "type": "number" },
"PROMO_ID": { "type": "number" },
}
}
}
)
I imported data directly from the CSV without defining datatypes. The TIME_ID is a character string in the YYYY-MM-DD format, so I declared it as a token. The other fields are numbers. All fields serve as foreign keys referencing dimension collections. Their values are less important than having a natural order, which may facilitate range queries.
This example can be run on MongoDB Atlas free tier (M0) which allows three Search Indexes:
The total size of the SH schema, without the search index, is 60MB:
Logical Data Size: 172.12 MB
Storage Size: 31.49 MB
Index Size: 30.32 MB
Total Collections: 9
which is mostly the sales collection:
Storage Size: 20.83MB
Logical Data Size: 124.4MB
Total Documents: 918843
Indexes Total Size: 26.01MB
The Search Index takes approximately the same size:
This demo showcases a small database, illustrating the value of this approach when applied to larger databases. It is recommended to compare the size of the search index with the instance memory.
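To compare those sizes on another deployment, here is a quick sketch with standard mongosh helpers (the Atlas Search index size itself is reported by Atlas, not by these commands):
// Database-level and collection-level storage figures (sketch)
db.stats()        // logical data size, storage size and index size for the "sh" database
db.sales.stats()  // storageSize, totalIndexSize and document count of the fact collection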
Querying the Star Schema - Star Transformation
The star schema's primary advantage lies in its fact table, which contains numerous rows with limited size, holding only dimension foreign keys and measures. This large row count is offset by the smaller row size. Ideally, there is one fact table to minimize joins across many rows.
Dimension tables are smaller yet can accommodate larger rows with multiple attributes for user-defined filtering and displaying long descriptions. In SQL, queries join the fact table to dimensions in the FROM clause, while predicates are specified in the WHERE clause. The RDBMS optimizer may transform the query to leverage the star schema efficiently, as I detailed in an old article for the SOUG newsletter:
The WHERE clause predicates are applied to small dimensions to obtain the list of dimension keys. This list is then used to access the fact table through bitmap indexes, retrieving only the necessary rows. Finally, this result is joined back with dimension tables for additional projections. This transformation, available in commercial RDBMS like Oracle Database Enterprise Edition, optimizes filtering to minimize unnecessary joins. It is not available in Oracle Standard Edition or PostgreSQL.
Whether this transformation is implemented or not, an SQL query uses joins between the fact table and dimensions. For example, the following query retrieves the sales transactions from 2001 that are associated with high-cost promotions, involve customers from California, and occur through the Direct Sales channel, specifically focusing on Saturdays:
SELECT
s.PROMO_ID, s.CUST_ID, s.CHANNEL_ID, s.TIME_ID, s.PROD_ID,
p.PROD_NAME, p.PROD_LIST_PRICE,
s.quantity_sold, s.amount_sold
FROM sales s
JOIN products p ON s.PROD_ID = p.PROD_ID
JOIN promotions pr ON s.PROMO_ID = pr.PROMO_ID
JOIN customers c ON s.CUST_ID = c.CUST_ID
JOIN channels ch ON s.CHANNEL_ID = ch.CHANNEL_ID
JOIN times t ON s.TIME_ID = t.TIME_ID
WHERE pr.PROMO_COST > 10000
AND c.CUST_STATE_PROVINCE = 'CA'
AND ch.CHANNEL_DESC = 'Direct Sales'
AND t.FISCAL_YEAR = 2001
AND t.DAY_NAME = 'Saturday'
AND s.TIME_ID > date '2001-01-01'
;
SQL operates independently of the physical data model. The query planner must estimate the optimal order for joining tables and selecting indexes, relying on complex combinations of predicates to assess cardinalities.
MongoDB queries on a Star Schema
MongoDB data models in operational databases resemble a dimensional schema, with a large collection that contains all business transaction information, like a fact table, and referencing smaller lookup tables similar to dimension tables.
MongoDB, designed for developers, offers more control over data access and allows the Star Transformation approach to be integrated into application code:
- get an array of references from each dimension where the filters are applied on many attributes
- query the fact table with an aggregation pipeline using the search index to find the documents having those references
- lookup to the dimensions from which more fields are needed for calculations, aggregations or projections
The bitmap step is unnecessary because the search indexes, which are based on Apache Lucene (with boolean queries), can effectively combine multiple filters with the compound operator. The dimension keys gathered as an array are simple to use with the in operator.
Here is an example to retrieve sales transactions from 2001 (TIME_ID: { $gt: "2001-01-01" }
) that are associated with high-cost promotions (PROMO_COST: { $gt: 10000 }
), involve customers from California ({ CUST_STATE_PROVINCE: 'CA' }
), and occur through the Direct Sales channel ({ CHANNEL_DESC: 'Direct Sales' }
), specifically focusing on Saturdays ({ FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' }
). One criterion alone is not highly selective, but their combination is.
For the time range, as the dimension key holds all information with its YYYY-MM-DD format, I don't need to read from the dimension. For the others, I query the dimension collections to get an array of keys that verify my conditions on their attributes:
const promoIds = db.promotions.find(
{PROMO_COST: { $gt: 10000 }}
).toArray().map(doc => doc.PROMO_ID);
const custIds = db.customers.find(
{ CUST_STATE_PROVINCE: 'CA' }
).toArray().map(doc => doc.CUST_ID);
const channelIds = db.channels.find(
{ CHANNEL_DESC: 'Direct Sales' }
).toArray().map(doc => doc.CHANNEL_ID);
const timeIds = db.times.find(
{ FISCAL_YEAR: 2001 , DAY_NAME: 'Saturday' }
).toArray().map(doc => doc.TIME_ID);
For the dimensions that are small, a COLLSCAN is fast, and larger dimensions can be indexed. Those collections are rarely updated and can have many indexes.
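For instance, a regular index on the customer attribute filtered above could be added if that dimension grows (a sketch; not needed at this size):
// Optional secondary index on a larger dimension collection (sketch)
db.customers.createIndex({ CUST_STATE_PROVINCE: 1 })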
Don't worry about long lists, the search indexes can handle that:
The next step is an aggregation pipeline that uses those lists, and adds the time range, in a compound filter:
db.sales.aggregate([
{
"$search": {
"index": "SalesDimension",
"compound": {
"must": [
{ "in": { "path": "PROMO_ID". , "value": promoIds } },
{ "in": { "path": "CUST_ID". , "value": custIds } },
{ "in": { "path": "CHANNEL_ID", "value": channelIds } },
{ "in": { "path": "TIME_ID". , "value": timeIds } },
{ "range": { "path": "TIME_ID", "gt": "2001-01-01" } }
]
}
}
},
{ "$sort": { "TIME_ID": -1 } },
{ "$limit": 3 },
{
"$lookup": {
"from": "products",
"localField": "PROD_ID",
"foreignField": "PROD_ID",
"as": "product_info"
}
},
{
"$unwind": "$product_info"
},
{
"$project": {
"PROMO_ID": 1,
"CUST_ID": 1,
"CHANNEL_ID": 1,
"TIME_ID": 1,
"PROD_ID": 1,
"quantity_sold": 1,
"amount_sold": 1,
"product_info.PROD_NAME": 1,
"product_info.PROD_LIST_PRICE": 1
}
}
]);
If you are new to MongoDB, an aggregation pipeline is an array of operations on a collection of documents. You use the same approach in SQL when you refactor a query with many subqueries that becomes hard to read, evolve and debug. Change it to a WITH clause and split the logical steps into Common Table Expressions (CTEs). The aggregation follows the same principle.
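As a rough SQL analogy (reusing the tables and columns of the query above, but hypothetical as a whole), the same staged logic could be written with CTEs:
WITH filtered_sales AS (   -- ~ the $search stage: filter facts on dimension keys
  SELECT s.*
  FROM sales s
  WHERE s.CHANNEL_ID IN (SELECT CHANNEL_ID FROM channels WHERE CHANNEL_DESC = 'Direct Sales')
    AND s.TIME_ID > DATE '2001-01-01'
), last_sales AS (         -- ~ $sort and $limit
  SELECT * FROM filtered_sales ORDER BY TIME_ID DESC FETCH FIRST 3 ROWS ONLY
)                          -- ~ $lookup and $project
SELECT l.TIME_ID, l.PROD_ID, p.PROD_NAME, p.PROD_LIST_PRICE
FROM last_sales l JOIN products p ON p.PROD_ID = l.PROD_ID;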
The search operation is the first step of the aggregation pipeline, to efficiently retrieve the minimal set of documents required by the query, and further processing can be added.
I've added a sort and limit operation here to display only the last three sales. I've added more information with a lookup on the product dimension, and unwind to project only the name and price.
The result is a collection of documents ready to return to the user, produced in 8 milliseconds because all filtering was resolved before fetching the documents:
This post aims to demonstrate the search component of the analytic query by reading only the necessary documents from the fact collection. Further analytics can be appended to the aggregation pipeline. The data model can be improved (using better datatypes, keys as "_id", and index clustering), but the goal was to show how we can run efficient analytic queries without changing the schema.
Conclusion
Unlike SQL star schemas, which necessitate Extract-Load-Transform (ELT) processes to another database, MongoDB’s document model allows business transactions, including embedded or referenced data, to be stored in a single collection in the operational database, effectively addressing multi-granularity facts without complex joins.
To avoid embedding excessive information from dimension attributes into large collections, the Star Transformation approach can be used to generate a list of keys from lookup collections with applicable filters, similar to dimension tables. Atlas Search Indexes combine these keys to retrieve necessary documents from the facts collection for further analysis in the aggregation pipeline.
With MongoDB, if you modeled your data properly for domain use cases and later encounter new query requirements needing scattered document aggregation, consider whether a search index can address these needs before changing the model or streaming to another database.
April 12, 2025
Battle of the Mallocators
If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.
I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.
There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details of his setup, and I failed to reproduce his result on mine. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem; it is just an artifact of jemalloc that can be confusing. The RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.
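A quick way to look at both numbers on a running server (a generic sketch, not the author's measurement method):
# RSS vs VSZ (in KB) of the mysqld process
ps -o pid,rss,vsz,comm -p "$(pidof mysqld)"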
tl;dr
- For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
- With respect to peak RSS
- For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
- For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
- For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
- For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
- For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc
Prior art
I have several blog posts on using jemalloc with MyRocks.
- October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
- April 2017 - Performance for large, concurrent allocations
- April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
- August 2023 - RocksDB and glibc malloc
- September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS)
- September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
- October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0
The allocators tested were:
- glibc
- version 2.35-0ubuntu3.9
- tcmalloc
- provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
- enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
- jemalloc
- provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
- enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf
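For reference, a minimal my.cnf sketch for the jemalloc case (malloc-lib is read by mysqld_safe; the path is the one quoted above):
[mysqld_safe]
# preload jemalloc for the mysqld process
malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so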
The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.
From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.
(QPS with $allocator) / (QPS with glibc malloc)
The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc), and from the vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are both much larger for tcmalloc.
(QPS with $allocator) / (QPS with glibc malloc)
Torn Write Detection and Protection
April 11, 2025
Oracle Multi-Value Index and ORDER BY Pagination queries
In the previous post, I highlighted a case where relational SQL databases struggle with optimizing index access with joins. In contrast, a document model using MongoDB's multi-key indexes efficiently retrieves only the necessary data for the results.
SQL databases have implemented document datatypes, storing them in binary JSON columns of SQL tables, but that's not enough to compete as a document database. It's essential to look at the JSON path indexing possibilities.
A JSON document may include nested sub-documents, and you must index paths that may go through arrays. Oracle 21c introduced multi-value indexes for JSON. Let's test it with the same example from the previous post where MongoDB reads only the necessary index entries, filtering upfront thanks to embedding a One-to-Many instead of joining.
Example
Since I demonstrated this on MongoDB, I used the MongoDB-compatible API of Oracle 23c with the same syntax, which saves me from migrating to the SQL/JSON syntax:
const bulkOps = [];
for (let i = 0; i < 1000000; i++) {
const orderDetails = [];
for (let line = 1; line <= 10; line++) {
orderDetails.push({
line: line,
product_id: Math.floor(Math.log2(1 + i * Math.random())),
quantity: Math.floor(100 * Math.random()),
});
}
bulkOps.push({
insertOne: {
document: {
country_id: Math.floor(10 * Math.random()),
created_at: new Date(),
order_details: orderDetails
}
}
});
}
db.orders.bulkWrite(bulkOps);
db.orders.createIndex(
  { "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
);
The MongoDB emulation on Oracle Database has converted this to the following SQL table with a multi-value index:
CREATE JSON COLLECTION TABLE "ORA"."orders"
( "SYS_NC00005$" RAW(4000) GENERATED ALWAYS AS ( JSON_MKMVI(JSON_TABLE( "DATA", '$' PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH '$."country_id"[*]' COLUMNS( "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH '$."order_details"[*]' COLUMNS( NESTED PATH '$."product_id"[*]' COLUMNS( "K1" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) , NESTED PATH '$."created_at"[*]' COLUMNS( "K2" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH DESC ) ) ) AS "K0","K1","K2" DESC )) VIRTUAL ,
"SYS_NC00006$" RAW(4000) GENERATED ALWAYS AS (JSON_QUERY("DATA" FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE)) VIRTUAL ,
"SYS_NC00007$" RAW(6001) GENERATED ALWAYS AS (JSON_QUERY("DATA" FORMAT OSON , '$."created_at"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE DESC )) VIRTUAL
) DEFAULT COLLATION "USING_NLS_COMP" ;
CREATE MULTIVALUE INDEX "ORA"."$ora:orders.country_id_1_order_details.product_id_1_created_at_-1" ON "ORA"."orders" (JSON_MKMVI(JSON_TABLE( "DATA", '$' PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH '$."country_id"[*]' COLUMNS( "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH '$."order_details"[*]' COLUMNS( NESTED PATH '$."product_id"[*]' COLUMNS( "K1" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) , NESTED PATH '$."created_at"[*]' COLUMNS( "K2" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH DESC ) ) ) AS "K0","K1","K2" DESC ), JSON_QUERY("DATA" FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE), JSON_QUERY("DATA" FORMAT OSON , '$."created_at"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE DESC ) DESC)
;
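To check what the emulation created, the data dictionary can be queried. This is a minimal sketch, assuming the "ORA"."orders" collection table above: USER_TAB_COLS shows the hidden virtual columns backing the index and USER_INDEXES shows the multi-value index itself.
-- list the columns of the collection table, including the hidden virtual ones
SELECT column_name, data_type, hidden_column
FROM user_tab_cols
WHERE table_name = 'orders';
-- confirm the multi-value index and its type
SELECT index_name, index_type, uniqueness
FROM user_indexes
WHERE table_name = 'orders';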
The execution plan provided by the translation layer shows some SQL, but not the execution statistics:
ora> db.orders.find( { country_id: 1, order_details: { $elemMatch: { product_id: 15 } } } ).sort({ created_at: -1 }).limit(10).explain(`executionStatistics`)
{
queryPlanner: {
plannerVersion: 1,
namespace: 'ora.orders',
indexFilterSet: false,
parsedQuery: {
'$query': {
'$and': [
{ country_id: { '$numberOnly': 1 } },
{ 'order_details[*]': { product_id: { '$numberOnly': 15 } } }
]
},
'$orderby': {
'$fields': [ { path: 'created_at', order: 'desc', sortByMinMax: true } ],
'$lax': true
}
},
rewrittenQuery: {
'$and': [
{
'$query': {
'$and': [
{ country_id: { '$numberOnly': 1 } },
{
'order_details[*]': { product_id: { '$numberOnly': 15 } }
}
]
}
},
{
'$orderby': {
'$fields': [
{ path: 'created_at', order: 'desc', sortByMinMax: true }
],
'$lax': true
}
}
]
},
winningPlan: {
stage: 'SELECT STATEMENT',
inputStage: {
stage: 'COUNT',
options: 'STOPKEY',
columns: '"from$_subquery$_002"."DATA"[JSON,8200], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]',
filterType: 'filter',
filter: 'ROWNUM<=10',
inputStage: {
stage: 'VIEW',
columns: '"from$_subquery$_002"."DATA"[JSON,8200], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]',
inputStage: {
stage: 'SORT',
options: 'ORDER BY STOPKEY',
columns: `(#keys=1) JSON_VALUE(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.created_at[*].max()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR TYPE(LAX) ) FORMAT OSON , '$' RETURNING ANY ORA_RAWCOMPARE(32767) ERROR ON ERROR TYPE(LAX) )[32767], "DATA" /*+ LOB_BY_VALUE */ [JSON,8200], JSON_VALUE("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$._id' RETURNING ANY ORA_RAWCOMPARE(2000) NO ARRAY ERROR ON ERROR TYPE(LAX) )[RAW,2000], "ETAG"[RAW,16]`,
filterType: 'filter',
filter: 'ROWNUM<=10',
path: "$.created_at[*].max()'",
inputStage: {
stage: 'TABLE ACCESS',
options: 'BY INDEX ROWID BATCHED',
source: 'orders',
columns: `"DATA" /*+ LOB_BY_VALUE */ [JSON,8200], JSON_VALUE("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$._id' RETURNING ANY ORA_RAWCOMPARE(2000) NO ARRAY ERROR ON ERROR TYPE(LAX) )[RAW,2000], "ETAG"[RAW,16]`,
path: "$._id'",
inputStage: {
stage: 'HASH',
options: 'UNIQUE',
columns: '(#keys=2) "orders".ROWID[ROWID,10], SYSVARCOL[8]',
inputStage: {
stage: 'INDEX',
options: 'RANGE SCAN (MULTI VALUE)',
source: '$ora:orders.country_id_1_order_details.product_id_1_created_at_-1',
columns: `"orders".ROWID[ROWID,10], JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."country_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE)[RAW,4000], JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE)[RAW,4000], SYSVARCOL[8]`,
filterType: 'access',
filter: `JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."country_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE)=SYS_CONS_ANY_SCALAR(:1, 3) AND JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."order_details"[*]."product_id"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX) MULTIVALUE)=SYS_CONS_ANY_SCALAR(:2, 3)`
}
}
}
}
}
}
},
rejectPlans: []
},
serverInfo: { host: 'localhost', port: 27017, version: '4.2.14' },
ok: 1
}
ora>
However, it's the SQL engine underneath, and I captured the SQL statement it runs from GV$SQL:
SELECT /*+ FIRST_ROWS(10) monitor */
"DATA",
RAWTOHEX("RESID"),
"ETAG"
FROM
"ORA"."orders"
WHERE
JSON_EXISTS ( "DATA", '$?(
(@.country_id.numberOnly() == $B0)
&&
( exists(@.order_details[*]?( (@.product_id.numberOnly() == $B1) )) )
)'
PASSING :1 AS "B0",
:2 AS "B1" TYPE ( STRICT ) )
ORDER BY
JSON_QUERY("DATA", '$.created_at[*].max()') DESC NULLS LAST
FETCH NEXT 10 ROWS ONLY
I have added the monitor hint (visible in the /*+ FIRST_ROWS(10) monitor */ comment above) to get the SQL execution plan with execution statistics from SQL Monitor.
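One way to retrieve such a report (a sketch with placeholder values, not necessarily how it was captured here) is to find the statement's SQL_ID in GV$SQL and then ask for the text report with DBMS_SQLTUNE.REPORT_SQL_MONITOR:
-- find the SQL_ID of the statement issued by the MongoDB emulation
SELECT sql_id, sql_text
FROM gv$sql
WHERE sql_text LIKE '%FIRST_ROWS(10) monitor%';
-- get the SQL Monitor report with execution statistics ('&sql_id' is a placeholder)
SELECT dbms_sqltune.report_sql_monitor(sql_id => '&sql_id', type => 'TEXT', report_level => 'ALL')
FROM dual;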
Unlike MongoDB, which reads only ten documents for such a .find().sort().limit(10) query, Oracle's multi-value index had to read all matching index entries (53K rows) and their documents, and sort them, before returning the Top-10. The index was used for filtering on the WHERE clause, but not for ordering on the ORDER BY ... FETCH NEXT 10 ROWS ONLY, despite having "created_at" in the index key right after the equality filter columns.
Possible explanation
I tried to understand whether this is a limitation of Oracle multi-value indexes or of the query planner. The internals of Oracle multi-value indexes are not documented, but there is a patent that describes the idea: https://patents.google.com/patent/US11640380B2
My understanding from it is that the following document:
{
"_id": ObjectId('67f1a477aabaf2dad73f4791'),
"country_id": 1,
"created_at": ISODate('2025-04-05T21:45:21.546Z'),
"order_details": [
{ "line": 1, "product_id": 19, "quantity": 40 },
{ "line": 2, "product_id": 15, "quantity": 10 },
{ "line": 3, "product_id": 18, "quantity": 75 },
{ "line": 4, "product_id": 15, "quantity": 50 }
]
}
with the following index:
db.orders.createIndex(
{ "country_id": 1, "order_details.product_id": 1, "created_at": -1 }
);
results in the following index entries, where # is the ordinal position in the "order_details" array:
+------------+--------------------+--------------------------+
| country_id | order_details      | created_at               |
|            +---------------+----+                          |
|            |  .product_id  | #  |                          |
+------------+---------------+----+--------------------------+
| 1          | 19            | 0  | 2025-04-05T21:45:21.546Z |
| 1          | 15            | 1  | 2025-04-05T21:45:21.546Z |
| 1          | 18            | 2  | 2025-04-05T21:45:21.546Z |
| 1          | 15            | 3  | 2025-04-05T21:45:21.546Z |
+------------+---------------+----+--------------------------+
By applying an equality predicate on "country_id" and "product_id," and considering that "created_at" is at the same level and not nested in "order_details," the optimizer could potentially recognize that the entries are sorted by "created_at". Note that it is possible that the ordinal position is not present as I see no FOR ORDINALITY clause in the CREATE MULTIVALUE INDEX statement.
The index entries are ordered, and a TABLE ACCESS BY INDEX ROWID preserves that order when it is not BATCHED to reduce scattered reads, so a blocking sort operation could be avoided. This would allow the query to read only the necessary documents, but it is not what I observe in the execution plan: a HASH UNIQUE operation breaks the ordering in between. It is true that multiple index entries may be returned for the same document and we want each document only once, but I would have expected a SORT UNIQUE NOSORT to deduplicate while preserving the order.
Note that all of this is guesswork: I ran this on the autonomous managed service, where I do not have access to the datafiles to look at the internals of the multi-value index.
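On a self-managed database, one way to inspect those internals would be a B-Tree treedump of the index. This is a hedged sketch with placeholder values, not something I was able to run here:
-- find the object_id of the multi-value index
SELECT object_id
FROM user_objects
WHERE object_name = '$ora:orders.country_id_1_order_details.product_id_1_created_at_-1';
-- dump the index tree structure to a trace file (replace 12345 with the object_id found above)
ALTER SESSION SET EVENTS 'immediate trace name treedump level 12345';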
Conclusion
In exploring Oracle's multi-value indexes for filtering and sorting nested JSON data, a key performance difference emerges compared to MongoDB. Oracle's indexes handle complex JSON structures but do not use the key order for pagination, requiring an additional blocking sort, which is problematic for pagination queries. This was tested on Oracle 23ai (23.7.0.25.03), which reports a MongoDB compatibility version of 4.2.14, and may change in the future since the multi-value indexes apparently store the entries in order.
Thus, when evaluating a SQL database to be used as a document database, it's vital to consider both efficient JSON storage and the indexing options. A comment on the previous post shows that it's the same with PostgreSQL: GIN indexes, required for JSON paths through arrays, do not return entries in order.
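To illustrate that PostgreSQL point, here is a minimal sketch, assuming the orders table with a JSONB details column from the earlier post: a GIN index can serve the @> containment filter, but it returns matching rows in no particular order, so the ORDER BY ... LIMIT 10 still has to fetch and sort every match.
-- GIN index for JSONB containment (@>); it provides no ordering guarantee
CREATE INDEX orders_details_gin ON orders USING gin (details jsonb_path_ops);
-- the filter can use the GIN index, but the Top-10 still needs a sort over all matches
EXPLAIN (ANALYZE)
SELECT id
FROM orders
WHERE details @> '[{"product_id": 15}]'
ORDER BY created_at DESC
LIMIT 10;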