This article examines PostgreSQL operations that benefit from GIN indexes, as listed in Built-in GIN Operator Classes.
I created a table containing one million rows, plus three additional rows relevant to my queries. The goal is to demonstrate the benefits of using indexes, allowing for efficient data retrieval without the need to read all rows.
Sample table in PostgreSQL
Here is a table that simulates a document collection, with all attributes in a JSONB column:
CREATE TABLE data (
id SERIAL PRIMARY KEY,
details JSONB
);
INSERT INTO data (details)
SELECT jsonb_build_object(
'stuff', md5(random()::text)
)
FROM generate_series(1, 1000000);
ANALYZE data;
INSERT INTO data (details) VALUES
('{"name": "Alice", "age": 30, "tags": ["developer", "engineer"]}'),
('{"name": "Bob", "age": 25, "tags": ["designer"]}'),
('{"name": "Carol", "age": 40, "tags": ["developer", "manager"]}');
Sample collection on MongoDB
I am creating the same data in a MongoDB collection:
const bulk = db.data.initializeUnorderedBulkOp();
for (let i = 0; i < 1000000; i++) {
bulk.insert({
stuff: Math.random().toString(36).substring(2, 15)
});
}
bulk.execute();
db.data.insertMany([
{ "name": "Alice", "age": 30, "tags": ["developer", "engineer"] },
{ "name": "Bob", "age": 25, "tags": ["designer"] },
{ "name": "Carol", "age": 40, "tags": ["developer", "manager"] }
]);
@> (jsonb, jsonb): Contains
The following queries search for data where the "tags" array contains a value:
PostgreSQL:
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
MongoDB:
db.data.find({ tags: { $all: ["developer"] } })
;
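To make the containment rule concrete, here is a minimal Python sketch of the @> semantics (a simplification of PostgreSQL's documented jsonb containment behavior, not its actual implementation):

```python
def jsonb_contains(doc, pattern):
    """Simplified sketch of PostgreSQL's jsonb @> containment rule.
    Objects: every key/value pair of `pattern` must be contained in `doc`.
    Arrays: every element of `pattern` must be contained by some element of `doc`.
    Scalars: must be equal."""
    if isinstance(pattern, dict):
        return (isinstance(doc, dict) and
                all(k in doc and jsonb_contains(doc[k], v)
                    for k, v in pattern.items()))
    if isinstance(pattern, list):
        return (isinstance(doc, list) and
                all(any(jsonb_contains(d, p) for d in doc) for p in pattern))
    return doc == pattern

rows = [
    {"name": "Alice", "age": 30, "tags": ["developer", "engineer"]},
    {"name": "Bob", "age": 25, "tags": ["designer"]},
    {"name": "Carol", "age": 40, "tags": ["developer", "manager"]},
]
# Same predicate as details @> '{"tags": ["developer"]}'
matches = [r["name"] for r in rows if jsonb_contains(r, {"tags": ["developer"]})]
```

This is why `'{"tags": ["developer"]}'` matches Alice and Carol but not Bob, and why MongoDB's `$all` on the array expresses the same predicate.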
Without an index, this runs a SeqScan on PostgreSQL:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
QUERY PLAN
----------------------------------------------------------------
Seq Scan on data (actual time=144.002..144.004 rows=2 loops=1)
Filter: (details @> '{"tags": ["developer"]}'::jsonb)
Rows Removed by Filter: 1000001
Buffers: shared hit=5715 read=4595
Planning Time: 0.065 ms
Serialization: time=0.008 ms output=1kB format=text
Execution Time: 144.236 ms
and a COLLSCAN on MongoDB:
test> print(
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 437,
totalKeysExamined: 0,
totalDocsExamined: 1000003,
executionStages: {
stage: 'COLLSCAN',
filter: { tags: { '$eq': 'developer' } },
nReturned: 2,
executionTimeMillisEstimate: 391,
works: 1000004,
advanced: 2,
direction: 'forward',
docsExamined: 1000003
}
}
I create a jsonb_path_ops GIN index on PostgreSQL:
CREATE INDEX idx_details ON data USING GIN (details jsonb_path_ops)
;
and a sparse multi-key index on MongoDB for the array of tags:
db.data.createIndex({ tags: 1 }, { sparse: true })
;
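A jsonb_path_ops GIN index stores one hashed entry per path-to-leaf-value pair, each pointing to a posting list of rows. Here is a hedged Python sketch of that idea (the hashing and in-memory structures are illustrative only, not the real on-disk format):

```python
import hashlib
from collections import defaultdict

def path_value_keys(doc, path=()):
    """Yield one hashed key per (path, scalar value) pair, roughly as
    jsonb_path_ops hashes the whole path down to each leaf (sketch only)."""
    if isinstance(doc, dict):
        for k, v in doc.items():
            yield from path_value_keys(v, path + (k,))
    elif isinstance(doc, list):
        for v in doc:
            # array position is not part of the key
            yield from path_value_keys(v, path)
    else:
        yield hashlib.sha1(repr((path, doc)).encode()).hexdigest()

index = defaultdict(set)  # posting lists: hashed key -> set of row ids
rows = {
    1: {"name": "Alice", "age": 30, "tags": ["developer", "engineer"]},
    2: {"name": "Bob", "age": 25, "tags": ["designer"]},
    3: {"name": "Carol", "age": 40, "tags": ["developer", "manager"]},
}
for rid, doc in rows.items():
    for key in path_value_keys(doc):
        index[key].add(rid)

# A @> query probes the index with the keys derived from the query document
# and intersects the posting lists; the surviving rows are rechecked afterwards.
query_keys = list(path_value_keys({"tags": ["developer"]}))
candidates = set.intersection(*(index[k] for k in query_keys))
```

Because only hashes are stored, the index is compact, but every match must be rechecked against the real document, which is the Recheck Cond we see in the plans below.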
PostgreSQL uses the GIN index to find the two index entries through a Bitmap Index Scan, and then fetches the two rows, where the Recheck Cond didn't have to filter any further:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on data (actual time=0.019..0.021 rows=2 loops=1)
Recheck Cond: (details @> '{"tags": ["developer"]}'::jsonb)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_details (actual time=0.010..0.010 rows=2 loops=1)
Index Cond: (details @> '{"tags": ["developer"]}'::jsonb)
Buffers: shared hit=4
Planning:
Buffers: shared hit=1
Planning Time: 0.079 ms
Serialization: time=0.005 ms output=1kB format=text
Execution Time: 0.041 ms
MongoDB efficiently reads two index entries and retrieves two documents without needing bitmaps. This method preserves the index order, which is advantageous when handling multiple documents.
test> print(
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 0,
totalKeysExamined: 2,
totalDocsExamined: 2,
executionStages: {
stage: 'FETCH',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
docsExamined: 2,
inputStage: {
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: { tags: 1 },
indexName: 'tags_1',
isMultiKey: true,
multiKeyPaths: { tags: [ 'tags' ] },
isSparse: true,
isPartial: false,
direction: 'forward',
indexBounds: { tags: [ '["developer", "developer"]' ] },
keysExamined: 2,
seeks: 1,
dupsTested: 2
}
}
}
@? (jsonb, jsonpath): JSON Path Match
The following queries search for developers, as well as non-developers younger than 35:
On PostgreSQL:
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
On MongoDB:
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
});
Without an additional index on "age", MongoDB chooses a COLLSCAN:
test> print(
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
}).explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 585,
totalKeysExamined: 0,
totalDocsExamined: 1000003,
executionStages: {
isCached: false,
stage: 'SUBPLAN',
nReturned: 3,
executionTimeMillisEstimate: 547,
works: 1000004,
advanced: 3,
needTime: 1000000,
inputStage: {
stage: 'COLLSCAN',
filter: {
'$or': [
{ tags: { '$elemMatch': [Object] } },
{ age: { '$lt': 35 } }
]
},
nReturned: 3,
executionTimeMillisEstimate: 517,
works: 1000004,
advanced: 3,
needTime: 1000000,
direction: 'forward',
docsExamined: 1000003
}
}
}
PostgreSQL uses the GIN index, but not efficiently: it reads all index entries, only to remove them later during the recheck:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
QUERY PLAN
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on data (actual time=582.323..582.327 rows=3 loops=1)
Recheck Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
Rows Removed by Index Recheck: 1000000
Heap Blocks: exact=10310
Buffers: shared hit=14703
-> Bitmap Index Scan on idx_details (actual time=123.755..123.755 rows=1000003 loops=1)
Index Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
Buffers: shared hit=4393
Planning:
Buffers: shared hit=1
Planning Time: 0.117 ms
Serialization: time=0.009 ms output=1kB format=text
Execution Time: 582.575 ms
The solution, in both databases, is to add an index on "age".
An expression-based index in PostgreSQL:
CREATE INDEX idx_age ON data ( ((details->>'age')::int) )
;
A regular index on MongoDB, but sparse as I don't need to index missing values:
db.data.createIndex({ age: 1 }, { sparse: true });
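A sparse index skips documents that lack the indexed field, so the million filler documents (which have only "stuff") add no entries. A minimal Python sketch of that behavior, with an ordered list standing in for the B-tree:

```python
# Documents missing "age" get no index entry, keeping the index small.
docs = {
    1: {"stuff": "abc"},                  # filler row: no "age" -> not indexed
    2: {"name": "Bob", "age": 25},
    3: {"name": "Alice", "age": 30},
    4: {"name": "Carol", "age": 40},
}
# The sorted (key, row id) list stands in for the B-tree's ordered entries.
sparse_index = sorted((d["age"], rid) for rid, d in docs.items() if "age" in d)

# A range scan for age < 35 walks entries from the left until the bound fails.
hits = [rid for age, rid in sparse_index if age < 35]
```

The same reasoning applies to the PostgreSQL expression index above: rows whose details have no "age" key index a NULL, which the `< 35` predicate never matches.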
Here is the new execution plan in MongoDB, which can combine the multi-key index on "tags" with the regular index on "age":
print(
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
}).explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 4,
totalDocsExamined: 5,
executionStages: {
isCached: false,
stage: 'SUBPLAN',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
docsExamined: 3,
alreadyHasObj: 2,
inputStage: {
stage: 'OR',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
dupsTested: 4,
dupsDropped: 1,
inputStages: [
{
stage: 'FETCH',
filter: { '$or': [Array] },
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
docsExamined: 2,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: [Object],
indexName: 'tags_1',
isMultiKey: true,
multiKeyPaths: [Object],
isUnique: false,
isSparse: true,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: [Object],
keysExamined: 2,
seeks: 1,
dupsTested: 2,
dupsDropped: 0
}
},
{
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: { age: 1 },
indexName: 'age_1',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: true,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [Array] },
keysExamined: 2,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
]
}
}
}
}
I've detailed this OR expansion in a previous blog post.
PostgreSQL cannot do the same, so you need to write the query differently:
postgres=# explain (analyze, costs off, buffers, serialize text)
-- Query for using GIN index on "details" column
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer")'
UNION
-- Query for using B-tree index on "age" column
SELECT * FROM data
WHERE (details->>'age')::int < 35
;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (actual time=0.134..0.220 rows=3 loops=1)
Group Key: data.id, data.details
Batches: 1 Memory Usage: 1561kB
Buffers: shared hit=9
-> Append (actual time=0.023..0.035 rows=4 loops=1)
Buffers: shared hit=9
-> Bitmap Heap Scan on data (actual time=0.022..0.024 rows=2 loops=1)
Recheck Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_details (actual time=0.012..0.012 rows=2 loops=1)
Index Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
Buffers: shared hit=4
-> Bitmap Heap Scan on data data_1 (actual time=0.009..0.010 rows=2 loops=1)
Recheck Cond: (((details ->> 'age'::text))::integer < 35)
Heap Blocks: exact=1
Buffers: shared hit=4
-> Bitmap Index Scan on idx_age (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (((details ->> 'age'::text))::integer < 35)
Buffers: shared hit=3
Planning:
Buffers: shared hit=1
Planning Time: 0.160 ms
Serialization: time=0
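Both plans boil down to the same idea: run one index scan per OR branch, then deduplicate by primary key (MongoDB's OR stage, or the HashAggregate above the UNION's Append). A hedged Python sketch of that expansion, using the three sample rows:

```python
rows = {
    1: {"name": "Alice", "age": 30, "tags": ["developer", "engineer"]},
    2: {"name": "Bob", "age": 25, "tags": ["designer"]},
    3: {"name": "Carol", "age": 40, "tags": ["developer", "manager"]},
}
# One "index scan" per branch of the OR predicate.
branch_tags = [rid for rid, d in rows.items() if "developer" in d.get("tags", [])]
branch_age = [rid for rid, d in rows.items() if d.get("age", float("inf")) < 35]

# Append yields 4 row ids in total; dedup drops the row matching both branches,
# mirroring dupsTested: 4 / dupsDropped: 1 in the MongoDB plan and the
# Append rows=4 -> HashAggregate rows=3 in the PostgreSQL plan.
seen, result = set(), []
for rid in branch_tags + branch_age:
    if rid not in seen:
        seen.add(rid)
        result.append(rid)
```

Alice satisfies both branches, which is why four candidate ids collapse to three results in both databases.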
by Franck Pachot