Franck Pachot
This article examines PostgreSQL operations that benefit from GIN indexes, as listed in Built-in GIN Operator Classes.
I created a table containing one million rows, plus three additional rows relevant to my queries. The goal is to demonstrate the benefits of using indexes, allowing for efficient data retrieval without the need to read all rows.
Sample table in PostgreSQL
Here is a table that simulates a document collection, with all attributes in a JSONB column:
CREATE TABLE data (
id SERIAL PRIMARY KEY,
details JSONB
);
INSERT INTO data (details)
SELECT jsonb_build_object(
'stuff', md5(random()::text)
)
FROM generate_series(1, 1000000);
ANALYZE data;
INSERT INTO data (details) VALUES
('{"name": "Alice", "age": 30, "tags": ["developer", "engineer"]}'),
('{"name": "Bob", "age": 25, "tags": ["designer"]}'),
('{"name": "Carol", "age": 40, "tags": ["developer", "manager"]}');
Sample collection on MongoDB
I am creating the same data in a MongoDB collection:
const bulk = db.data.initializeUnorderedBulkOp();
for (let i = 0; i < 1000000; i++) {
bulk.insert({
stuff: Math.random().toString(36).substring(2, 15)
});
}
bulk.execute();
db.data.insertMany([
{ "name": "Alice", "age": 30, "tags": ["developer", "engineer"] },
{ "name": "Bob", "age": 25, "tags": ["designer"] },
{ "name": "Carol", "age": 40, "tags": ["developer", "manager"] }
]);
@> (jsonb, jsonb): Contains
The following queries search for data where the tag array contains a value:
PostgreSQL:
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
MongoDB:
db.data.find({ tags: { $all: ["developer"] } })
;
Without an index, this runs a SeqScan on PostgreSQL:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
QUERY PLAN
----------------------------------------------------------------
Seq Scan on data (actual time=144.002..144.004 rows=2 loops=1)
Filter: (details @> '{"tags": ["developer"]}'::jsonb)
Rows Removed by Filter: 1000001
Buffers: shared hit=5715 read=4595
Planning Time: 0.065 ms
Serialization: time=0.008 ms output=1kB format=text
Execution Time: 144.236 ms
and a COLLSCAN on MongoDB:
test> print(
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 437,
totalKeysExamined: 0,
totalDocsExamined: 1000003,
executionStages: {
stage: 'COLLSCAN',
filter: { tags: { '$eq': 'developer' } },
nReturned: 2,
executionTimeMillisEstimate: 391,
works: 1000004,
advanced: 2,
direction: 'forward',
docsExamined: 1000003
}
}
I create a jsonb_path_ops GIN index on PostgreSQL:
CREATE INDEX idx_details ON data USING GIN (details jsonb_path_ops)
;
and a sparse multi-key index on MongoDB for the array of tags:
db.data.createIndex({ tags: 1 } , { sparse: true })
;
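As a side note on the PostgreSQL index: the default jsonb_ops operator class could be used instead of jsonb_path_ops (a minimal sketch, with an index name of my choosing). It also supports the key-existence operators (?, ?|, ?&), at the cost of a typically larger index:
CREATE INDEX idx_details_ops ON data USING GIN (details)
;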
PostgreSQL uses the GIN index to find the two index entries, through a Bitmap Index Scan, and then fetches the two rows; the Recheck Cond didn't have to filter anything more:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on data (actual time=0.019..0.021 rows=2 loops=1)
Recheck Cond: (details @> '{"tags": ["developer"]}'::jsonb)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_details (actual time=0.010..0.010 rows=2 loops=1)
Index Cond: (details @> '{"tags": ["developer"]}'::jsonb)
Buffers: shared hit=4
Planning:
Buffers: shared hit=1
Planning Time: 0.079 ms
Serialization: time=0.005 ms output=1kB format=text
Execution Time: 0.041 ms
MongoDB efficiently reads two index entries and retrieves two documents without the need for bitmaps. This access path preserves the index order, which is advantageous when handling multiple rows.
test> print(
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 0,
totalKeysExamined: 2,
totalDocsExamined: 2,
executionStages: {
stage: 'FETCH',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
docsExamined: 2,
inputStage: {
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: { tags: 1 },
indexName: 'tags_1',
isMultiKey: true,
multiKeyPaths: { tags: [ 'tags' ] },
isSparse: true,
isPartial: false,
direction: 'forward',
indexBounds: { tags: [ '["developer", "developer"]' ] },
keysExamined: 2,
seeks: 1,
dupsTested: 2
}
}
}
@? (jsonb, jsonpath): JSON Path Match
The following query searches for developers, as well as non-developers younger than 35:
On PostgreSQL:
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
On MongoDB:
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
});
Without an additional index on "age", MongoDB chooses a COLLSCAN:
test> print(
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
}).explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 585,
totalKeysExamined: 0,
totalDocsExamined: 1000003,
executionStages: {
isCached: false,
stage: 'SUBPLAN',
nReturned: 3,
executionTimeMillisEstimate: 547,
works: 1000004,
advanced: 3,
needTime: 1000000,
inputStage: {
stage: 'COLLSCAN',
filter: {
'$or': [
{ tags: { '$elemMatch': [Object] } },
{ age: { '$lt': 35 } }
]
},
nReturned: 3,
executionTimeMillisEstimate: 517,
works: 1000004,
advanced: 3,
needTime: 1000000,
direction: 'forward',
docsExamined: 1000003
}
}
}
PostgreSQL uses the GIN index, but not efficiently: it reads all index entries and removes most of them later during the recheck:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
QUERY PLAN
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on data (actual time=582.323..582.327 rows=3 loops=1)
Recheck Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
Rows Removed by Index Recheck: 1000000
Heap Blocks: exact=10310
Buffers: shared hit=14703
-> Bitmap Index Scan on idx_details (actual time=123.755..123.755 rows=1000003 loops=1)
Index Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
Buffers: shared hit=4393
Planning:
Buffers: shared hit=1
Planning Time: 0.117 ms
Serialization: time=0.009 ms output=1kB format=text
Execution Time: 582.575 ms
The solution, in both databases, is to add an index on "age".
An expression-based index in PostgreSQL:
CREATE INDEX idx_age ON data ( ((details->>'age')::int) )
;
A regular index on MongoDB, declared sparse because I don't need to index documents where the field is missing:
db.data.createIndex({ age: 1 }, { sparse: true} );
Here is the new execution plan in MongoDB, which combines the multi-key index on "tags" with the regular index on "age":
print(
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
}).explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 4,
totalDocsExamined: 5,
executionStages: {
isCached: false,
stage: 'SUBPLAN',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
docsExamined: 3,
alreadyHasObj: 2,
inputStage: {
stage: 'OR',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
dupsTested: 4,
dupsDropped: 1,
inputStages: [
{
stage: 'FETCH',
filter: { '$or': [Array] },
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
docsExamined: 2,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: [Object],
indexName: 'tags_1',
isMultiKey: true,
multiKeyPaths: [Object],
isUnique: false,
isSparse: true,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: [Object],
keysExamined: 2,
seeks: 1,
dupsTested: 2,
dupsDropped: 0
}
},
{
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: { age: 1 },
indexName: 'age_1',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: true,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [Array] },
keysExamined: 2,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
]
}
}
}
}
I've detailed this OR expansion in a previous blog post.
PostgreSQL cannot apply the same OR expansion here, so you need to write the query differently, for example with a UNION:
postgres=# explain (analyze, costs off, buffers, serialize text)
-- Query for using GIN index on "details" column
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer")'
UNION
-- Query for using B-tree index on "age" column
SELECT * FROM data
WHERE (details->>'age')::int < 35
;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (actual time=0.134..0.220 rows=3 loops=1)
Group Key: data.id, data.details
Batches: 1 Memory Usage: 1561kB
Buffers: shared hit=9
-> Append (actual time=0.023..0.035 rows=4 loops=1)
Buffers: shared hit=9
-> Bitmap Heap Scan on data (actual time=0.022..0.024 rows=2 loops=1)
Recheck Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_details (actual time=0.012..0.012 rows=2 loops=1)
Index Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
Buffers: shared hit=4
-> Bitmap Heap Scan on data data_1 (actual time=0.009..0.010 rows=2 loops=1)
Recheck Cond: (((details ->> 'age'::text))::integer < 35)
Heap Blocks: exact=1
Buffers: shared hit=4
-> Bitmap Index Scan on idx_age (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (((details ->> 'age'::text))::integer < 35)
Buffers: shared hit=3
Planning:
Buffers: shared hit=1
Planning Time: 0.160 ms
Serialization: time=0
by Franck Pachot
Percona Database Performance Blog
If you’re running enterprise applications, you might be facing a common dilemma: cloud solutions offer great features, but regulatory constraints and data sovereignty laws might prevent you from using them fully. In fact, about 60% of enterprises struggle with this exact problem. While cloud computing delivers scalability and convenience, your organization might require on-premises or […]
by David Quilty
Tinybird Engineering Blog
Somehow we got stuck on the idea that big data systems should be slow. We're making it fast.
AWS Database Blog - Amazon Aurora
In this post, we demonstrate how you can implement scheduled scaling for Aurora Serverless using Amazon EventBridge Scheduler. By proactively adjusting minimum Aurora Capacity Units (ACUs), you can achieve faster scaling rates during peak periods while maintaining cost efficiency during low-demand times.
by Josh Hart
March 18, 2025
Percona Database Performance Blog
From punch cards to containerization, database technologies have changed tremendously over the years. And there has been no shortage of watershed moments along the way. However, we believe the emergence of advanced AI may very well be the most significant one yet. Percona certainly isn’t alone in that belief. But, when it comes to the […]
by Liz Warner
PlanetScale Blog
Our experience running AWS EBS at scale for critical workloads
March 17, 2025
Tinybird Engineering Blog
You're probably sick of hearing people talk about "vibe coding", but what about "vibe data engineering?" ;)
Franck Pachot
MongoDB promotes document data modeling rather than the highest normal forms of a relational model. To understand this approach, it's interesting to examine the reasons why E. F. Codd introduced normalization, as he describes in his article NORMALIZED DATA BASE STRUCTURE: A BRIEF TUTORIAL (1971), and compare them with modern application design:
make formatted data bases readily accessible to users (especially casual users) who have little or no training in programming
casual user at a terminal often has occasion to require tables to be printed out or displayed
Normalizing data structures into a set of two-dimensional tables was originally intended to simplify these structures for non-programmers and interactive users, a role taken by data scientists today. Those early users accessed databases through the devices of the time (teletypes, CRT terminals, and dot matrix printers).
This approach contrasts with the modern graphical user interface experience, where users can navigate and unfold nested structures using a mouse click.
Moreover, when the database client is a program rather than a human, it interacts with object-oriented structures that form a graph of interconnected objects instead of independent tables. In both cases, a document model, whether normalized or not, can be a better fit than the set of tables described by Codd (as long as data consistency and schema validation are maintained, of course).
It's also important to note that Codd's description of normalization focuses on the logical structure and the API, not the physical storage:
remember we are not discussing the physical representation
Data model to process data
Critiques of the document model promoted by MongoDB sometimes invoke Codd's relational theory as if the document model were heretical to established norms. It's important to clarify that physically storing nested structures doesn't violate the principles of normalized logical structures.
For modern applications, a document model API can be the best fit, just as highly normalized structures were once ideal for supporting casual users' ad-hoc queries. Both models serve different needs, and understanding this flexibility is crucial for leveraging the right approach in modern database applications.
Representing data as a set of independent tables makes it easy to display, but how do we efficiently process it, given the absence of navigable pointers and procedural languages?
To address this challenge, Codd introduced relational algebra, with powerful set-based operations such as projection (π), selection (σ), join (⋈), union (∪), intersection (∩), difference (−), product (×), and division (÷):
Now, it may be contended that tables are inadequate data structures to support facile manipulation of relationships between and within the tables. Application of elementary operations on relations (operations such as projection, join, and division) shows that this contention is false.
SQL is widely recognized as the declarative, English-based language that implements the operations defined by Codd's relational algebra, designed primarily for human users to interact with databases. However, the same operations can be achieved using MongoDB's aggregation pipelines on documents.
Notably, one operation, relational division — cited as essential for efficient data processing on normalized tables — can be quite complex to achieve in SQL databases but is more straightforward in MongoDB.
Relational Division
Relational division is essential for identifying relationships between tables based on subsets instead of individual rows. It finds records linked to every item in a specific subset. Practical business examples include:
- Finding employees who possess all required certifications.
- Identifying customers who purchase all items in a bundle.
- Locating students who have completed all required courses.
- Determining vendors supplying all contract materials.
- Discovering authors collaborating with all team members.
- Selecting athletes participating in all league events.
- Finding tutors who cover every curriculum topic.
- Identifying devices compatible with all software platforms.
- Finding the pilots who can fly all planes in the hangar.
This operation is vital for scenarios where comprehensive coverage or participation across a set of criteria is necessary, and while complex in SQL, it's more straightforward to implement in MongoDB.
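For reference, here is the algebraic definition (my formulation, not from the original article): given a relation R(pilot, plane) and a set of required planes S, the division returns the pilots that are paired in R with every plane in S:
R ÷ S = { p | ∀ s ∈ S : (p, s) ∈ R }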
Relational Division in SQL (normalized - 3NF)
The example of pilots who can fly all planes in the hangar is taken from the excellent article Divided We Stand: The SQL of Relational Division by Joe Celko (Redgate).
In PostgreSQL, the following normalized tables store the pilot and plane entities, as well as the relationships for planes in the hangar and pilot skills:
CREATE TABLE pilots (
pilot_id SERIAL PRIMARY KEY,
pilot_name VARCHAR NOT NULL
);
CREATE TABLE planes (
plane_id SERIAL PRIMARY KEY,
plane_name VARCHAR NOT NULL
);
CREATE TABLE pilot_skills (
pilot_id INT REFERENCES pilots(pilot_id),
plane_id INT REFERENCES planes(plane_id),
PRIMARY KEY (pilot_id, plane_id)
);
CREATE TABLE hangar (
plane_id INT PRIMARY KEY REFERENCES planes(plane_id)
);
I insert the same data as in the Redgate article:
INSERT INTO pilots (pilot_name) VALUES
('Celko'),
('Higgins'),
('Jones'),
('Smith'),
('Wilson')
;
INSERT INTO planes (plane_name) VALUES
('Piper Cub'),
('B-1 Bomber'),
('B-52 Bomber'),
('F-14 Fighter'),
('F-17 Fighter')
;
INSERT INTO pilot_skills (pilot_id, plane_id) VALUES
(1, 1), -- Celko: Piper Cub
(2, 3), (2, 4), (2, 1), -- Higgins: B-52 Bomber, F-14 Fighter, Piper Cub
(3, 3), (3, 4), -- Jones: B-52 Bomber, F-14 Fighter
(4, 2), (4, 3), (4, 4), -- Smith: B-1 Bomber, B-52 Bomber, F-14 Fighter
(5, 2), (5, 3), (5, 4), (5, 5) -- Wilson: B-1 Bomber, B-52 Bomber, F-14 Fighter, F-17 Fighter
;
INSERT INTO hangar (plane_id) VALUES
(2), -- B-1 Bomber
(3), -- B-52 Bomber
(4) -- F-14 Fighter
;
To find pilots who can fly all planes in the hangar, use a double negation: identify all pilots for whom no plane in the hangar exists that they cannot fly. This corresponds to a double NOT EXISTS in SQL:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
WHERE NOT EXISTS (
SELECT *
FROM hangar h
WHERE NOT EXISTS (
SELECT *
FROM pilot_skills ps
WHERE ps.pilot_id = p.pilot_id AND ps.plane_id = h.plane_id
)
);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
If you find it more readable, the second negation can be expressed with EXCEPT (or MINUS in some databases) instead of NOT EXISTS:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
WHERE NOT EXISTS (
SELECT plane_id FROM hangar
EXCEPT
SELECT plane_id FROM pilot_skills ps WHERE ps.pilot_id = p.pilot_id
);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
If you feel more like an accountant than a logician, you can bypass SQL's lack of relational division by counting pilot skills and comparing them to the number of planes in the hangar:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
JOIN pilot_skills ps ON p.pilot_id = ps.pilot_id
JOIN hangar h ON ps.plane_id = h.plane_id
GROUP BY p.pilot_id, p.pilot_name
HAVING COUNT(DISTINCT h.plane_id) = (SELECT COUNT(*) FROM hangar);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
In mathematics, relational division may have a remainder. How can you list pilots with additional skills (planes they can fly that are not in the hangar)? How would you identify pilots lacking required skills and specify the missing skills?
E. F. Codd argued that processing normalized tables is not difficult thanks to relational operations, but what happens when SQL databases do not implement some of them, such as division, directly? A document model may be easier.
Relational Division in SQL (unnormalized - 0NF)
Let's violate the first normal form and store the skills as an array (a native PostgreSQL datatype) embedded with the pilot:
CREATE TABLE pilotsWithSkills (
pilot_name VARCHAR PRIMARY KEY,
planes VARCHAR[] -- An array of plane names
);
INSERT INTO pilotsWithSkills (pilot_name, planes)
VALUES
('Celko', ARRAY['Piper Cub']),
('Higgins', ARRAY['B-52 Bomber', 'F-14 Fighter', 'Piper Cub']),
('Jones', ARRAY['B-52 Bomber', 'F-14 Fighter']),
('Smith', ARRAY['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter']),
('Wilson', ARRAY['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter', 'F-17 Fighter']);
I can use the <@ operator with a subquery to find the pilots who can fly all planes in the hangar:
SELECT pilot_name
FROM pilotsWithSkills
WHERE array(
select plane_name from hangar join planes using (plane_id)
) <@ planes
;
With a GIN index on the array, an execution plan using a Bitmap Index Scan with a Recheck Cond can optimize this search.
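As a minimal sketch, such an index could be created as follows (the index name is mine, not from the original post):
CREATE INDEX idx_pilot_planes ON pilotsWithSkills USING GIN (planes)
;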
I can also create the list in JSONB instead of ARRAY and use jsonb_agg() instead of array(). But another possibility is to use a document database.
Relational Division in MongoDB
To normalize without the complexity of embedding arrays or JSON in SQL tables, MongoDB can store and index documents natively:
db.pilotSkills.insertMany([
{ pilot_name: 'Celko', planes: ['Piper Cub'] },
{ pilot_name: 'Higgins', planes: ['B-52 Bomber', 'F-14 Fighter', 'Piper Cub'] },
{ pilot_name: 'Jones', planes: ['B-52 Bomber', 'F-14 Fighter'] },
{ pilot_name: 'Smith', planes: ['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter'] },
{ pilot_name: 'Wilson', planes: ['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter', 'F-17 Fighter'] }
]);
db.hangar.insertMany([
{ plane_name: 'B-1 Bomber' },
{ plane_name: 'B-52 Bomber' },
{ plane_name: 'F-14 Fighter' }
]);
db.pilotSkills.createIndex({ planes: 1 });
Here is how I find the pilots who can fly all planes in the hangar:
mdb> db.pilotSkills.find({
planes: { $all: db.hangar.distinct("plane_name") }
} , { pilot_name: 1, _id: 0})
;
[ { pilot_name: 'Smith' }, { pilot_name: 'Wilson' } ]
If you look at the execution plan (by adding .explain()), you will see that the MongoDB multi-plan optimizer has considered using the index for each value individually, in case one is more selective, as well as the combination of the three with an AND_SORTED stage.
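For example, the candidate plans can be inspected like this (a quick sketch; the 'queryPlanner' verbosity lists the winning and rejected plans):
db.pilotSkills.find({
  planes: { $all: db.hangar.distinct("plane_name") }
}).explain('queryPlanner')
;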
It is possible to do it in one query, and even as a more complex query, such as adding the extra skills of each pilot (the planes not in the hangar) for a relational division with remainder:
db.pilotSkills.aggregate([
// Lookup hangar data to include required planes
{
$lookup: {
from: "hangar",
pipeline: [
{ $group: { _id: null, plane_names: { $addToSet: "$plane_name" } } }
],
as: "hangar_planes"
}
},
// Unwind the hangar_planes array
{
$unwind: "$hangar_planes"
},
// Match pilots who can fly all planes in the hangar
{
$match: {
$expr: {
$setIsSubset: ["$hangar_planes.plane_names", "$planes"]
}
}
},
// Add a field for extra skills beyond the required planes
{
$addFields: {
extraSkills: {
$filter: {
input: "$planes",
as: "plane",
cond: { $not: { $in: ["$$plane", "$hangar_planes.plane_names"] } }
}
}
}
},
// Project only the desired fields
{
$project: {
pilot_name: 1,
extraSkills: 1,
_id: 0
}
}
]);
[
{ pilot_name: 'Smith', extraSkills: [] },
{ pilot_name: 'Wilson', extraSkills: [ 'F-17 Fighter' ] }
]
With $setIsSubset, MongoDB implements relational division with a better developer experience than SQL's double-negated subqueries or counting. PostgreSQL's JSONB adds document operations, but they can become more complex, for example for division with remainder. MongoDB simplifies this through a series of steps in an aggregation pipeline: retrieving the planes from the hangar, unwinding them into an array to compare with each pilot's array of skills, applying the division logic between the two, calculating the remainder, and projecting the pilot names with their extra skills.
It may be surprising, but SQL must use complex subqueries, while MongoDB provides a direct method for this relational algebra operator, enabling efficient processing of normalized collections.
by Franck Pachot
Percona Database Performance Blog
In this blog, we will look at the differences between LTS (Long Term Stable) versions of Percona Server for MySQL. Released in April 2019, MySQL 8.0 represented a major change from the previous version, 5.7, introducing significant changes to the data dictionary and enabling many features and enhancements. It also provided no direct downgrade path, […]
by Dov Endress
Tinybird Engineering Blog
Run Tinybird on your machine. Deploy your data project locally or to the cloud in the exact same way, as it should be.
March 16, 2025
Small Datum - Mark Callaghan
Are MySQL 5.7 and 8.0 faster than 5.6? That depends a lot on the workload -- both types of SQL and amount of concurrency. Here I summarize results from sysbench on a larger server (48 cores) using 1, 4, 6, 8, 10, 20 and 40 clients to show how things change.
tl;dr
- the workload here is microbenchmarks with a database cached by InnoDB
- 5.7.44 is faster than 8.0.x at all concurrency levels on most microbenchmarks
- for 5.6.51 vs 8.0.x
- for point queries, 5.6.51 is faster at <= 8 clients
- for range queries without aggregation 5.6.51 is always faster
- for range queries with aggregation 5.6.51 is faster except at 40 clients
- for writes, 5.6.51 is almost always faster at 10 or fewer clients (excluding update-index)
Performance summaries
For point queries:
- 5.7.44 is always faster than 8.0
- 8.0.28 suffers from bug 102037 - found by me with sysbench, fixed in 8.0.31
- at what level of concurrency do most things get faster in 5.7 & 8.0 vs 5.6?
- 5.7.44 becomes faster than 5.6.51 at 6+ clients
- 8.0.x becomes faster than 5.6.51 at between 10 and 20 clients
- Two of the microbenchmarks are always faster in 5.6.51 - both do non-covering queries on a secondary index
For range queries without aggregation
- 5.7.44 is always faster than 8.0.x
- 5.6.51 is always faster than 5.7.44 and 8.0.x
For range queries with aggregation
- 5.7.44 is almost always faster than 8.0.x
- 5.7.44 becomes faster than 5.6.51 at 6+ clients
- 8.0.x becomes faster than 5.6.51 at 40 clients
For writes
- For update-index
- 5.7.44 and 8.0.x are always faster than 5.6.51 at 4+ clients
- There is an odd drop from ~6X to ~3X for 8.0.32 and 8.0.39 at 20 clients
- 5.7.44 is mostly faster than 8.0.x for 1 to 20 clients and they have similar results at 40 clients
- 5.7.44 & 8.0.x are always faster than 5.6.51 at 20+ clients
Builds, configuration and hardware
I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.28, 8.0.32, 8.0.39 and 8.0.41.
The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.
Benchmark
I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.
The tests run with 8 tables and 10M rows/table. There are 40 client threads, read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds.
The command lines to run all tests are:
bash r.sh 8 10000000 180 300 md2 1 1 1
bash r.sh 8 10000000 180 300 md2 1 1 4
bash r.sh 8 10000000 180 300 md2 1 1 6
bash r.sh 8 10000000 180 300 md2 1 1 8
bash r.sh 8 10000000 180 300 md2 1 1 10
bash r.sh 8 10000000 180 300 md2 1 1 20
bash r.sh 8 10000000 180 300 md2 1 1 40
Results
For the results below I split the microbenchmarks into 4 groups: point queries, range queries without aggregation, range queries with aggregation, and writes. The spreadsheet with all data is here. Files with performance summaries for relative and absolute QPS are here.
Values from iostat and vmstat per microbenchmark are here for 1 client, 4 clients, 6 clients, 8 clients, 10 clients, 20 clients and 40 clients. These help to explain why something is faster or slower because they show how much HW is used per query.
The relative QPS is the following, where $version is >= 5.7.44:
(QPS for $version) / (QPS for MySQL 5.6.51)
The numbers in the spreadsheets are the relative QPS. When the relative QPS is > 1 then $version is faster than MySQL 5.6.51. When it is 3.0 then $version is 3X faster than the base case.
Results: charts
Notes on the charts
- the y-axis shows the relative QPS
- the y-axis starts at 0.80 to make it easier to see differences
- in some cases the y-axis truncates the good outliers, cases where the relative QPS is greater than 1.5. I do this to improve readability for values near 1.0. Regardless, the improvements are nice.
Results: point queries
Summary
- 5.7.44 is always faster than 8.0
- 8.0.28 suffers from bug 102037 - found by me with sysbench, fixed in 8.0.31
- at what level of concurrency do most things get faster in 5.7 & 8.0 vs 5.6?
- 5.7.44 becomes faster than 5.6.51 at 6+ clients
- 8.0.x becomes faster than 5.6.51 at between 10 and 20 clients
- Two of the microbenchmarks are always faster in 5.6.51 - both do non-covering queries on a secondary index
Results: range queries without aggregation
- 5.7.44 is always faster than 8.0.x
- 5.6.51 is always faster than 5.7.44 and 8.0.x
Results: range queries with aggregation
- 5.7.44 is almost always faster than 8.0.x
- 5.7.44 becomes faster than 5.6.51 at 6+ clients
- 8.0.x becomes faster than 5.6.51 at 40 clients
The relative speedup for the update-index microbenchmark is frequently so large that it obscures the smaller changes on other microbenchmarks. So here I truncate the y-axis for some of the charts (for 6+ clients) and the section that follows has the charts without truncation.
Summary
- For update-index
- 5.7.44 and 8.0.x are always faster than 5.6.51 at 4+ clients
- There is an odd drop from ~6X to ~3X for 8.0.32 and 8.0.39 at 20 clients but you can't see that on the charts in this section because of the truncation. It is visible in the next section. From vmstat I see an increase in CPU/operation (cpu/o) and context switches /operation (cs/o) at 20 clients but not at 40 clients.
- 5.7.44 is mostly faster than 8.0.x for 1 to 20 clients and they have similar results at 40 clients
- 5.7.44 & 8.0.x are always faster than 5.6.51 at 20+ clients
Results: charts for writes without truncation
The y-axis is truncated in the charts for writes in the previous section for 6+ clients. This section has those charts without truncation.
by Mark Callaghan (noreply@blogger.com)