March 25, 2025
Integrate natural language processing and generative AI with relational databases
The simplest way to count 100 billion unique IDs: Part 1
PlanetScale vectors is now GA
Phil Eaton on Technical Blogging
March 24, 2025
Cedar: A New Language for Expressive, Fast, Safe, and Analyzable Authorization
This paper (2024) presents Cedar, AWS's new authorization policy language. By providing a clear, declarative way to manage access control policies, Cedar addresses the common limitations of embedding authorization logic directly into application code: problems with correctness, auditing, and maintenance. Cedar introduces a domain-specific language (DSL) for expressing policies separately from application code, which improves readability and manageability. In that sense, this is like aspect-oriented programming, but for authorization policy.
The language is designed with four main objectives: expressiveness, performance, safety, and analyzability. Cedar balances these goals by supporting role-based (RBAC), attribute-based (ABAC), and relationship-based (ReBAC) access control models.
Policy Structure and Evaluation
Cedar policies consist of three primary components: effect, scope, and conditions. The effect can either be "permit" or "forbid", defining whether access is granted or denied. The scope specifies the principal (user), action, and resource. Conditions provide additional constraints using entity attributes.
Entities in Cedar represent users, resources, etc. Each entity belongs to a specific type, such as User, Team, or List. Entities are uniquely identified by a combination of their type and a string identifier (e.g., User::"andrew"). Entity attributes are key-value pairs associated with entities, providing additional information. Attributes can include primitive data types like strings or numbers, collections like lists or sets, or references to other entities. For example, a List entity might have attributes like owner, readers, editors, and tasks, where owner references a User entity, and readers references a set of Team entities. Entities and their attributes form a hierarchical structure, which Cedar uses to evaluate policies efficiently.
Cedar ensures safety by applying a deny-by-default model. If no "permit" policy is applicable, access is denied. Additionally, "forbid" policies always take precedence over "permit" policies.
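As a rough illustration of that decision rule, here is a minimal sketch in JavaScript (not Cedar's implementation; the appliesTo check stands in for scope matching and condition evaluation, and the policy objects are hypothetical):
function isAuthorized(request, policies) {
  // Policies whose scope and conditions match the request.
  const applicable = policies.filter(p => p.appliesTo(request));
  // An applicable "forbid" always wins over any "permit".
  if (applicable.some(p => p.effect === 'forbid')) return 'Deny';
  // A "permit" allows the request only when no "forbid" applies.
  if (applicable.some(p => p.effect === 'permit')) return 'Allow';
  // Deny-by-default: no applicable policy means no access.
  return 'Deny';
}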
A schema defines entity types, attributes, and valid actions, and Cedar's policy validator uses optional typing and static capabilities to catch potential errors, like accessing non-existent attributes. Another key feature is policy templates, which allow developers to define reusable policy patterns with placeholders. These placeholders are instantiated with specific entities or attributes at runtime, reducing redundancy and simplifying policy management.
Symbolic Analysis and Proofs
Cedar supports deeper policy analysis through symbolic compilation. Symbolic compilation reduces policies to SMT (Satisfiability Modulo Theories) formulas, enabling automated reasoning about policy behavior. This allows checking for policy equivalence, identifying inconsistencies, and verifying security invariants.
Checking policy equivalence is particularly useful for ensuring that policy refactoring or updates do not introduce unintended behavior. By comparing two versions of a policy set, Cedar can determine if they produce identical authorization decisions for all possible requests. This is crucial for organizations with frequent policy updates to ensure no permissions are inadvertently granted or revoked.
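Cedar performs this check symbolically with SMT rather than by enumeration, but as a toy illustration, verifying equivalence over a small, finite set of requests (reusing the isAuthorized sketch above) would look like this:
// Returns true if both policy sets decide every request in `requests` the
// same way. Cedar's analyzer proves this for all possible requests via SMT.
function equivalentOn(requests, policiesA, policiesB) {
  return requests.every(r => isAuthorized(r, policiesA) === isAuthorized(r, policiesB));
}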
By compiling to SMT, Cedar leverages the rapid advancements in SMT solvers over the past two decades. Improvements in solver algorithms, better heuristics for decision procedures, and more efficient memory management have significantly enhanced SMT solver performance. Tools like Z3 and CVC5 are now capable of solving complex logical formulas quickly, making real-time policy analysis feasible. These advancements enable Cedar to perform sophisticated policy checks with minimal overhead.
Cedar is formalized in the Lean programming language and uses its proof assistant to establish key properties like correctness of the authorization algorithm, sound slicing, and validation soundness. The compiler's encoding is proved to be decidable, sound, and complete. This result, the first of its kind for a non-trivial policy language, is made possible by Cedar's careful control over expressiveness and by leveraging invariants ensured by Cedar's policy validator.
Implementation and Evaluation
Cedar is implemented in Rust and open-sourced at https://github.com/cedar-policy/. The implementation is extensively tested using differential random testing to ensure correctness.
Cedar was evaluated against two prominent open-source, general-purpose authorization languages, OpenFGA and Rego, using three example sets of policies. The evaluation demonstrated that Cedar's policy slicing significantly reduces evaluation time. For large policy sets, Cedar achieved a 28.7x to 35.2x speedup over OpenFGA and a 42.8x to 80.8x speedup over Rego.
Unlike OpenFGA and Rego, which show scaling challenges with increased input sizes, Cedar maintains consistently low evaluation latency. In a simulated Google Drive authorization scenario, Cedar handled requests in a median time of 4.0 microseconds (𝜇s) with 5 Users/Groups/Documents/Folders, increasing only to 5.0𝜇s with 50 Users/Groups/Documents/Folders. Similarly, in a GitHub-like authorization scenario, Cedar maintained a median performance of around 11.0𝜇s across all input ranges. Even at the 99th percentile (p99), Cedar's response times remained low, with under 10𝜇s for Google Drive and under 20𝜇s for GitHub. Further validating its real-world applicability, Cedar is deployed at scale within Amazon Web Services, providing secure and efficient authorization for large and complex systems.
MongoDB TTL and Disk Storage
In a previous blog post, I explained how MongoDB TTL indexes work and how they are optimized to avoid fragmentation during scans. However, I didn't cover the details of on-disk storage. A recent Reddit question gives me the occasion to explore this aspect further.
Reproducible example
Here is a small program that inserts documents in a loop, with a timestamp and some random text:
// random string to be not too friendly with compression
function getRandomString(length) {
const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
let result = '';
const charactersLength = characters.length;
for (let i = 0; i < length; i++) {
result += characters.charAt(Math.floor(Math.random() * charactersLength));
}
return result;
}
// MongoDB loop for inserting documents with a random string
while (true) {
const doc = { text: getRandomString(1000), ts: new Date() };
db.ttl.insertOne(doc);
insertedCount++;
}
Before executing this, I ran a background function to display statistics every minute, including the number of records, memory usage, and disk size.
// Prints stats every minute
let insertedCount = 0;
let maxThroughput = 0;
let maxCollectionCount = 0;
let maxCollectionSize = 0;
let maxStorageSize = 0;
let maxTotalIndexSize = 0;
setInterval(async () => {
const stats = await db.ttl.stats();
const throughput = insertedCount / 10; // assuming measure over 10 seconds
const collectionCount = stats.count;
const collectionSizeMB = stats.size / 1024 / 1024;
const storageSizeMB = stats.storageSize / 1024 / 1024;
const totalIndexSizeMB = stats.totalIndexSize / 1024 / 1024;
maxThroughput = Math.max(maxThroughput, throughput);
maxCollectionCount = Math.max(maxCollectionCount, collectionCount);
maxCollectionSize = Math.max(maxCollectionSize, collectionSizeMB);
maxStorageSize = Math.max(maxStorageSize, storageSizeMB);
maxTotalIndexSize = Math.max(maxTotalIndexSize, totalIndexSizeMB);
console.log(`Collection Name: ${stats.ns}
Throughput: ${throughput.toFixed(0).padStart(10)} docs/min (max: ${maxThroughput.toFixed(0)} docs/min)
Collection Size: ${collectionSizeMB.toFixed(0).padStart(10)} MB (max: ${maxCollectionSize.toFixed(0)} MB)
Number of Records: ${collectionCount.toFixed(0).padStart(10)} (max: ${maxCollectionCount.toFixed(0)} docs)
Storage Size: ${storageSizeMB.toFixed(0).padStart(10)} MB (max: ${maxStorageSize.toFixed(0)} MB)
Total Index Size: ${totalIndexSizeMB.toFixed(0).padStart(10)} MB (max: ${maxTotalIndexSize.toFixed(0)} MB)`);
insertedCount = 0;
}, 60000); // every minute
I created the collection with a TTL index, which automatically expires data older than five minutes:
// TTL expire after 5 minutes
db.ttl.drop();
db.ttl.createIndex({ ts: 1 }, { expireAfterSeconds: 300 });
I let this run to see how the storage size evolves. Note that this was run on MongoDB 7.0.16 (without the auto-compact job that appeared in 8.0).
Output after 3 hours
The consistent insert rate, combined with TTL expiration, keeps the number of documents in the collection relatively stable. Deletions occur every minute, so the overall document count stays roughly constant.
I observe the same from the statistics I log every minute:
The storage size also remains constant, at 244 MB for the collection and 7 MB for the indexes. The file size increased during the first six minutes and then remained constant:
This is sufficient to show that the deletions and insertions do not cause fragmentation that would require additional attention. About 25% of the space is marked as available for reuse and is effectively reused.
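To check this yourself, the WiredTiger section of the collection statistics exposes the reusable space directly. Here is a quick mongosh snippet (the block-manager counter is a standard WiredTiger statistic; the formatting is mine):
const s = db.ttl.stats();
// Bytes in the collection file that WiredTiger has marked as free for reuse.
const reusable = s.wiredTiger["block-manager"]["file bytes available for reuse"];
print(`storage: ${(s.storageSize / 1024 / 1024).toFixed(0)} MB, ` +
      `reusable: ${(reusable / 1024 / 1024).toFixed(0)} MB ` +
      `(${(100 * reusable / s.storageSize).toFixed(0)}%)`);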
It is possible to force a compaction, to temporarily reclaim more space, but it is not necessary:
Collection Name: test.ttl
Throughput: 3286 docs/min (max: 3327 docs/min)
Collection Size: 170 MB (max: 198 MB)
Number of Records: 171026 (max: 198699 docs)
Storage Size: 244 MB (max: 244 MB)
Total Index Size: 7 MB (max: 7 MB)
Collection Name: test.ttl
Throughput: 3299 docs/min (max: 3327 docs/min)
Collection Size: 170 MB (max: 198 MB)
Number of Records: 170977 (max: 198699 docs)
Storage Size: 244 MB (max: 244 MB)
Total Index Size: 6 MB (max: 7 MB)
Collection Name: test.ttl
Throughput: 3317 docs/min (max: 3327 docs/min)
Collection Size: 170 MB (max: 198 MB)
Number of Records: 170985 (max: 198699 docs)
Storage Size: 244 MB (max: 244 MB)
Total Index Size: 6 MB (max: 7 MB)
test> db.runCommand({ compact: 'ttl' });
{ bytesFreed: 49553408, ok: 1 }
Collection Name: test.ttl
Throughput: 1244 docs/min (max: 3327 docs/min)
Collection Size: 150 MB (max: 198 MB)
Number of Records: 150165 (max: 198699 docs)
Storage Size: 197 MB (max: 244 MB)
Total Index Size: 6 MB (max: 7 MB)
Collection Name: test.ttl
Throughput: 3272 docs/min (max: 3327 docs/min)
Collection Size: 149 MB (max: 198 MB)
Number of Records: 149553 (max: 198699 docs)
Storage Size: 203 MB (max: 244 MB)
Total Index Size: 6 MB (max: 7 MB)
While this has reduced the storage size, it eventually returns to its normal volume. It's typical for a B-tree to maintain some free space, which helps to minimize frequent space allocations and reclamations.
Here is a closer look at the point where I ran the manual compaction:
Conclusion
TTL deletion makes space available for reuse instead of reclaiming it immediately, but this doesn't increase fragmentation. The space is reused automatically, keeping the total size proportional to the document count, with constant free space of about 25% in my case to minimize the frequent allocations typical of B-tree implementations.
The TTL mechanism operates autonomously, requiring no manual compaction. If you have any doubt, monitor it. MongoDB offers statistics to compare logical and physical sizes at both the MongoDB layer and WiredTiger storage.
Percona XtraBackup 8.4 Pro: Reduce Server Locking by up to 4300X
Reducing PostgreSQL Costs in the Cloud
Run Tinybird on your own infrastructure
March 21, 2025
Less ceremony, more shipping
Migrating from Fauna to Supabase
March 20, 2025
Percona Monitoring and Management 2: Clarifying the End-of-Life and Transition to PMM 3
ParallelRaft: Out-of-Order Executions in PolarFS
This paper (2021) dives deeper into the ParallelRaft protocol introduced with PolarFS.
PolarFS (VLDB'18) is a distributed file system with ultra-low latency and high availability, developed by Alibaba. Its variant of the Raft consensus protocol, ParallelRaft, relaxes Raft's strict serialization constraints. ParallelRaft allows state machines to commit and execute log entries out of order.
This study provides a formal specification of ParallelRaft in TLA+, proves its correctness, and identifies consistency issues caused by ghost log entries. To address these issues, the refined ParallelRaft-CE protocol limits parallelism during leader transitions.
Introduction
Raft is a tight-ass. It enforces sequential commitment and execution of log entries. No funny business. Multi-Paxos is a bit rebellious. It allows out-of-order commitment, but it still insists on ordered execution. In order not to leave any performance on the table, PolarFS's ParallelRaft rebels completely: it not only allows out-of-order commitment, but out-of-order execution as well.
How does ParallelRaft keep consistency of state machine replication across replicas? The trick is to permit out-of-order execution if commands are conflict-free. It is the oldest trick in the book, the generalized Paxos idea. Each command contains a Logical Block Address (LBA). Commands accessing non-overlapping LBAs are conflict-free and can execute in parallel (i.e., out of order). Overlapping commands must execute in log order.
Ok, still this is a big claim. Concurrency is a wild beast! Did you expect this to be simple? What about leader turnovers? There is a lot to consider there. This paper complains that the original PolarFS paper was too quick to call it a day without specifying the protocol completely or providing an open-source implementation. They suspect that, in the presence of leader turnovers, the ParallelRaft protocol is vulnerable to inconsistency induced by ghost log entries.
To investigate, they introduce an intermediate protocol, ParallelRaft-SE (Sequential Execution), which allows out-of-order commitment but mandates sequential execution. The point of ParallelRaft-SE is to establish a refinement mapping to Multi-Paxos. They then relax it to get ParallelRaft-CE (Concurrent Execution), which supports both out-of-order commitment and execution. They show that ParallelRaft-CE is prone to state inconsistency due to ghost log entries, and they propose to mitigate this issue using stricter leader election rules and log recovery mechanisms, forbidding out-of-order execution during leader takeover.
Out-of-Order Executions and Ghost Log Entries
A key challenge of out-of-order execution is determining conflicts when log entries are missing. How do you track conflicts with holes in the log? ParallelRaft introduces a Look Behind Buffer (LBF) that tracks the LBAs of up to K preceding log entries. If no holes exceeding K exist, conflict detection is feasible.
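To make the idea concrete, here is a minimal sketch in JavaScript of how a follower could use the look-behind buffer to decide whether an entry with missing predecessors is safe to execute (the entry layout and window size are hypothetical; this is an illustration, not the PolarFS implementation):
// Each entry records the LBAs it writes, plus, in lookBehind[d], the LBAs
// written by the entry d positions before it (for d = 1..K).
const K = 2; // look-behind window size (illustrative)

function canExecuteOutOfOrder(entry, missingIndexes) {
  // A hole older than the look-behind window cannot be checked: wait for it.
  if (missingIndexes.some(i => entry.index - i > K)) return false;
  // Conflict-free only if no missing predecessor touches the same LBAs.
  return missingIndexes.every(i => {
    const predecessorLBAs = entry.lookBehind[entry.index - i] || [];
    return !predecessorLBAs.some(lba => entry.lbas.includes(lba));
  });
}

// Example: entry 5 writes LBA 100 while entries 3 and 4 are still missing.
const entry5 = { index: 5, lbas: [100], lookBehind: { 1: [200], 2: [300] } };
console.log(canExecuteOutOfOrder(entry5, [3, 4])); // true: no LBA overlap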
However, the paper argues ghost log entries can cause inconsistency as in Figure 1.
In phase 1, leader s1 creates entries 1-4, with only entries 1-2 being committed before s1 fails. When s3 becomes leader in phase 2, it doesn't receive s1's uncommitted entries 3-4 during recovery. S3 then adds its own entries 3-6, commits entry 6 (which sets y←2), and executes this operation before failing.
In phase 3, new leader s2 unexpectedly receives entry 4 from the now-recovered s1 during recovery. After committing this entry, s2 must execute it (setting y←3). This operation should have executed before entry 6 according to ordering rules, but s3 has already executed y←2.
This "ghost log entries" phenomenon--where s1's uncommitted entries disappear during s3's leadership only to reappear during s2's leadership-- creates inconsistent execution order. The actual execution conflicts with what should have happened based on log position, leading to incorrect conflict determination and system inconsistency.
The proposed ParallelRaft-CE patch
ParallelRaft-CE addresses ghost entries by refining ParallelRaft-SE. The key idea is to track the moment a log entry was generated through the sync number, and to fence off entries from old terms.
The protocol creates a clear boundary between terms by allowing concurrent operations only for log entries with the same term, while enforcing sequential processing for entries from different terms. It tracks entry generation through the sync number and adds a "date" variable to each log entry (similar to Paxos proposal numbers) to determine recency.
During recovery, the new leader collects log entries from a majority of nodes and selects the latest entries based on their "date" values. This ensures that already committed entries remain in the log, previously uncommitted entries are either properly committed or discarded, and no ghost entries can persist undetected across term changes. By introducing this controlled concurrency model where the sync number is always less than or equal to a node's term, ParallelRaft-CE effectively uses leader election as a sequentialization bottleneck to clean up potential inconsistencies.
ParallelRaft-CE has three key components:
1. Log Synchronization Mechanism: Followers accept only log entries matching their sync number (equivalent to the current term). Entries from different terms are rejected.
2. Leader Election Mechanism: Similar to Raft, but with an additional check to ensure no ghost entries exist. Leaders are responsible for reconciling logs and eliminating inconsistencies.
3. Log Recovery Mechanism: During recovery, the new leader collects logs from a majority of nodes and applies a conflict resolution process. Uncommitted entries from the previous term are either committed or discarded.
In sequential Raft, leader transition is simple because the elected leader is already caught up: Raft logs have no holes, and the log matching property between nodes is guaranteed. But in ParallelRaft, since there are holes, leader election must be augmented with a recovery phase, as in Paxos leader transitions. This is the price to pay for allowing out-of-order commit and execution.
The ParallelRaft-CE protocol limits out-of-order execution to entries within the same term. Log entries from different terms follow a sequential recovery and commitment process to eliminate ghost entries and guarantee state consistency.
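A minimal sketch of that recovery step, assuming each collected entry carries an index and the "date" value described above (this is an illustration of the idea, not the specified protocol):
// The new leader merges the logs collected from a majority of nodes and, for
// each log index, keeps the most recent entry according to its "date", so
// stale (ghost) entries written under old terms lose to anything newer.
function recoverLog(majorityLogs) {
  const chosen = new Map(); // index -> winning entry
  for (const log of majorityLogs) {
    for (const entry of log) {
      const current = chosen.get(entry.index);
      if (!current || entry.date > current.date) chosen.set(entry.index, entry);
    }
  }
  // Return the reconciled log in index order.
  return [...chosen.values()].sort((a, b) => a.index - b.index);
}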
Writing tests sucks. Use LLMs so it sucks less.
Migrating from the MongoDB Data API to Supabase
Faster interpreters in Go: Catching up with C++
March 19, 2025
MongoDB equivalent for PostgreSQL JSONB operations
This article examines PostgreSQL operations that benefit from GIN indexes, as listed in Built-in GIN Operator Classes.
I created a table containing one million rows, plus three additional rows relevant to my queries. The goal is to demonstrate the benefits of using indexes, allowing for efficient data retrieval without the need to read all rows.
Sample table in PostgreSQL
Here is a table that simulates a document collection, with all attributes in a JSONB column:
CREATE TABLE data (
id SERIAL PRIMARY KEY,
details JSONB
);
INSERT INTO data (details)
SELECT jsonb_build_object(
'stuff', md5(random()::text)
)
FROM generate_series(1, 1000000);
ANALYZE data;
INSERT INTO data (details) VALUES
('{"name": "Alice", "age": 30, "tags": ["developer", "engineer"]}'),
('{"name": "Bob", "age": 25, "tags": ["designer"]}'),
('{"name": "Carol", "age": 40, "tags": ["developer", "manager"]}');
Sample collection on MongoDB
I am creating the same data in a MongoDB collection:
const bulk = db.data.initializeUnorderedBulkOp();
for (let i = 0; i < 1000000; i++) {
bulk.insert({
stuff: Math.random().toString(36).substring(2, 15)
});
}
bulk.execute();
db.data.insertMany([
{ "name": "Alice", "age": 30, "tags": ["developer", "engineer"] },
{ "name": "Bob", "age": 25, "tags": ["designer"] },
{ "name": "Carol", "age": 40, "tags": ["developer", "manager"] }
]);
@> (jsonb, jsonb): Contains
The following queries search for data where the tag array contains a value:
PostgreSQL:
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
MongoDB:
db.data.find({ tags: { $all: ["developer"] } })
;
Without an index, this runs a SeqScan on PostgreSQL:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
QUERY PLAN
----------------------------------------------------------------
Seq Scan on data (actual time=144.002..144.004 rows=2 loops=1)
Filter: (details @> '{"tags": ["developer"]}'::jsonb)
Rows Removed by Filter: 1000001
Buffers: shared hit=5715 read=4595
Planning Time: 0.065 ms
Serialization: time=0.008 ms output=1kB format=text
Execution Time: 144.236 ms
and a COLLSCAN on MongoDB:
test> print(
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 437,
totalKeysExamined: 0,
totalDocsExamined: 1000003,
executionStages: {
stage: 'COLLSCAN',
filter: { tags: { '$eq': 'developer' } },
nReturned: 2,
executionTimeMillisEstimate: 391,
works: 1000004,
advanced: 2,
direction: 'forward',
docsExamined: 1000003
}
}
I create a jsonb_path_ops GIN index on PostgreSQL:
CREATE INDEX idx_details ON data USING GIN (details jsonb_path_ops)
;
and a multi-key, sparse index on MongoDB for the array of tags:
db.data.createIndex({ tags: 1 } , { sparse: true })
;
PostgreSQL uses the GIN index to find the two index entries through a Bitmap Index Scan, and then fetches the two rows, where the Recheck Cond didn't have to filter anything more:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data WHERE details @> '{"tags": ["developer"]}'
;
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on data (actual time=0.019..0.021 rows=2 loops=1)
Recheck Cond: (details @> '{"tags": ["developer"]}'::jsonb)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_details (actual time=0.010..0.010 rows=2 loops=1)
Index Cond: (details @> '{"tags": ["developer"]}'::jsonb)
Buffers: shared hit=4
Planning:
Buffers: shared hit=1
Planning Time: 0.079 ms
Serialization: time=0.005 ms output=1kB format=text
Execution Time: 0.041 ms
MongoDB efficiently reads two index entries and retrieves two documents without needing bitmaps. This method preserves the index order, which is advantageous when handling multiple rows.
test> print(
db.data.find({ tags: { $all: ["developer"] } })
.explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 2,
executionTimeMillis: 0,
totalKeysExamined: 2,
totalDocsExamined: 2,
executionStages: {
stage: 'FETCH',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
docsExamined: 2,
inputStage: {
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: { tags: 1 },
indexName: 'tags_1',
isMultiKey: true,
multiKeyPaths: { tags: [ 'tags' ] },
isSparse: true,
isPartial: false,
direction: 'forward',
indexBounds: { tags: [ '["developer", "developer"]' ] },
keysExamined: 2,
seeks: 1,
dupsTested: 2
}
}
}
@? (jsonb, jsonpath): JSON Path Match
The following query searches for developers, as well as non-developers younger than 35:
On PostgreSQL:
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
On MongoDB:
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
});
Without an additional index on "age", MongoDB chooses a COLLSCAN:
test> print(
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
}).explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 585,
totalKeysExamined: 0,
totalDocsExamined: 1000003,
executionStages: {
isCached: false,
stage: 'SUBPLAN',
nReturned: 3,
executionTimeMillisEstimate: 547,
works: 1000004,
advanced: 3,
needTime: 1000000,
inputStage: {
stage: 'COLLSCAN',
filter: {
'$or': [
{ tags: { '$elemMatch': [Object] } },
{ age: { '$lt': 35 } }
]
},
nReturned: 3,
executionTimeMillisEstimate: 517,
works: 1000004,
advanced: 3,
needTime: 1000000,
direction: 'forward',
docsExamined: 1000003
}
}
}
PostgreSQL uses the GIN index, but not efficiently: it reads all index entries and removes most of them later during the recheck:
postgres=# explain (analyze, costs off, buffers, serialize text)
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer" || @.age < 35)'
;
QUERY PLAN
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on data (actual time=582.323..582.327 rows=3 loops=1)
Recheck Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
Rows Removed by Index Recheck: 1000000
Heap Blocks: exact=10310
Buffers: shared hit=14703
-> Bitmap Index Scan on idx_details (actual time=123.755..123.755 rows=1000003 loops=1)
Index Cond: (details @? '$?(@."tags"[*] == "developer" || @."age" < 35)'::jsonpath)
Buffers: shared hit=4393
Planning:
Buffers: shared hit=1
Planning Time: 0.117 ms
Serialization: time=0.009 ms output=1kB format=text
Execution Time: 582.575 ms
The solution, in both databases, is to add an index on "age".
An expression-based index in PostgreSQL:
CREATE INDEX idx_age ON data ( ((details->>'age')::int) )
;
A regular index on MongoDB, but sparse as I don't need to index missing values:
db.data.createIndex({ age: 1 }, { sparse: true} );
Here is the new execution plan in MongoDB, which combines the multi-key index on "tags" with the regular index on "age":
print(
db.data.find({
$or: [
{ tags: { $elemMatch: { $eq: "developer" } } },
{ age: { $lt: 35 } }
]
}).explain('executionStats').executionStats
);
{
executionSuccess: true,
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 4,
totalDocsExamined: 5,
executionStages: {
isCached: false,
stage: 'SUBPLAN',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
inputStage: {
stage: 'FETCH',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
docsExamined: 3,
alreadyHasObj: 2,
inputStage: {
stage: 'OR',
nReturned: 3,
executionTimeMillisEstimate: 0,
works: 6,
advanced: 3,
dupsTested: 4,
dupsDropped: 1,
inputStages: [
{
stage: 'FETCH',
filter: { '$or': [Array] },
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
docsExamined: 2,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: [Object],
indexName: 'tags_1',
isMultiKey: true,
multiKeyPaths: [Object],
isUnique: false,
isSparse: true,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: [Object],
keysExamined: 2,
seeks: 1,
dupsTested: 2,
dupsDropped: 0
}
},
{
stage: 'IXSCAN',
nReturned: 2,
executionTimeMillisEstimate: 0,
works: 3,
advanced: 2,
keyPattern: { age: 1 },
indexName: 'age_1',
isMultiKey: false,
multiKeyPaths: { age: [] },
isUnique: false,
isSparse: true,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { age: [Array] },
keysExamined: 2,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
]
}
}
}
}
I've detailed this OR expansion in a previous blog post:
Order-preserving or-expansion in MongoDB and $in:[,,] treated as a range skip scan, or exploded to an equality for the ESR rule
Franck Pachot for MongoDB ・ Feb 27
PostgreSQL cannot combine the two indexes for this jsonpath query, so you need to write the query differently:
postgres=# explain (analyze, costs off, buffers, serialize text)
-- Query for using GIN index on "details" column
SELECT * FROM data
WHERE details @? '$?(@.tags[*] == "developer")'
UNION
-- Query for using B-tree index on "age" column
SELECT * FROM data
WHERE (details->>'age')::int < 35
;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (actual time=0.134..0.220 rows=3 loops=1)
Group Key: data.id, data.details
Batches: 1 Memory Usage: 1561kB
Buffers: shared hit=9
-> Append (actual time=0.023..0.035 rows=4 loops=1)
Buffers: shared hit=9
-> Bitmap Heap Scan on data (actual time=0.022..0.024 rows=2 loops=1)
Recheck Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_details (actual time=0.012..0.012 rows=2 loops=1)
Index Cond: (details @? '$?(@."tags"[*] == "developer")'::jsonpath)
Buffers: shared hit=4
-> Bitmap Heap Scan on data data_1 (actual time=0.009..0.010 rows=2 loops=1)
Recheck Cond: (((details ->> 'age'::text))::integer < 35)
Heap Blocks: exact=1
Buffers: shared hit=4
-> Bitmap Index Scan on idx_age (actual time=0.008..0.008 rows=2 loops=1)
Index Cond: (((details ->> 'age'::text))::integer < 35)
Buffers: shared hit=3
Planning:
Buffers: shared hit=1
Planning Time: 0.160 ms
Serialization: time=0
MongoDB Atlas On-Premise: Challenges and Flexible Alternatives
If you’re running enterprise applications, you might be facing a common dilemma: cloud solutions offer great features, but regulatory constraints and data sovereignty laws might prevent you from using them fully. In fact, about 60% of enterprises struggle with this exact problem. While cloud computing delivers scalability and convenience, your organization might require on-premises or […]
Build fast software with big data requirements
Somehow we got stuck on the idea that big data systems should be slow. We're making it fast.
Scheduled scaling of Amazon Aurora Serverless with Amazon EventBridge Scheduler
In this post, we demonstrate how you can implement scheduled scaling for Aurora Serverless using Amazon EventBridge Scheduler. By proactively adjusting minimum Aurora Capacity Units (ACUs), you can achieve faster scaling rates during peak periods while maintaining cost efficiency during low-demand times.
March 18, 2025
The Percona Perspective: A Vision for the Future of Open Source Databases in the Age of AI
From punch cards to containerization, database technologies have changed tremendously over the years. And there has been no shortage of watershed moments along the way. However, we believe the emergence of advanced AI may very well be the most significant one yet. Percona certainly isn’t alone in that belief. But, when it comes to the […]
The Real Failure Rate of EBS
Our experience running AWS EBS at scale for critical workloads
March 17, 2025
Vibe data engineering
You're probably sick of hearing people talk about "vibe coding", but what about "vibe data engineering?" ;)
Normalization and Relational Division in SQL and MongoDB
MongoDB promotes document data modeling rather than the highest normal forms of a relational model. To understand this approach, it's interesting to examine the reasons why E. F. Codd introduced normalization, as he describes in his article NORMALIZED DATA BASE STRUCTURE: A BRIEF TUTORIAL (1971), and compare with modern applications design:
make formatted data bases readily accessible to users (especially casual users) who have little or no training in programming
casual user at a terminal often has occasion to require tables to be printed out or displayed
Normalizing data structures into a set of two-dimensional tables was originally intended to simplify these structures for non-programmers and interactive users, a role taken by data scientists today. They accessed databases through early computer devices (TTY on CRT terminals and dot matrix printers).
This approach contrasts with the modern graphical user interface experience, where users can navigate and unfold nested structures using a mouse click.
Moreover, when the database client is a program rather than a human, it interacts with object-oriented structures that form a graph of interconnected objects instead of independent tables. In both cases, a document model, whether normalized or not, can be a better fit than the set of tables described by Codd (as long as data consistency and schema validation is maintained, of course).
It's also important to note that Codd's description of normalization focuses on the logical structure and the API, not the physical storage:
remember we are not discussing the physical representation
Data model to process data
Critiques of the document model proposed by MongoDB sometimes invoke Codd's relational theory as if the document model were heretical to established norms. It's important to clarify that physically storing nested structures doesn't violate the principles of normalized logical structures.
For modern applications, a document model API can be the best fit, just as highly normalized structures were once ideal for supporting casual users' ad-hoc queries. Both models serve different needs, and understanding this flexibility is crucial for leveraging the right approach in modern database applications.
Representing data as a set of independent tables makes it easy to display, but how do we efficiently process it, given the absence of navigable pointers and procedural languages?
To address this challenge, Codd introduced relational algebra, with powerful set-based operations such as projection (π), selection (σ), join (⋈), union (∪), intersection (∩), difference (−), product (×), and division (÷):
Now, it may be contended that tables are inadequate data structures to support facile manipulation of relationships between and within the tables. Application of elementary operations on relations (operations such as projection, join, and division) shows that this contention is false.
SQL is widely recognized as the declarative, English-based language that implements the operations defined by Codd's relational algebra, designed primarily for human users to interact with databases. However, the same operations can be achieved using MongoDB's aggregation pipelines on documents.
Notably, one operation, relational division — cited as essential for efficient data processing on normalized tables — can be quite complex to achieve in SQL databases but is more straightforward in MongoDB.
Relational Division
Relational division is essential for identifying relationships between tables based on subsets instead of individual rows. It finds records linked to every item in a specific subset. Practical business examples include:
- Finding employees who possess all required certifications.
- Identifying customers who purchase all items in a bundle.
- Locating students who have completed all required courses.
- Determining vendors supplying all contract materials.
- Discovering authors collaborating with all team members.
- Selecting athletes participating in all league events.
- Finding tutors who cover every curriculum topic.
- Identifying devices compatible with all software platforms.
- Finding the pilots who can fly all planes in the hangar.
This operation is vital for scenarios where comprehensive coverage or participation across a set of criteria is necessary, and while complex in SQL, it's more straightforward to implement in MongoDB.
Relational Division in SQL (normalized - 3NF)
The "pilots who can fly all planes in the hangar" problem is an example taken from the excellent article Divided We Stand: The SQL of Relational Division by Joe Celko (Redgate).
In PostgreSQL, the following normalized tables can store the pilot and plane entities, as well as the relationships for planes in the hangar and pilot skills:
CREATE TABLE pilots (
pilot_id SERIAL PRIMARY KEY,
pilot_name VARCHAR NOT NULL
);
CREATE TABLE planes (
plane_id SERIAL PRIMARY KEY,
plane_name VARCHAR NOT NULL
);
CREATE TABLE pilot_skills (
pilot_id INT REFERENCES pilots(pilot_id),
plane_id INT REFERENCES planes(plane_id),
PRIMARY KEY (pilot_id, plane_id)
);
CREATE TABLE hangar (
plane_id INT PRIMARY KEY REFERENCES planes(plane_id)
);
I insert the same data as in the Redgate article:
INSERT INTO pilots (pilot_name) VALUES
('Celko'),
('Higgins'),
('Jones'),
('Smith'),
('Wilson')
;
INSERT INTO planes (plane_name) VALUES
('Piper Cub'),
('B-1 Bomber'),
('B-52 Bomber'),
('F-14 Fighter'),
('F-17 Fighter')
;
INSERT INTO pilot_skills (pilot_id, plane_id) VALUES
(1, 1), -- Celko: Piper Cub
(2, 3), (2, 4), (2, 1), -- Higgins: B-52 Bomber, F-14 Fighter, Piper Cub
(3, 3), (3, 4), -- Jones: B-52 Bomber, F-14 Fighter
(4, 2), (4, 3), (4, 4), -- Smith: B-1 Bomber, B-52 Bomber, F-14 Fighter
(5, 2), (5, 3), (5, 4), (5, 5) -- Wilson: B-1 Bomber, B-52 Bomber, F-14 Fighter, F-17 Fighter
;
INSERT INTO hangar (plane_id) VALUES
(2), -- B-1 Bomber
(3), -- B-52 Bomber
(4) -- F-14 Fighter
;
To find pilots who can fly all planes in the hangar, use a double negation: identify all pilots for whom no plane in the hangar exists that they cannot fly. This corresponds to a double NOT EXISTS in SQL:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
WHERE NOT EXISTS (
SELECT *
FROM hangar h
WHERE NOT EXISTS (
SELECT *
FROM pilot_skills ps
WHERE ps.pilot_id = p.pilot_id AND ps.plane_id = h.plane_id
)
);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
If you find it more readable, the second negation can be declared with EXCEPT or MINUS instead of NOT EXISTS:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
WHERE NOT EXISTS (
SELECT plane_id FROM hangar
EXCEPT
SELECT plane_id FROM pilot_skills ps WHERE ps.pilot_id = p.pilot_id
);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
If you feel more like an accountant than a logician, you can bypass SQL's lack of relational division by counting pilot skills and comparing them to the number of planes in the hangar:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
JOIN pilot_skills ps ON p.pilot_id = ps.pilot_id
JOIN hangar h ON ps.plane_id = h.plane_id
GROUP BY p.pilot_id, p.pilot_name
HAVING COUNT(DISTINCT h.plane_id) = (SELECT COUNT(*) FROM hangar);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
In mathematics, relational division may have a remainder. How can you list pilots with additional skills (planes they can fly that are not in the hangar)? How would you identify pilots lacking required skills and specify the missing skills?
E. F. Codd stated that processing normalized tables would not be difficult thanks to relational operations, but what happens when no SQL database implements them directly? A document model may be easier.
Relational Division in SQL (unnormalized - 0NF)
Let's violate the first normal form and store the skills as an array (which is a PostgreSQL datatype) embedded with the pilot:
CREATE TABLE pilotsWithSkills (
pilot_name VARCHAR PRIMARY KEY,
planes VARCHAR[] -- An array of plane names
);
INSERT INTO pilotsWithSkills (pilot_name, planes)
VALUES
('Celko', ARRAY['Piper Cub']),
('Higgins', ARRAY['B-52 Bomber', 'F-14 Fighter', 'Piper Cub']),
('Jones', ARRAY['B-52 Bomber', 'F-14 Fighter']),
('Smith', ARRAY['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter']),
('Wilson', ARRAY['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter', 'F-17 Fighter']);
I can use the <@ operator with a subquery to find the pilots who can fly all planes in the hangar:
SELECT pilot_name
FROM pilotsWithSkills
WHERE array(
select plane_name from hangar join planes using (plane_id)
) <@ planes
;
With a GIN index on the array, the following execution plan, with a Bitmap Index Scan and a Recheck Cond, can optimize the search:
I can also create the list as JSONB instead of ARRAY and use jsonb_agg() instead of array(). But another possibility is to use a document database.
Relational Division in MongoDB
To model this without the complexity of embedding arrays or JSON in SQL tables, MongoDB can store and index documents natively:
db.pilotSkills.insertMany([
{ pilot_name: 'Celko', planes: ['Piper Cub'] },
{ pilot_name: 'Higgins', planes: ['B-52 Bomber', 'F-14 Fighter', 'Piper Cub'] },
{ pilot_name: 'Jones', planes: ['B-52 Bomber', 'F-14 Fighter'] },
{ pilot_name: 'Smith', planes: ['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter'] },
{ pilot_name: 'Wilson', planes: ['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter', 'F-17 Fighter'] }
]);
db.hangar.insertMany([
{ plane_name: 'B-1 Bomber' },
{ plane_name: 'B-52 Bomber' },
{ plane_name: 'F-14 Fighter' }
]);
db.pilotSkills.createIndex({ planes: 1 });
Here is how I find the pilots who can fly all planes in the hangar:
mdb> db.pilotSkills.find({
planes: { $all: db.hangar.distinct("plane_name") }
} , { pilot_name: 1, _id: 0})
;
[ { pilot_name: 'Smith' }, { pilot_name: 'Wilson' } ]
If you look at the execution plan (adding .explain()), you will see that the MongoDB Multi-Plan optimizer has considered using the index for each value separately, in case one is more selective, as well as the combination of the three with an AND_SORTED index intersection.
It is also possible to do it in a single query, and even a more complex one, like adding the pilot's extra skills (the planes not in the hangar) for a relational division with remainder:
db.pilotSkills.aggregate([
// Lookup hangar data to include required planes
{
$lookup: {
from: "hangar",
pipeline: [
{ $group: { _id: null, plane_names: { $addToSet: "$plane_name" } } }
],
as: "hangar_planes"
}
},
// Unwind the hangar_planes array
{
$unwind: "$hangar_planes"
},
// Match pilots who can fly all planes in the hangar
{
$match: {
$expr: {
$setIsSubset: ["$hangar_planes.plane_names", "$planes"]
}
}
},
// Add a field for extra skills beyond the required planes
{
$addFields: {
extraSkills: {
$filter: {
input: "$planes",
as: "plane",
cond: { $not: { $in: ["$$plane", "$hangar_planes.plane_names"] } }
}
}
}
},
// Project only the desired fields
{
$project: {
pilot_name: 1,
extraSkills: 1,
_id: 0
}
}
]);
[
{ pilot_name: 'Smith', extraSkills: [] },
{ pilot_name: 'Wilson', extraSkills: [ 'F-17 Fighter' ] }
]
With $setIsSubset
MongoDB implements relational division with better developer experience than SQL's subqueries with double negations or counting. PostgreSQL JSONB adds operations on documents, but can become more complex, for example for the division with remainder. MongoDB simplifies this through a series of steps in an aggregation pipeline: retrieving planes from the hangar, unwinding into an array to compare to the pilot's array of skills, applying division logic between the two, calculating the remainders, and projecting pilot names with skills.
It may be surprising, but SQL must use complex subqueries, while MongoDB provides a direct method for this relational algebra operator, enabling efficient processing of normalized collections.
Normalization and Relational Division in SQL and MongoDB
MongoDB promotes document data modeling rather than the highest normal forms of the relational model. To understand this approach, it's interesting to examine the reasons why E. F. Codd introduced normalization, as described in his article NORMALIZED DATA BASE STRUCTURE: A BRIEF TUTORIAL (1971), and compare them with modern application design:
"make formatted data bases readily accessible to users (especially casual users) who have little or no training in programming"
"casual user at a terminal often has occasion to require tables to be printed out or displayed"
Normalizing data structures into a set of two-dimensional tables was originally intended to simplify these structures for non-programmers and interactive users, a role filled by data scientists today. Those users accessed databases through early devices: teletypes (TTY), CRT terminals, and dot matrix printers.
This approach contrasts with the modern graphical user interface experience, where users can navigate and unfold nested structures using a mouse click.
Moreover, when the database client is a program rather than a human, it interacts with object-oriented structures that form a graph of interconnected objects rather than independent tables. In both cases, a document model, whether normalized or not, can be a better fit than the set of tables described by Codd (as long as data consistency and schema validation are maintained, of course).
It's also important to note that Codd's description of normalization focuses on the logical structure and the API, not the physical storage:
"remember we are not discussing the physical representation"
Data model to process data
Critiques of the document model promoted by MongoDB sometimes invoke Codd's relational theory as if the document model were heretical to established norms. It's important to clarify that physically storing nested structures doesn't violate the principles of normalized logical structures.
For modern applications, a document model API can be the best fit, just as highly normalized structures were once ideal for supporting casual users' ad-hoc queries. Both models serve different needs, and understanding this flexibility is crucial for leveraging the right approach in modern database applications.
Representing data as a set of independent tables makes it easy to display, but how do we efficiently process it, given the absence of navigable pointers and procedural languages?
To address this challenge, Codd introduced relational algebra, with powerful set-based operations such as selection (σ), projection (π), join (⋈), union (∪), intersection (∩), difference (−), product (×), and division (÷):
"Now, it may be contended that tables are inadequate data structures to support facile manipulation of relationships between and within the tables. Application of elementary operations on relations (operations such as projection, join, and division) shows that this contention is false."
SQL is widely recognized as the declarative, English-based language that implements the operations defined by Codd's relational algebra, designed primarily for human users to interact with databases. However, the same operations can be achieved using MongoDB's aggregation pipelines on documents.
Notably, one operation, relational division — cited as essential for efficient data processing on normalized tables — can be quite complex to achieve in SQL databases but is more straightforward in MongoDB.
Relational Division
Relational division is essential for identifying relationships between tables based on subsets instead of individual rows. It finds records linked to every item in a specific subset. Practical business examples include:
- Finding employees who possess all required certifications.
- Identifying customers who purchase all items in a bundle.
- Locating students who have completed all required courses.
- Determining vendors supplying all contract materials.
- Discovering authors collaborating with all team members.
- Selecting athletes participating in all league events.
- Finding tutors who cover every curriculum topic.
- Identifying devices compatible with all software platforms.
- Finding pilots who can fly all planes in the hangar.
This operation is vital for scenarios where comprehensive coverage or participation across a set of criteria is necessary, and while complex in SQL, it's more straightforward to implement in MongoDB.
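To make the operation precise, here is a standard textbook formulation of relational division (my notation, not a quote from Codd): given PilotSkills(pilot, plane) and Hangar(plane), the division returns the pilots related to every plane in the hangar.
\text{PilotSkills} \div \text{Hangar} = \{\, p \mid \forall h \in \text{Hangar} : (p, h) \in \text{PilotSkills} \,\}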
Relational Division in SQL (normalized - 3NF)
Finding the pilots who can fly all planes in the hangar is an example taken from the excellent article Divided We Stand: The SQL of Relational Division by Joe Celko (Redgate).
In PostgreSQL, the following normalized tables can store the pilot and plane entities, as well as the relationships for planes in the hangar and pilot skills:
CREATE TABLE pilots (
pilot_id SERIAL PRIMARY KEY,
pilot_name VARCHAR NOT NULL
);
CREATE TABLE planes (
plane_id SERIAL PRIMARY KEY,
plane_name VARCHAR NOT NULL
);
CREATE TABLE pilot_skills (
pilot_id INT REFERENCES pilots(pilot_id),
plane_id INT REFERENCES planes(plane_id),
PRIMARY KEY (pilot_id, plane_id)
);
CREATE TABLE hangar (
plane_id INT PRIMARY KEY REFERENCES planes(plane_id)
);
I insert the same data as in the Redgate article:
INSERT INTO pilots (pilot_name) VALUES
('Celko'),
('Higgins'),
('Jones'),
('Smith'),
('Wilson')
;
INSERT INTO planes (plane_name) VALUES
('Piper Cub'),
('B-1 Bomber'),
('B-52 Bomber'),
('F-14 Fighter'),
('F-17 Fighter')
;
INSERT INTO pilot_skills (pilot_id, plane_id) VALUES
(1, 1), -- Celko: Piper Cub
(2, 3), (2, 4), (2, 1), -- Higgins: B-52 Bomber, F-14 Fighter, Piper Cub
(3, 3), (3, 4), -- Jones: B-52 Bomber, F-14 Fighter
(4, 2), (4, 3), (4, 4), -- Smith: B-1 Bomber, B-52 Bomber, F-14 Fighter
(5, 2), (5, 3), (5, 4), (5, 5) -- Wilson: B-1 Bomber, B-52 Bomber, F-14 Fighter, F-17 Fighter
;
INSERT INTO hangar (plane_id) VALUES
(2), -- B-1 Bomber
(3), -- B-52 Bomber
(4) -- F-14 Fighter
;
To find pilots who can fly all planes in the hangar, use a double negation: identify all pilots for whom no plane in the hangar exists that they cannot fly. This corresponds to a double NOT EXISTS in SQL:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
WHERE NOT EXISTS (
SELECT *
FROM hangar h
WHERE NOT EXISTS (
SELECT *
FROM pilot_skills ps
WHERE ps.pilot_id = p.pilot_id AND ps.plane_id = h.plane_id
)
);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
If you find it more readable, the second negation can be declared with EXCEPT or MINUS instead of NOT EXISTS:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
WHERE NOT EXISTS (
SELECT plane_id FROM hangar
EXCEPT
SELECT plane_id FROM pilot_skills ps WHERE ps.pilot_id = p.pilot_id
);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
If you feel more like an accountant than a logician, you can bypass SQL's lack of relational division by counting pilot skills and comparing them to the number of planes in the hangar:
postgres=> SELECT p.pilot_id, p.pilot_name
FROM pilots p
JOIN pilot_skills ps ON p.pilot_id = ps.pilot_id
JOIN hangar h ON ps.plane_id = h.plane_id
GROUP BY p.pilot_id, p.pilot_name
HAVING COUNT(DISTINCT h.plane_id) = (SELECT COUNT(*) FROM hangar);
pilot_id | pilot_name
----------+------------
4 | Smith
5 | Wilson
(2 rows)
Just as division in mathematics may leave a remainder, relational division can too. How can you list pilots with additional skills (planes they can fly that are not in the hangar)? How would you identify pilots lacking required skills, and specify which skills are missing?
E. F. Codd stated that processing normalized tables would not be difficult thanks to relational operations, but what happens if no SQL database implements them? A document model may be easier.
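Before switching models, here is one possible way to answer the remainder question in plain SQL, a sketch of my own rather than a query from Celko's article: extend the counting approach with a filtered aggregate that collects the planes outside the hangar.
SELECT p.pilot_id, p.pilot_name,
       -- the "remainder": planes the pilot can fly that are not in the hangar
       ARRAY_AGG(pl.plane_name) FILTER (
         WHERE ps.plane_id NOT IN (SELECT plane_id FROM hangar)
       ) AS extra_planes
FROM pilots p
JOIN pilot_skills ps ON ps.pilot_id = p.pilot_id
JOIN planes pl       ON pl.plane_id = ps.plane_id
GROUP BY p.pilot_id, p.pilot_name
-- keep only pilots qualified on every plane in the hangar
HAVING COUNT(*) FILTER (WHERE ps.plane_id IN (SELECT plane_id FROM hangar))
       = (SELECT COUNT(*) FROM hangar);
On the sample data, this should return Smith with a NULL extra_planes and Wilson with {"F-17 Fighter"}. It works, but the intent is buried in joins, filters, and a HAVING clause.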
Relational Division in SQL (unnormalized - 0NF)
Let's violate the first normal form and store the skills as an array (a native PostgreSQL datatype) embedded in the pilot's row:
CREATE TABLE pilotsWithSkills (
pilot_name VARCHAR PRIMARY KEY,
planes VARCHAR[] -- An array of plane names
);
INSERT INTO pilotsWithSkills (pilot_name, planes)
VALUES
('Celko', ARRAY['Piper Cub']),
('Higgins', ARRAY['B-52 Bomber', 'F-14 Fighter', 'Piper Cub']),
('Jones', ARRAY['B-52 Bomber', 'F-14 Fighter']),
('Smith', ARRAY['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter']),
('Wilson', ARRAY['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter', 'F-17 Fighter']);
I can use the <@ (is contained by) operator with a subquery to find the pilots who can fly all planes in the hangar:
SELECT pilot_name
FROM pilotsWithSkills
WHERE ARRAY(
  SELECT plane_name FROM hangar JOIN planes USING (plane_id)
) <@ planes;
With a GIN index on the array, the execution plan uses a Bitmap Index Scan with a Recheck Cond, which can optimize the search.
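For reference, such a GIN index can be created as follows (the index name is mine, not taken from the original article):
CREATE INDEX pilotswithskills_planes_gin ON pilotsWithSkills USING gin (planes);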
I can also create the list in JSONB instead of ARRAY and use jsonb_agg() instead of array().
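A possible JSONB variant is sketched below (the table name and query are mine; the original article does not show them). The jsonb <@ containment operator plays the same role as the array operator:
CREATE TABLE pilotsWithSkillsJsonb (
  pilot_name VARCHAR PRIMARY KEY,
  planes JSONB  -- a JSON array of plane names
);
SELECT pilot_name
FROM pilotsWithSkillsJsonb
WHERE (SELECT jsonb_agg(plane_name) FROM hangar JOIN planes USING (plane_id)) <@ planes;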
But another possibility is to use a document database.
Relational Division in MongoDB
To normalize without the complexity of embedding arrays or JSON in SQL tables, MongoDB can store and index documents natively:
db.pilotSkills.insertMany([
{ pilot_name: 'Celko', planes: ['Piper Cub'] },
{ pilot_name: 'Higgins', planes: ['B-52 Bomber', 'F-14 Fighter', 'Piper Cub'] },
{ pilot_name: 'Jones', planes: ['B-52 Bomber', 'F-14 Fighter'] },
{ pilot_name: 'Smith', planes: ['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter'] },
{ pilot_name: 'Wilson', planes: ['B-1 Bomber', 'B-52 Bomber', 'F-14 Fighter', 'F-17 Fighter'] }
]);
db.hangar.insertMany([
{ plane_name: 'B-1 Bomber' },
{ plane_name: 'B-52 Bomber' },
{ plane_name: 'F-14 Fighter' }
]);
db.pilotSkills.createIndex({ planes: 1 });
Here is how I find the pilots who can fly all planes in the hangar:
mdb> db.pilotSkills.find({
planes: { $all: db.hangar.distinct("plane_name") }
} , { pilot_name: 1, _id: 0})
;
[ { pilot_name: 'Smith' }, { pilot_name: 'Wilson' } ]
If you look at the execution plan (adding .explain()), you will see that the MongoDB Multi-Plan optimizer has considered using the index on each value individually, in case one is more selective, or the combination of the three with an AND_SORTED index intersection.
It is possible to do this in a single aggregation query, and even to handle a more complex case, such as adding the extra skills of the pilot (the planes not in the hangar) for a relational division with remainder:
db.pilotSkills.aggregate([
// Lookup hangar data to include required planes
{
$lookup: {
from: "hangar",
pipeline: [
{ $group: { _id: null, plane_names: { $addToSet: "$plane_name" } } }
],
as: "hangar_planes"
}
},
// Unwind the hangar_planes array
{
$unwind: "$hangar_planes"
},
// Match pilots who can fly all planes in the hangar
{
$match: {
$expr: {
$setIsSubset: ["$hangar_planes.plane_names", "$planes"]
}
}
},
// Add a field for extra skills beyond the required planes
{
$addFields: {
extraSkills: {
$filter: {
input: "$planes",
as: "plane",
cond: { $not: { $in: ["$$plane", "$hangar_planes.plane_names"] } }
}
}
}
},
// Project only the desired fields
{
$project: {
pilot_name: 1,
extraSkills: 1,
_id: 0
}
}
]);
[
{ pilot_name: 'Smith', extraSkills: [] },
{ pilot_name: 'Wilson', extraSkills: [ 'F-17 Fighter' ] }
]
With $setIsSubset, MongoDB implements relational division with a better developer experience than SQL's double-negation subqueries or counting workarounds. PostgreSQL JSONB adds operations on documents, but can become more complex, for example for the division with remainder. MongoDB simplifies this through a series of steps in an aggregation pipeline: retrieving the planes from the hangar, unwinding them into an array to compare with the pilot's array of skills, applying the division logic between the two, computing the remainder, and projecting the pilot names with their extra skills.
It may be surprising, but SQL must use complex subqueries, while MongoDB provides a direct method for this relational algebra operator, enabling efficient processing of normalized collections.
Percona Server for MySQL 8.4.2 vs 8.0.40: Comparison of Variables and Keywords
In this blog, we will look at the differences between LTS (Long Term Stable) versions of Percona Server for MySQL. Released in April 2019, MySQL 8.0 represented a major change from the previous version, 5.7, introducing significant changes to the data dictionary and enabling many features and enhancements. It also provided no direct downgrade path, […]
Local first.
Run Tinybird on your machine. Deploy your data project locally or to the cloud in the exact same way, as it should be.