August 12, 2025
August 11, 2025
How Wiz achieved near-zero downtime for Amazon Aurora PostgreSQL major version upgrades at scale using Aurora Blue/Green Deployments
PostgreSQL UUID: Bulk insert with UUIDv7 vs UUIDv4
PostgreSQL 18 (currently in beta) introduces a UUID version 7 generation function that features a time-ordered prefix, reducing the scattering seen in random UUID version 4. I was surprised by the enthusiasm for this. For time-incrementing identifiers, I prefer cached sequences. I think the primary advantage of UUID is the ability to generate it from the application, before hitting the database. I discussed this in a previous post: UUID in PostgreSQL.
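For reference, the cached sequences mentioned above can be declared directly on an identity column; a minimal sketch (table and column names are illustrative):
-- a bigint identity backed by a sequence with a cache, as an alternative
-- to UUIDs when a time-incrementing identifier is acceptable
create table events (
  id bigint generated always as identity (cache 100) primary key,
  payload text
);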
There's one case where generating a UUID version 7 in the database is beneficial: bulk loads. It shares the same uuid datatype as UUIDv4, and both can be present in the same column.
For OLTP, UUIDv4 minimizes hotspots in the B-Tree index during concurrent inserts, and is often generated by the application. However, for bulk loads, UUIDv7 improves cache efficiency by updating fewer blocks. When data comes from another table and requires a new identifier, a built-in generation function is useful.
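For example, a minimal sketch of that case, using a hypothetical staging table as the source and the demo table created in the script below as the target:
-- illustration only: copy rows from another table and let the database
-- generate time-ordered identifiers with the PostgreSQL 18 uuidv7() function
insert into demo (id, value)
select uuidv7(), s.value
from staging s;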
Bulk Insert throughput
For this demo, I ran a bulk ingest job inserting 10 million rows, and compared identifiers generated by uuidv7() and uuidv4(). The purpose is to illustrate the comparison.
Here is the full script:
-- reset (you are in a lab)
\! pkill -f "postgres: .* COPY"
drop table if exists demo;
drop table if exists progress;
-- create table and insert 10M rows in background
create table demo (
id uuid default uuidv7() primary key,
-- id uuid default uuidv4() primary key,
value text
);
\! psql -c "copy demo(value) from program 'base64 -w 100 /dev/urandom | head -10000000'" &
-- monitor size and progress
create table progress as
select now() ts,*,pg_indexes_size('demo')/8192 index_pages, pg_table_size('demo') tablesize , pg_indexes_size('demo') indexsize, pg_current_wal_lsn()
from pg_stat_progress_copy;
create index on progress(ts asc);
-- record last progress snapshot
insert into progress
select now(),*,pg_indexes_size('demo')/8192 index_pages, pg_table_size('demo') tablesize , pg_indexes_size('demo') indexsize, pg_current_wal_lsn()
from pg_stat_progress_copy
returning *
\;
-- display rate over last snapshots
select
pg_size_pretty((bytes_processed - lag(bytes_processed) over w) / extract(epoch from (ts - lag(ts) over w))) || '/s' as "COPY bytes/s",
(tuples_processed - lag(tuples_processed) over w) / extract(epoch from (ts - lag(ts) over w))::int as "rows/s",
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn , lag(pg_current_wal_lsn) over w) / extract(epoch from (ts - lag(ts) over w))) || '/s' as "WAL bytes/s",
round(pg_wal_lsn_diff(pg_current_wal_lsn , lag(pg_current_wal_lsn) over w) / (bytes_processed - lag(bytes_processed) over w),2) || ' x' as "WAL bytes/COPY bytes",
pg_size_pretty(indexsize) "index",
pg_size_pretty(tablesize) "table",
current_setting('shared_buffers') shared_buffers,
ts
from progress
window w as (order by ts asc)
order by ts asc -- limit 30
-- every 10 seconds
\watch 10
The script first resets the lab environment to avoid leftovers from previous runs. It then creates a demo table and populates it via COPY with pseudo-random values, running the load in the background. The foreground session queries pg_stat_progress_copy every ten seconds and calculates some throughput statistics. Each snapshot records:
- Insert rate (rows per second)
- COPY throughput (input bytes per second)
- Index size and table size over time
- WAL (Write-Ahead Log) usage
UUIDv7
Here is the result with id uuid default uuidv7() primary key:
COPY bytes/s | rows/s | WAL bytes/s | WAL bytes/COPY bytes | index | table | shared_buffers | ts
--------------+--------+-------------+----------------------+---------+---------+----------------+-------------------------------
| | | | 8232 kB | 38 MB | 128MB | 2025-08-11 15:44:56.858214+00
9030 kB/s | 91601 | 22 MB/s | 2.46 x | 36 MB | 169 MB | 128MB | 2025-08-11 15:45:06.858344+00
8979 kB/s | 91036 | 22 MB/s | 2.50 x | 63 MB | 298 MB | 128MB | 2025-08-11 15:45:16.858316+00
819 kB/s | 8314 | 2046 kB/s | 2.50 x | 65 MB | 310 MB | 128MB | 2025-08-11 15:45:26.858289+00
826 kB/s | 8318 | 2048 kB/s | 2.48 x | 68 MB | 321 MB | 128MB | 2025-08-11 15:45:36.858318+00
7693 kB/s | 77984 | 19 MB/s | 2.50 x | 91 MB | 432 MB | 128MB | 2025-08-11 15:45:46.858495+00
7296 kB/s | 74032 | 18 MB/s | 2.50 x | 114 MB | 537 MB | 128MB | 2025-08-11 15:45:56.858338+00
5184 kB/s | 52531 | 13 MB/s | 2.51 x | 130 MB | 612 MB | 128MB | 2025-08-11 15:46:06.858545+00
832 kB/s | 8453 | 2017 kB/s | 2.42 x | 132 MB | 624 MB | 128MB | 2025-08-11 15:46:16.858854+00
1389 kB/s | 14074 | 3528 kB/s | 2.54 x | 136 MB | 644 MB | 128MB | 2025-08-11 15:46:26.858837+00
7642 kB/s | 77440 | 19 MB/s | 2.49 x | 160 MB | 754 MB | 128MB | 2025-08-11 15:46:36.858441+00
7776 kB/s | 78848 | 19 MB/s | 2.50 x | 183 MB | 866 MB | 128MB | 2025-08-11 15:46:46.858377+00
4826 kB/s | 48928 | 12 MB/s | 2.50 x | 198 MB | 935 MB | 128MB | 2025-08-11 15:46:56.858401+00
1350 kB/s | 13675 | 3366 kB/s | 2.49 x | 202 MB | 955 MB | 128MB | 2025-08-11 15:47:06.858323+00
1216 kB/s | 12372 | 3047 kB/s | 2.51 x | 206 MB | 972 MB | 128MB | 2025-08-11 15:47:16.858334+00
7731 kB/s | 78382 | 19 MB/s | 2.50 x | 229 MB | 1084 MB | 128MB | 2025-08-11 15:47:26.859178+00
6996 kB/s | 70923 | 17 MB/s | 2.50 x | 251 MB | 1185 MB | 128MB | 2025-08-11 15:47:36.858273+00
4928 kB/s | 49925 | 12 MB/s | 2.55 x | 266 MB | 1255 MB | 128MB | 2025-08-11 15:47:46.858335+00
934 kB/s | 9504 | 2166 kB/s | 2.32 x | 269 MB | 1269 MB | 128MB | 2025-08-11 15:47:56.858356+00
2259 kB/s | 22880 | 5806 kB/s | 2.57 x | 276 MB | 1301 MB | 128MB | 2025-08-11 15:48:06.858839+00
7744 kB/s | 78496 | 19 MB/s | 2.45 x | 299 MB | 1413 MB | 128MB | 2025-08-11 15:48:16.858306+00
(21 rows)
The UUIDv7 ingest run shows consistently high throughput, with brief dips likely due to vacuum, background I/O or checkpoints, followed by quick recovery. Index growth is smooth and compact, while WAL overhead stays stable at 2.50 times the input data. The sequential, time-sortable nature of UUIDv7 enables fast and predictable bulk load performance, completing the 10M row job in just over 3 minutes and maintaining tight disk usage.
UUIDv4
Here is the result with id uuid default uuidv4() primary key:
COPY bytes/s | rows/s | WAL bytes/s | WAL bytes/COPY bytes | index | table | shared_buffers | ts
--------------+--------+-------------+----------------------+---------+---------+----------------+-------------------------------
8698 kB/s | 88199 | 24 MB/s | 2.82 x | 37 MB | 142 MB | 128MB | 2025-08-11 15:37:08.184794+00
7802 kB/s | 79130 | 20 MB/s | 2.66 x | 71 MB | 254 MB | 128MB | 2025-08-11 15:37:18.184817+00
5920 kB/s | 59959 | 15 MB/s | 2.58 x | 90 MB | 339 MB | 128MB | 2025-08-11 15:37:28.184804+00
1248 kB/s | 12664 | 3364 kB/s | 2.70 x | 96 MB | 357 MB | 128MB | 2025-08-11 15:37:38.184869+00
877 kB/s | 8924 | 2391 kB/s | 2.73 x | 101 MB | 370 MB | 128MB | 2025-08-11 15:37:48.184803+00
1882 kB/s | 19083 | 12 MB/s | 6.60 x | 112 MB | 397 MB | 128MB | 2025-08-11 15:37:58.184795+00
4384 kB/s | 44420 | 12 MB/s | 2.71 x | 131 MB | 460 MB | 128MB | 2025-08-11 15:38:08.184808+00
3821 kB/s | 38720 | 10006 kB/s | 2.62 x | 142 MB | 515 MB | 128MB | 2025-08-11 15:38:18.184814+00
2778 kB/s | 28160 | 7144 kB/s | 2.57 x | 149 MB | 555 MB | 128MB | 2025-08-11 15:38:28.184822+00
1971 kB/s | 19982 | 5152 kB/s | 2.61 x | 155 MB | 583 MB | 128MB | 2025-08-11 15:38:38.184814+00
1427 kB/s | 14513 | 3750 kB/s | 2.63 x | 159 MB | 604 MB | 128MB | 2025-08-11 15:38:48.184858+00
1152 kB/s | 11658 | 3069 kB/s | 2.66 x | 164 MB | 621 MB | 128MB | 2025-08-11 15:38:58.184933+00
1312 kB/s | 13333 | 15 MB/s | 11.83 x | 169 MB | 639 MB | 128MB | 2025-08-11 15:39:08.184786+00
3373 kB/s | 34144 | 9108 kB/s | 2.70 x | 184 MB | 688 MB | 128MB | 2025-08-11 15:39:18.184791+00
2221 kB/s | 22528 | 5985 kB/s | 2.69 x | 196 MB | 720 MB | 128MB | 2025-08-11 15:39:28.18487+00
3885 kB/s | 39424 | 10 MB/s | 2.72 x | 217 MB | 776 MB | 128MB | 2025-08-11 15:39:38.184842+00
3232 kB/s | 32736 | 8685 kB/s | 2.69 x | 233 MB | 822 MB | 128MB | 2025-08-11 15:39:48.18481+00
832 kB/s | 8448 | 2233 kB/s | 2.68 x | 237 MB | 834 MB | 128MB | 2025-08-11 15:39:58.184806+00
800 kB/s | 8096 | 2136 kB/s | 2.67 x | 241 MB | 846 MB | 128MB | 2025-08-11 15:40:08.184789+00
1107 kB/s | 11264 | 2962 kB/s | 2.68 x | 246 MB | 862 MB | 128MB | 2025-08-11 15:40:18.184786+00
768 kB/s | 7744 | 15 MB/s | 19.64 x | 249 MB | 873 MB | 128MB | 2025-08-11 15:40:28.184804+00
2912 kB/s | 29568 | 11 MB/s | 3.99 x | 260 MB | 915 MB | 128MB | 2025-08-11 15:40:38.184825+00
3475 kB/s | 35200 | 9083 kB/s | 2.61 x | 271 MB | 965 MB | 128MB | 2025-08-11 15:40:48.184843+00
1491 kB/s | 15136 | 3886 kB/
by Franck Pachot
Announcing Neki
NULL BITMAP on SIMD
August 09, 2025
What even is distributed systems
Distributed systems is simply the study of interactions between processes. Every two interacting processes form a distributed system, whether they are on the same host or not. Distributed systems create new challenges (compared to single-process systems) in terms of correctness (i.e. consistency), reliability, and performance (i.e. latency and throughput).
The best way to learn about the principles and fundamentals of distributed systems is to 1) read Designing Data Intensive Applications and 2) read through the papers and follow the notes in the MIT Distributed Systems course.
For Designing Data Intensive Applications (DDIA), I strongly encourage you to find buddies at work or online who will read it through with you. You can also always join the Software Internals Discord's #distsys channel to ask questions as you go. But it's still best if you have some partners to go through the book with, even if they are as new to it as you.
I also used to think that you might want to wait a few years into your career before reading DDIA, but when you have friends to read it with, I think you need not wait.
If you have only skimmed the book you should definitely go back and give it a thorough read. I have read it three times already and I will read it again as part of the Software Internals Book Club next year after the 2nd Edition is published.
Keep in mind that every chapter of DDIA provides references to papers you can keep reading should you end up memorizing DDIA itself.
When you've read parts of DDIA or the MIT Distributed Systems course and you want practice, the Fly.io x Jepsen Distributed Systems Challenge is one guided option. Other options might include simply implementing (in somewhat ascending complexity):
- two-phase commit
- three-phase commit
- single-decree Paxos
- chain replication (or CRAQ), using a 3rd-party consensus library
- Raft
- EPaxos
And if you get bored there you can see Alex Miller's Data Replication Design Spectrum for more ideas and variants.
If these projects and papers sound arcane or intimidating, know that you will encounter the problems they solve whether or not you know and understand these solutions. Developers often end up reinventing hacky versions of them, which are more likely to have subtle bugs, whereas you could instead recognize and use one of these well-known building blocks. Or at least have the background to better reason about correctness should you be in a situation where you must work with a novel distributed system or you end up designing a new one yourself.
And again, if you want folks to bounce ideas off of or ask questions to, I strongly encourage you to join the Software Internals Discord and ask there!
August 08, 2025
Joining and grouping on array fields in MongoDB may require using $unwind before applying $group or $lookup
Working with nested data in MongoDB simplifies mapping between application objects and database structures. However, challenges can arise when grouping or joining values within sub-document arrays, particularly for developers shifting from SQL databases with normalized data, where the result is always flattened to a tabular result.
I'll go through an example with the following collection, and link to MongoDB Playground for each explanation.
[
{
"projectName": "Troubleshooting PostgreSQL issues",
"team": [
{ "memberId": "cyclops", "name": "Cyclops", "role": "Postgres Expert" },
{ "memberId": "wolverine", "name": "Wolverine", "role": "Consultant" },
{ "memberId": "storm", "name": "Storm", "role": "DBA" },
{ "memberId": "beast", "name": "Beast", "role": "Developer" },
{ "memberId": "tony", "name": "Tony", "role": "Architect" }
],
"status": "active"
},
{
"projectName": "Build new apps with MongoDB",
"team": [
{ "memberId": "tony", "name": "Tony", "role": "Developer" }
],
"status": "planned"
}
]
Suppose you want a report of which projects each person was involved in, and their role(s) for each.
In PostgreSQL, you’d have normalized the data into three tables because of the many-to-many relationship. The join returns one row per project and team member, with information about projects and members duplicated. You can then aggregate them per team member to get the projects per person as an array:
SELECT
tm.member_id,
tm.name,
ARRAY_AGG(p.project_name ORDER BY p.project_id) AS projects
FROM
team_members tm
JOIN project_team pt ON tm.member_id = pt.member_id
JOIN projects p ON pt.project_id = p.project_id
GROUP BY
tm.member_id, tm.name
ORDER BY
tm.member_id;
Here's the example on db<>fiddle: https://dbfiddle.uk/FhsA9DpU
In MongoDB, relationships are embedded. I created a "team" array for project members. Let's explore how to aggregate it, starting with some common mistakes.
The Wrong Way: Grouping Directly on an Array
Suppose you try to group directly on the array field:
db.projects.aggregate([
{ $group: { _id: "$team.memberId", projects: { $push: "$projectName" } } }
])
What you get:
[
{
"_id": [
"cyclops", "wolverine", "storm", "beast", "tony"
],
"projects": ["Troubleshooting PostgreSQL issues"]
},
{
"_id": [
"tony"
],
"projects": ["Build new apps with MongoDB"]
}
]
Here's the example on MongoDB Playground: https://mongoplayground.net/p/bn7xsRMQhu2
What went wrong?
MongoDB used the whole array as the grouping key. Instead of grouping by each individual member, you get one group for each unique full-team combination.
Tip for SQL users:
In SQL, GROUP BY splits rows by scalar values, but in MongoDB, grouping on an array field groups by the entire array as a single value.
The Wrong Way: Lookup Directly on an Array
Similarly, you may try to enrich team data with $lookup on the array field:
{
$lookup: {
from: "members",
localField: "team.memberId",
foreignField: "memberId",
as: "memberInfo"
}
}
Here's the example on MongoDB Playground:
https://mongoplayground.net/p/RVXoY5z7ke1
What you get:
- The entire array is used as the lookup key, so memberInfo will usually just be an empty array (no matches), unless a member’s memberId somehow matches an array, which it never will.
Key for SQL folks:
Unlike SQL, where the JOIN is applied per row value, MongoDB $lookup expects scalar fields, not arrays. If your join keys are inside an array, you need to flatten that array first.
The Right Way: Flatten with $unwind, Then $group
First, flatten out your team array so each person/project combination gets its own pipeline document:
{ $unwind: "$team" }
Here's how it looks on MongoDB Playground:
https://mongoplayground.net/p/yGvvM-FZM5p
Then group by the member ID. Don’t forget to include the role for each project; here, we also collect all roles for completeness:
{
$group: {
_id: "$team.memberId",
memberName: { $first: "$team.name" },
roles: { $addToSet: "$team.role" },
projects: {
$push: {
name: "$projectName",
id: "$_id",
role: "$team.role"
}
}
}
}
The full working pipeline looks like this:
db.projects.aggregate([
{ $unwind: "$team" },
{
$group: {
_id: "$team.memberId",
memberName: { $first: "$team.name" },
roles: { $addToSet: "$team.role" },
projects: {
$push: {
name: "$projectName",
id: "$_id",
role: "$team.role"
}
}
}
}
])
Here's the result on MongoDB Playground:
https://mongoplayground.net/p/oMOfvrZXa2a
This is the expected output, with one document per member and their array of projects:
[
{
"_id": "cyclops",
"memberName": "Cyclops",
"roles": ["Postgres Expert"],
"projects": [
{ "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Postgres Expert" }
]
},
{
"_id": "wolverine",
"memberName": "Wolverine",
"roles": ["Consultant"],
"projects": [
{ "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Consultant" }
]
},
{
"_id": "storm",
"memberName": "Storm",
"roles": ["DBA"],
"projects": [
{ "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "DBA" }
]
},
{
"_id": "beast",
"memberName": "Beast",
"roles": ["Developer"],
"projects": [
{ "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Developer" }
]
},
{
"_id": "tony",
"memberName": "Tony",
"roles": ["Architect", "Developer"],
"projects": [
{ "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Architect" },
{ "name": "Build new apps with MongoDB", "id": ObjectId("60f...a2"), "role": "Developer" }
]
}
]
Now you get the right result: Tony appears in both projects, and each is listed with the correct role.
Practical reminder:
MongoDB’s $unwind makes array data behave more like SQL’s flattened rows, so aggregation and joins work as SQL users expect.
Here is an example with $lookup: https://mongoplayground.net/p/I3__p2EE9ps
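As a sketch of what that linked example does (reusing the members collection and memberId field from the earlier $lookup attempt), unwind first so the join key becomes a scalar, then look up:
db.projects.aggregate([
  { $unwind: "$team" },        // one pipeline document per project/member pair
  {
    $lookup: {                 // the join key is now a scalar
      from: "members",
      localField: "team.memberId",
      foreignField: "memberId",
      as: "memberInfo"
    }
  }
])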
What About Filtering Instead of Grouping?
If you just want to filter out all but the developers in each project, no unwinding is needed:
db.projects.aggregate([
{
$addFields: {
team: {
$filter: {
input: "$team",
as: "member",
cond: { $eq: ["$$member.role", "Developer"] }
}
}
}
}
])
Here's the MongoDB playground:
https://mongoplayground.net/p/bhDQJVKu7Rp
[
{
"_id": ObjectId("60f...a1"),
"projectName": "Troubleshooting PostgreSQL issues",
"team": [
{ "memberId": "beast", "name": "Beast", "role": "Developer" }
],
"status": "active"
},
{
"_id": ObjectId("60f...a2"),
"projectName": "Build new apps with MongoDB",
"team": [
{ "memberId": "tony", "name": "Tony", "role": "Developer" }
],
"status": "planned"
}
]
I used the following aggregation operators:
- $unwind: Flattens array fields, creating a separate document for each element in the array.
- $group: Similar to SQL’s GROUP BY, especially effective after arrays have been unwound.
- $lookup: Similar to a SQL LEFT OUTER JOIN, but by default MongoDB does not flatten joined results into multiple documents with repeated data. Instead, matched documents are returned as an array field, keeping the nested (non-first-normal-form) structure.
SQL databases store their data normalized, often in third normal form, with multiple tables for one-to-many and many-to-many relationships to avoid duplicated data. The query result requires a join and is always in first normal form (1NF), as the result of a SQL query is a single flat table. Since 1NF cannot have arrays, SQL databases have to unnest groups, flattening relationships to multiple rows and introducing redundancy in a denormalized tabular result set.
MongoDB is not constrained by normal forms and supports rich document models, with arrays for repeating groups and nested objects directly in each document. When you use the aggregation pipeline, results can keep this nested structure. But if you want to group or join on values nested inside arrays, you’ll need to flatten the array to multiple documents using $unwind, so further aggregation stages work as expected. In practice, $lookup in MongoDB is often compared to JOINs in SQL, but if your fields live inside arrays, a join operation is really $unwind followed by $lookup.
Key takeaways for SQL users moving to MongoDB:
- Arrays are not automatically expanded or unnested.
- Whenever your “join key” or “group by value” is inside an array, always unwind first.
- In MongoDB, aggregation pipeline results can be nested and contain arrays, unlike the flat results of SQL queries.
- $unwind is the document-model equivalent of SQL’s UNNEST: it’s your bridge from nested arrays to flat, row-like documents for further aggregation.
- When joining with $lookup, always check whether your localField or foreignField is an array and flatten as needed.
Neurosymbolic AI: The 3rd Wave
The paper (arXiv 2020, also AI Review 2023) opens by discussing recent high-profile AI debates: the Montréal AI Debate and the AAAI 2020 fireside chat with Kahneman, Hinton, LeCun, and Bengio. A consensus seems to be emerging: for AI to be robust and trustworthy, it must combine learning with reasoning. Kahneman's "System 1 vs. System 2" dual framing of cognition maps well to deep learning and symbolic reasoning. And AI needs both.
Neurosymbolic AI promises to combine data-driven learning with structured reasoning, and provide modularity, interpretability, and measurable explanations. The paper moves from philosophical context to representation, then to system design and technical challenges in neurosymbolic AI.
Neurons and Symbols: Context and Current Debate
This section lays out the historic divide between symbolic AI and neural AI. The symbolic approach supports logic, reasoning, and explanation. The neural approach excels at perception and learning from data. Symbolic systems are good at thinking, but not learning. Deep learning is good at learning, but not thinking. Despite the great progress recently, deep learning still lacks transparency and remains energy-hungry. Critics like Gary Marcus argue that symbolic manipulation is needed for generalization and commonsense.
The authors here appeal to Valiant's call for a "semantics of knowledge" and say that neural-symbolic computing aims to answer this call. Symbolic logic can be embedded in neural systems, and neural representations can be interpreted in symbolic terms. Logic Tensor Networks (LTNs) are presented as a concrete solution. They embed first-order logic formulas into tensors, and sneak logic into the loss function to help learn not just from data, but from rules. For this, logical formulas are relaxed into differentiable constraints. These are then used during training, guiding the model to satisfy logical relationships while learning from data. I was surprised to see some concrete work and software for LTNs on GitHub. There is also a paper explaining the principles.
Distributed and Localist Representation
This section reframes the debate around representation. Neural networks use distributed representations: knowledge is encoded in continuous vectors, over which the concepts are smeared. This works well for learning and optimization. Symbolic systems use localist representations: discrete identifiers for concepts. These are better for reasoning and abstraction.
The challenge is to bridge the two. LTNs do this by grounding symbolic logic into tensor-based representations. Logic formulas are mapped to constraints over continuous embeddings. This enables symbolic querying over learned neural structures, while preserving the strengths of gradient-based learning. LTNs also allow symbolic structure to emerge during learning.
There is an interesting contrast here with the Neurosymbolic AI paper we reviewed yesterday. That paper favored Option 2 approaches, which begin with the neural representation and lift it into symbolic form. In other words, it advocates extracting structured symbolic patterns, explanations, or logical chains of reasoning from the output of neural systems. This paper, through advocating for LTNs, seems to favor Option 1: embedding symbolic structures into neural vector spaces.
Neurosymbolic Computing Systems: Technical Aspects
Symbolic models use rules: decision trees, logic programs, structured knowledge. They are interpretable but brittle to change and new information. Deep nets, on the other hand, learn vector patterns using gradient descent. They are great with fuzz, but awful with generalizing rules. They speak linear algebra, not logic.
The first approach to combine them is to bake logic into the network's structure. The second approach is to encode logic in the loss function, but otherwise keep it separate from the network's architecture. The authors seem to lean toward the second approach for its flexibility, modularity, and scalability.
LTNs also seem to fall into the second approach. LTNs represent logical formulas as differentiable constraints, which are added to the loss function during training. The network learns to satisfy logic, but the logic is not hardwired into its structure. So the logic guides learning, but it is not embedded in the weights.
Challenges for the Principled Combination of Reasoning and Learning
Combining reasoning and learning introduces new challenges. One is how to handle quantifiers. Symbolic systems handle universal quantifiers (∀) well. Neural networks are better at spotting existential patterns (∃). This asymmetry makes hybrid systems attractive: let each side do what it does best.
Restricted Boltzmann Machines (RBMs) are discussed as early examples of hybrid models. They learn probability distributions over visible and hidden variables. With modular design, rules can be extracted from trained RBMs. But as models grow deeper, they lose modularity and interpretability. Autoencoders, GANs, and model-based reinforcement learning may offer ways to address this.
Syncing with Postgres: Logical Replication vs. ETL
August 07, 2025
Neurosymbolic AI: Why, What, and How
The paper (2023) argues for integrating two historically divergent traditions in artificial intelligence (neural networks and symbolic reasoning) into a unified paradigm called Neurosymbolic AI. It argues that the path to capable, explainable, and trustworthy artificial intelligence lies in marrying perception-driven neural systems with structure-aware symbolic models.
The authors lean on Daniel Kahneman’s story of two systems in the mind (Thinking Fast and Slow). Neural networks are the fast ones: pattern-hungry, intuitive, good with unstructured mess. Symbolic methods are the slow ones: careful, logical, good with rules and plans. Neural networks, especially in their modern incarnation as large language models (LLMs), excel at pattern recognition, but fall short in tasks demanding multi-step reasoning, abstraction, constraint satisfaction, or explanation. Conversely, symbolic systems offer interpretability, formal correctness, and composability, but tend to be brittle (not incremental/monotonic), difficult to scale, and poorly suited to noisy or incomplete inputs.
The paper argues that true AI systems must integrate both paradigms, leveraging the adaptability of neural systems while grounding them in symbolic structure. This argument is compelling, particularly in safety-critical domains like healthcare or law, where transparency and adherence to rules are essential.
The paper divides Neurosymbolic AI into two complementary approaches: compressing symbolic knowledge into neural models, and lifting neural outputs into symbolic structures.
The first approach is to embed symbolic structures such as ontologies or knowledge graphs (KGs) into neural vector spaces suitable for integration with neural networks. This can be done via embedding techniques that convert symbolic structures into high-dimensional vectors, or through more direct inductive bias mechanisms that influence a model's architecture. While this enables neural systems to make use of background knowledge, it often loses semantic richness in the process. The neural model benefits from the knowledge, but the end-user gains little transparency, and the symbolic constraints are difficult to trace or modify. Nevertheless, this approach scales well and offers modest improvements in cognitive tasks like abstraction and planning.
The second approach works in the opposite direction. It begins with the neural representation and lifts it into symbolic form. This involves extracting structured symbolic patterns, explanations, or logical chains of reasoning from the output of neural systems. One common approach is through federated pipelines where a large language model decomposes a query into subtasks and delegates those to domain-specific symbolic solvers, such as math engines or search APIs. Another strategy involves building fully differentiable pipelines where symbolic constraints, expert-defined rules, and domain concepts are embedded directly into the neural training process. This allows for true end-to-end learning while preserving explainability and control. These lifting-based systems show the greatest potential: they not only maintain large-scale perception but also achieve high marks in abstraction, analogy, and planning, along with excellent explainability and adaptability.
The case study in a mental health application shows promise. The system's ability to map raw social media text to clinical ontologies and generate responses constrained by medical knowledge illustrates the potential of well-integrated symbolic and neural components. However, these examples also hint at the limitations of current implementations: it is not always clear how the symbolic reasoning is embedded or whether the system guarantees consistency under update or multi-agent interaction.
Knowledge graphs versus symbolic solvers
The paper claims that knowledge graphs (KGs) are especially well-suited for this integration, serving as the symbolic scaffolding that supports neural learning. KGs are graph-structured representations of facts, typically in the form of triples: (subject, predicate, object). KGs are praised for their flexibility, updateability, and ability to represent dynamic real-world entities. But the paper then waves off formal logic, especially first-order logic (FOL), as static and brittle. That's not fair. Knowledge graphs are great for facts: "Marie Curie discovered radium". But when it comes to constraint satisfaction or verifying safety, you'll need real logic. The kind with proofs.
First-order logic is only brittle when you try to do too much with it all at once. Modern logic systems (SMT solvers, expressive type systems, modular specs) can be quite robust. The paper misses a chance here. It doesn't mention the rich and growing field where LLMs and symbolic solvers already collaborate (e.g., GPT writes a function and Z3 checks if it's wrong, and logic engines validate that generated plans do not violate physics or safety).
Knowledge graphs and symbolic logic don’t need to fight; they don't compete like Coke and Pepsi. They are more like peanut butter and jelly. You can use a knowledge graph to instantiate a logic problem. You can generate FOL rules from ontologies. You can use SMT to enforce constraints (e.g., cardinality, ontological coherence). You can even use a theorem prover to validate new triples before inserting them into the graph. You can also run inference rules to expand a knowledge graph deductively.
But the paper doesn't explore how lifted neural outputs could feed into symbolic solvers for planning or synthesis or reasoning. It misses the current push to combine neural generation with symbolic checking, where LLMs propose, and the verifiers dispose in a feedback loop.
MongoDB indexing internals: .showRecordId() and .hint({$natural:1})
You can understand how MongoDB stores documents internally with simple queries that rely on the physical storage ordering. Some databases store records (called rows or tuples) in heap tables, using their physical location in the data files, such as ROWID in Oracle or CTID in PostgreSQL, to reference those records from index entries. In contrast, databases like MySQL's InnoDB or YugabyteDB store records in the primary key index ("clustered index", or "index organized table"), storing them by the logical order of their primary key values, so that secondary indexes point to these logical locations with the primary key, or an encoded version of it.
MongoDB default collections are similar to heap tables because their documents are stored independently of the primary key ("_id") exposed to the application. Internally, the WiredTiger storage engine organizes collection documents using a B+Tree structure, with an internal RecordId as the key, assigned by MongoDB. This structure resembles a clustered index, but it is clustered on an internal key rather than the primary key.
MongoDB’s approach improves on traditional heap tables, especially for storing variable-size documents, because WiredTiger uses B+Tree nodes for efficient space management, reusing space and splitting pages as needed, rather than relying on settings like PCTFREE or FILLFACTOR to reserve space for updates, or SHRINK/VACUUM operations to defragment after deletes.
To cover all cases, with clustered collections, MongoDB can generate the RecordId from the "_id", the primary key exposed to the application, making storage similar to how some databases organize tables in clustered indexes, as "_id" can be a generated ObjectId or defined by the application at insert time. So, when looking at the storage internals, there are two keys:
- "_id" is the application's primary key, a generated surrogate key, or a natural key. It is always indexed, and this unique index, like other secondary indexes, references the document with a RecordId
- RecordId is the internal key. It can be generated from "_id" (in clustered collections), but is more generally generated as a monotonically increasing 64-bit integer during inserts. It can be considered a physical address by the query layer, but it is not directly mapped to an address in the filesystem because files are B+Tree structures. This offers physical data independence since the primary key generation pattern does not affect the storage organization. However, it is helpful to understand how it functions when reviewing execution plans.
Another perspective is that, aside from clustered collections, all indexes in MongoDB, including the primary key index on "_id", are essentially secondary indexes, similar to those found in heap table databases (such as Db2, Oracle, and PostgreSQL). However, instead of a heap table with a fixed block size and row identification tied to their physical location, MongoDB documents are stored within an internal B+Tree index using the WiredTiger engine. Both approaches have their rationale:
- SQL databases are primarily optimized for fixed, normalized schemas with small, uniform row lengths, where a PCTFREE or FILLFACTOR can be set according to the expected updates. Storing larger types involves row chaining or slicing (like Oracle LOB chunks or PostgreSQL TOAST)
- MongoDB is designed for flexible schemas, and collections can contain documents of any size up to the BSON limit. Typically, documents are tens to hundreds of kilobytes, with some larger ones reaching a few megabytes. This flexibility requires adaptable storage management and efforts to minimize fragmentation beyond a small, fixed page size. A B+Tree with a flexible leaf block size is a suitable structure for this purpose.
The document size is flexible, thanks to the storage described above, but the ideal document size is a frequent question. Until I write a blog post on this, here’s a slide: the green area indicates where the most efficient access is, the red side is acceptable for outliers if they don't grow further, but may be a sign of embedding too much, and the blue side works for small documents inserted and queried together, but it may also be a sign that you did unnecessary normalization and should embed more to avoid runtime scattered lookups.
An example to understand the internal ordering
To demonstrate how it works, I generate ten documents and insert them asynchronously, so they may be written to the database in a random order:
db.collection.drop();
Array.from({ length: 10 }, (_, i) => {
db.collection.insertOne({
_id: `#${String(i).padStart(5, '0')}` ,
val: Math.random()
});
});
If I query without any filter or sort, the query planner chooses a COLLSCAN, which reads the records in the order of their RecordId, in the order they were inserted:
test> db.collection.find();
[
{ _id: '#00002', val: 0.07658988613973294 },
{ _id: '#00008', val: 0.39893981577036675 },
{ _id: '#00009', val: 0.5279631881196858 },
{ _id: '#00007', val: 0.8445363162277748 },
{ _id: '#00006', val: 0.01935050813731909 },
{ _id: '#00004', val: 0.0732484258238264 },
{ _id: '#00005', val: 0.7733464850237388 },
{ _id: '#00003', val: 0.3356001641172073 },
{ _id: '#00000', val: 0.8956753135566624 },
{ _id: '#00001', val: 0.4952318922619017 }
]
Keep in mind that I'm working with just a single node here. Sharding and parallel processing might retrieve rows in a different order than how they're stored. You should not rely on any "natural" order. Instead, unless you're conducting this type of investigation, where you're guessing the physical ordering from the query layer, ensure that you use an explicit sort operation to specify the expected order of results.
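For example, if the application expects documents in "_id" order, it should request it explicitly rather than rely on storage order:
// explicit ordering, independent of how the documents are stored
db.collection.find().sort({ _id: 1 })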
I can display the RecordId with .showRecordId(), which adds it to the cursor projection:
test> db.collection.find().showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
Documentation: showRecordId()
Forcing an index with a hint
I can force an index with a hint, for example the index on "_id" which was created automatically:
test> db.collection.find().hint( { _id: 1} ).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This runs an IXSCAN instead of a COLLSCAN and returns the documents in the order of the index. You can verify it with .explain(), but it is also perceptible from the order of the documents fetched, which follows the order of "_id" rather than the order of insertion as before (also called "natural" order).
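For example, a quick sketch of that verification (the plan shape matches the explain() output shown later in this post):
// expected: a FETCH stage whose inputStage is an IXSCAN on the _id_ index
db.collection.find().hint( { _id: 1 } ).explain().queryPlanner.winningPlan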
Rather than using a hint, I can add a filter, and the query planner chooses the index. A filter like {$gt:MinKey} or {$lt:MaxKey} does not change the result, but changes the execution plan to an IXSCAN:
test> db.collection.find({_id:{$gt:MinKey}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
An inequality filter such as {$ne:null} will also run an IXSCAN, and we observe the results fetched in that order:
test> db.collection.find({_id:{$ne:null}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This technique is used to add an unbounded range predicate on the indexed sort field to get the index used for the sort in the absence of an equality predicate: MongoDB Equality, Sort, Range (ESR) without Equality (SR)
Forcing a full scan with a hint for natural order
Hints specify the index definition, and you may wonder how to force a full scan instead of the index scan chosen by the query planner. Remember that it's an index on RecordId that stores the documents. So you can hint this internal index using the $natural operator, asking for the natural order of the collection documents:
test> db.collection.find({_id:{$ne:null}}).hint({$natural:1}).showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
The documents are fetched in order of RecordId from a COLLSCAN. The hint syntax allows an ascending or descending option to start at the beginning or end of the collection. I'm showing this to explain how records are stored internally. However, if you need a specific order, you should use sort() and let the query planner decide whether to use the index to avoid a sort operation.
MongoDB is more than a NoSQL database:
- Like many NoSQL databases, it allows you to query the indexes directly with .hint(), forcing the access path.
- Like all SQL databases, it has a query planner offering data independence, allowing you to declare the collection and expected order with .sort() and let the database optimize the access path.
Avoid combining storage-level instructions, such as .hint(), .min(), or .max(), with declarative query filters in find() or $match, as this can undermine the query planner's guarantees that results match the query predicates. For example, hinting at a partial index might lead to incomplete results.
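As a hedged sketch of that last caveat, with hypothetical collection and field names:
// partial index that only contains documents with status "active"
db.orders.createIndex(
  { status: 1 },
  { partialFilterExpression: { status: "active" } }
);
// this asks for all orders, but the hinted index holds only a subset of them,
// so, per the warning above, the result set can be incomplete
db.orders.find({}).hint({ status: 1 });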
Covering indexes and "_id" projection
Understanding what is stored in the index entries helps optimize queries to use an index-only scan (covering index).
For example, the following query reads the index on "_id" and projects only "_id" (which is included by default) and "val":
test> db.collection.find(
{ _id: { $ne: null } },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { _id: [ '[MinKey, null)', '(null, MaxKey]' ] }
}
}
}
Because the index on "_id" holds only the key ("_id") and RecordId, it must fetch the document (FETCH) before the projection (PROJECTION_SIMPLE). Even if it is a primary index from the application's point of view, it is physically equivalent to a secondary index.
I can see the same with another secondary index:
test> db.collection.createIndex( { val: 1 } );
val_1
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
}
Such a query projects "_id" because it is included by default, so the index on "val" does not cover all projected fields. To avoid the FETCH, I need to exclude "_id" from the projection explicitly:
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 , _id: 0 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_COVERED',
transformBy: { val: 1, _id: 0 },
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
Another possibility: if I need to project "_id", I can add it to the index definition, making it a covering index for my query:
test> db.collection.createIndex( { val: 1 , _id: 1 } );
val_1__id_1
test> db.collection.find(
{
MongoDB indexing internals: .showRecordId() and .hint({$natural:1})
You can understand how MongoDB stores documents internally with simple queries that rely on the physical storage ordering. Some databases store records (called rows or tuples) in heap tables, using their physical location in the data files, such as ROWID in Oracle or CTID in PostgreSQL, to reference those records from index entries. In contrast, databases like MySQL's InnoDB or YugabyteDB store records in the primary key index ("clustered index", or "index organized table"), storing them by the logical order of their primary key values, so that secondary indexes point to these logical locations with the primary key, or an encoded version of it.
MongoDB default collections are similar to heap tables because their documents are stored independently of the primary key ("_id") exposed to the application. Internally, the WiredTiger storage engine organizes collection documents using a B+Tree structure, with an internal RecordId as the key, assigned by MongoDB. This structure resembles a clustered index, but it is clustered on an internal key rather than the primary key.
MongoDB’s approach improves on traditional heap tables, especially for storing variable-size documents, because WiredTiger uses B+Tree nodes for efficient space management, reusing space and splitting pages as needed, rather than relying on settings like PCTFREE or FILLFACTOR to reserve space for updates, or SHRINK/VACUUM operations to defragment after deletes.
To cover all cases, with clustered collections, MongoDB can generate the RecordId from the "_id", the primary key exposed to the application, making storage similar to how some databases organize tables in clustered indexes, as "_id" can be a generated ObjectId or defined by the application at insert time. So, when looking at storage internals levels, there are two keys:
-
"_id" is the application's primary key, a generated surrogate key, or a natural key. It is always indexed, and this unique index, like other secondary indexes, references the document with a RecordId
-
RecordId is the internal key. It can be generated from "_id" (in clustered collections), but is more generally generated as a monotonically increasing 64-bit integer during inserts. It can be considered a physical address by the query layer, but it is not directly mapped to an address in the filesystem because files are B+Tree structures.
This offers physical data independence since the primary key generation pattern does not affect the storage organization. However, it is helpful to understand how it functions when reviewing execution plans.
Another perspective is that, aside from clustered collections, all indexes in MongoDB, including the primary key index on "_id", are essentially secondary indexes, similar to those found in heap table databases (such as Db2, Oracle, and PostgreSQL). However, instead of a heap table with a fixed block size and row identification tied to their physical location, MongoDB documents are stored within an internal B+Tree index using the WiredTiger engine. Both approaches have their rationale:
- SQL databases are primarily optimized for fixed, normalized schemas with small, uniform row lengths, where a PCTFREE or FILLFACTOR can be set according to the expected updates. Storing larger types involves row chaining or slicing (like Oracle LOB chunks or PostgreSQL TOAST).
- MongoDB is designed for flexible schemas, and collections can contain documents of any size up to the BSON limit. Typically, documents are tens to hundreds of kilobytes, with some larger ones reaching a few megabytes. This flexibility requires adaptable storage management and efforts to minimize fragmentation beyond a small, fixed page size. A B+Tree with a flexible leaf block size is a suitable structure for this purpose.
The document size is flexible, thanks to the storage described above, but the ideal document size is a frequent question. Until I write a blog post on this, here is the summary of a slide I use: the green area indicates where access is most efficient; the red side is acceptable for outliers as long as they don't grow further, but may be a sign of embedding too much; and the blue side works for small documents inserted and queried together, but may also be a sign of unnecessary normalization, where you should embed more to avoid scattered lookups at runtime.
An example to understand the internal ordering
To demonstrate how it works, I generate ten documents and insert them asynchronously, so they may be written to the database in a random order:
db.collection.drop();
Array.from({ length: 10 }, (_, i) => {
db.collection.insertOne({
_id: `#${String(i).padStart(5, '0')}` ,
val: Math.random()
});
});
If I query without any filter or sort, the query planner chooses a COLLSCAN, which reads the records in the order of their RecordId, that is, the order in which they were inserted:
test> db.collection.find();
[
{ _id: '#00002', val: 0.07658988613973294 },
{ _id: '#00008', val: 0.39893981577036675 },
{ _id: '#00009', val: 0.5279631881196858 },
{ _id: '#00007', val: 0.8445363162277748 },
{ _id: '#00006', val: 0.01935050813731909 },
{ _id: '#00004', val: 0.0732484258238264 },
{ _id: '#00005', val: 0.7733464850237388 },
{ _id: '#00003', val: 0.3356001641172073 },
{ _id: '#00000', val: 0.8956753135566624 },
{ _id: '#00001', val: 0.4952318922619017 }
]
Keep in mind that I'm working with just a single node here. Sharding and parallel processing might retrieve rows in a different order than how they're stored. You should not rely on any "natural" order. Instead, unless you're conducting this type of investigation, where you're guessing the physical ordering from the query layer, ensure that you use an explicit sort operation to specify the expected order of results.
I can display the RecordId with .showRecordId(), which adds it to the cursor projection:
test> db.collection.find().showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
Documentation: showRecordId()
Forcing an index with a hint
I can force an index with a hint, for example the index on "_id" which was created automatically:
test> db.collection.find().hint( { _id: 1} ).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This runs an IXSCAN instead of a COLLSCAN and returns the documents in the order of the index. You can verify it with .explain(), but it is also perceptible from the order of the documents fetched, which follows the order of "_id" rather than the order of insertion as before (also called "natural" order).
Rather than using a hint, I can add a filter, and the query planner chooses the index. A filter like {$gt:MinKey} or {$lt:MaxKey} does not change the result, but changes the execution plan to an IXSCAN:
test> db.collection.find({_id:{$gt:MinKey}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
A filter like {$ne:null} also runs an IXSCAN, and we observe the results fetched in index order:
test> db.collection.find({_id:{$ne:null}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This technique is used to add an unbounded range predicate on the indexed sort field to get the index used for the sort in the absence of an equality predicate: MongoDB Equality, Sort, Range (ESR) without Equality (SR)
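As a hedged sketch of that technique, assume a hypothetical collection "events" with an index on a "created" field that is always present; the unbounded predicate returns the same documents, but makes the sort field eligible for an IXSCAN that returns them already sorted:
// hypothetical collection and field; an index { created: 1 } is assumed to exist
// the unbounded range filters nothing out (every document has "created"),
// but it lets the planner use the index and skip a blocking sort
db.events.find({ created: { $gt: MinKey } }).sort({ created: 1 });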
Forcing a full scan with a hint for natural order
Hints specify the index definition, and you may wonder how to force a full scan instead of the index scan chosen by the query planner. Remember that it's an index on RecordId that stores the documents. So you can hint this internal index using the $natural operator - asking for the natural order of the collection documents:
test> db.collection.find({_id:{$ne:null}}).hint({$natural:1}).showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
The documents are fetched in order of RecordId from a COLLSCAN. The hint syntax allows an ascending or descending option to start at the beginning or end of the collection. I'm showing this to explain how records are stored internally. However, if you need a specific order, you should use sort() and let the query planner decide whether to use the index to avoid a sort operation.
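For example, a minimal way to ask for a deterministic order on this collection, letting the planner use the "_id" index and avoid a blocking sort:
// explicit ordering instead of relying on "natural" (insertion) order
db.collection.find().sort({ _id: 1 });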
MongoDB is more than a NoSQL database:
- Like many NoSQL databases, it allows you to query the indexes directly with .hint(), forcing the access path
- Like all SQL databases, it has a query planner offering data independence, allowing you to declare the collection and expected order with .sort() and let the database optimize the access path.
Avoid combining storage-level instructions, such as .hint(), .min(), or .max(), with declarative query filters in find() or $match, as this can undermine the query planner's guarantees that results match the query predicates. For example, hinting at a partial index might lead to incomplete results.
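To illustrate the partial-index risk with a hypothetical example (the "orders" collection and its partial index are assumptions, not part of the demo above): the hinted index only contains a subset of the documents, so an unrestricted query hinted to it may silently miss the others.
// hypothetical: a partial index that only indexes paid orders
db.orders.createIndex(
  { status: 1 },
  { partialFilterExpression: { status: { $eq: "paid" } } }
);
// hinting it for a query on all orders can return only the "paid" ones
db.orders.find({}).hint({ status: 1 });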
Covering indexes and "_id" projection
Understanding what is stored in the index entries helps optimize queries to use an index-only scan (covering index).
For example, the following query reads the index on "_id" and projects only "val" and "_id" (which is included by default):
test> db.collection.find(
{ _id: { $ne: null } },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { _id: [ '[MinKey, null)', '(null, MaxKey]' ] }
}
}
}
Because the index on "_id" holds only the key ("_id") and RecordId, it must fetch the document (FETCH) before the projection (PROJECTION_SIMPLE). Even if it is a primary index from the application's point of view, it is physically equivalent to a secondary index.
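Conversely, if I project only "_id", the "_id" index alone should be enough and no FETCH stage is expected; a quick check (the winning plan should show PROJECTION_COVERED over an IXSCAN):
// projecting only "_id" should turn this into an index-only (covered) plan
db.collection.find(
  { _id: { $ne: null } },
  { _id: 1 }
).explain().queryPlanner.winningPlan;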
I can see the same with another secondary index:
test> db.collection.createIndex( { val: 1 } );
val_1
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
}
Such a query still projects "_id", because it is included by default, so the index on "val" does not cover all projected fields. To avoid the FETCH, I need to exclude "_id" from the projection explicitly:
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 , _id: 0 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_COVERED',
transformBy: { val: 1, _id: 0 },
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
Another possibility: if I need to project "_id", I can add it to the index definition, making it a covering index for my query:
test> db.collection.createIndex( { val: 1 , _id: 1 } );
val_1__id_1
test> db.collection.find(
{
LDAP Isn’t Going Away, and Neither Is Our Support for Percona Server for MongoDB
As enterprise software vendors race toward proprietary cloud ecosystems, some features long relied upon by businesses are being quietly deprecated. One recent example is MongoDB Enterprise Advanced and Atlas dropping support for LDAP authentication, a foundational identity protocol for countless organizations. At Percona, we’re taking a different path. We’ve supported LDAP in Percona Server for MongoDB for […]
August 06, 2025
Can a Client–Server Cache Tango Accelerate Disaggregated Storage?
This paper from HotStorage'25 presents OrcaCache, a design proposal for a coordinated caching framework tailored to disaggregated storage systems. In a disaggregated architecture, compute and storage resources are physically separated and connected via high-speed networks. These became increasingly common in modern data centers as they enable flexible resource scaling and improved fault isolation. (Follow the money as they say!) But accessing remote storage introduces serious latency and efficiency challenges. The paper positions OrcaCache as a solution to mitigate these challenges by orchestrating caching logic across clients and servers. Important note: in the paper's terminology the server means the storage node, and the client means the compute node.
As we did last week for another paper, Aleksey and I live-recorded our reading/discussion of this paper. We do this to teach the thought-process and mechanics of how experts read papers in real time. Check our discussion video below (please listen at 1.5x, I sound less horrible at that speed). The paper I annotated during our discussion is also available here.
The problem
Caching plays a crucial role in reducing the overheads of disaggregated storage, but the paper claims that current strategies (client-local caching, server-only caching, and independent client-server caching) fall short. Client-local caching is simple and avoids server overhead but underutilizes memory on the server. Server-only caching can reduce backend I/O pressure but comes at the cost of network round-trips and significant server CPU load. Independent client-server caching combines the two but lacks coordination between the caches, leading to data duplication, inefficient eviction and prefetching policies, and causes fairness issues in multi-client environments.
The proposed design
OrcaCache proposes to address these shortcomings by shifting the cache index and coordination responsibilities to the client side. Clients maintain a global view of the cache and communicate directly with the server-side cache using RDMA, which enables bypassing the server CPU in the common case. Server-side components are minimized to a daemon that tracks resource usage and allocates memory based on fairness and pressure.
Discussion
OrcaCache stops short of addressing the core system-level challenges in a realistic multi-client deployment. A single-server, single-client setup is used in the experiments of Figure 1, and also for most of the description in the paper. The paper's solution for dealing with multiple clients is to use a separate namespace for each client, but this consumes a lot of server-side resources and causes duplication of cached items. There is no mutual benefit or collaboration among clients in this setup.
The paper also mentions how clients could interact with a server-side daemon, how RDMA-based lookups and cache updates would be issued, and how resources might be allocated based on monitored pressure, but many of these mechanisms remain speculative. The authors mention flexible eviction and prefetching but do not explore the complexity of maintaining consistency or fairness across diverse workloads. AI/ML workloads are mentioned or alluded to, but not actually tested in the paper.
In the end, the paper's contribution lies more in reopening a line of thought from 1990s cooperative caching and global memory management research: how to make cache coherence across disaggregated compute and storage both efficient and scalable. The idea OrcaCache seems to lean on is that rather than burden the server, it makes the client responsible for coordination, enabled by fast networks and abundant memory.
Also despite the title, there was not much Tango in the paper. It was mostly cache.
MySQL 8.0 End of Life Date: What Happens Next?
If you’re running MySQL 8.0 databases, you need to know this: Oracle will stop supporting them in April 2026. That means no more security patches, bug fixes, or help when things go wrong. Maybe you’re thinking, “But April 2026 feels far away!“. But once that date hits, every day you keep running MySQL 8.0 makes […]
Transaction Healing: Scaling Optimistic Concurrency Control on Multicores
This paper from SIGMOD 2016 proposes a transaction healing approach to improve the scalability of Optimistic Concurrency Control (OCC) in main-memory OLTP systems running on multicore architectures. Instead of discarding the entire execution when validation fails, the system repairs only the inconsistent operations to improve throughput in high-contention scenarios.
If this sounds familiar, it's because we recently reviewed the Morty paper from EuroSys 2023, which applied healing ideas to interactive transactions using continuations to support re-execution. This 2016 Transaction Healing paper is scoped to static stored procedures, and focuses more on integrating healing into OCC for stored procedures.
Key Ideas
OCC works well under low contention because it separates reads from writes and keeps critical sections short (only for validation). But under high contention, especially in workloads with skewed access patterns (like Zipfian distributions), transactions are frequently invalidated by concurrent updates. The naive OCC response of abort and restart leads to wasting CPU cycles and degrading cache locality.
Transaction healing aims to address this problem by observing/betting that most validation failures affect only a subset of a transaction's operations. If only the affected operations can be detected and recovered, the system can avoid redoing the entire transaction. They implement this by leveraging two components.
First, a static analysis phase extracts operation dependencies from the stored procedure a priori. The dependency analysis distinguishes between two types of relations: key-dependencies, where the result of one operation determines the lookup key for another; and value-dependencies, where the value produced by one operation is used in a subsequent one. With this graph in hand, transaction healing can surgically repair any non-serializable operation at runtime.
Second, a runtime access cache, maintained per thread, tracks the behavior of each executed operation (its inputs, outputs, effects, and the memory addresses it accessed) and identifies conflicted parts of a transaction at runtime. The access cache supports this by recording memory addresses (avoiding repeated index lookups) and allowing efficient reuse of unaffected results.
Transaction healing
The healing process is triggered during the validation phase, when an inconsistency is detected in the read/write set. Rather than aborting immediately, the system identifies the earliest affected operation (using its dependency graph), and restores it. If the operation is value-dependent, healing updates its effects based on cached inputs and outputs. If it's key-dependent, a re-execution is necessary since the accessed record may change. The healing propagates forward through the dependency graph, recursively restoring all operations affected by the initial inconsistency.
The healing mechanism is built to preserve serializability. Validation acquires locks in a globally consistent order (e.g., sorted by memory address) to avoid deadlocks. If during healing a lock must be acquired out of order (e.g., due to new dependencies introduced by re-executed operations), the transaction is aborted in order not to risk a deadlock. The paper says this situation is rare due to validation-order optimizations. Despite occasional aborts, transaction healing guarantees forward progress and eventual termination: each transaction's read/write set is finite and every element is validated at most once, which ensures that healing either succeeds or fails definitively.
Evaluation Highlights
They implemented a C++ in-memory database engine, THEDB, to test these ideas. THEDB employs LLVM to perform static dependency analysis on stored procedures and includes support for standard database features like inserts, deletes, and range queries (the latter protected against phantoms via B+-tree versioning, as in Silo). The authors evaluate THEDB on a 48-core AMD machine using two common benchmarks: TPC-C and Smallbank. THEDB is compared against five systems: variants of OCC (including Silo-style), 2PL, a hybrid OCC-2PL approach, and a deterministic partitioned system.
The results show that, under high contention, THEDB significantly outperforms the alternatives, achieving up to 6.2x higher throughput than Silo and approaching the performance of an idealized OCC system with validation disabled. This shows that transaction healing adds minimal overhead and successfully eliminates the restart costs that dominate OCC's performance under load. Moreover, THEDB maintains stable throughput as contention increases (e.g., under more skewed Zipfian distributions), while traditional OCC and Silo degrade rapidly. Scalability is also great up to 48 cores.
Discussion
What are the limitations of the static analysis used?
Transaction healing proposed here is limited to stored procedures because it relies on static dependency extraction. Unlike Morty, which handles interactive transactions using runtime continuations, this work cannot deal with dynamic control flow or unknown transaction logic at runtime. As a result, ad-hoc queries revert to standard OCC, where any healing benefit is lost.
On the other hand, there is some subtlety here. Transaction healing does not require read/write sets to be declared in advance as the deterministic systems like Calvin do. Deterministic systems must know the exact records a transaction will access before it begins execution, so they can assign transactions to partitions and establish a global execution order. Transaction healing avoids this rigidity. It doesn't need to know which specific records a transaction will access ahead of time. Instead, it relies on static analysis to extract the structure of the transaction logic, namely which operations depend on which others. These dependencies, such as key or value dependencies between operations, are known statically because the transaction logic is written as a stored procedure. But the actual keys and values involved are discovered dynamically as the transaction executes. The system uses an access cache to record which memory locations were read or written, and validation happens afterward. This flexibility allows transaction healing to support dynamic, cross-partition access patterns without prior declaration.
How does this compare with Morty?
Transaction Healing is designed for in-memory OLTP systems running with OCC on multicore machines, where the workload consists of static stored procedures. Morty, in contrast, is built for a distributed, geo-replicated system and handles interactive transactions with dynamic control flow. It uses MVTSO, with speculative execution and a priori ordering. Unlike THEDB, Morty allows transactions to read from uncommitted versions, exposing concurrency that traditional systems suppress. It tracks execution through continuation-passing style (CPS) in order to make control dependencies explicit and enable partial re-execution of logic branches. While transaction healing employed LLVM to automatically perform static dependency analysis on stored procedures, Morty did not automate the translation of transaction programs into CPS form. Finally, since it is distributed and deployed over WAN, Morty integrates concurrency control with replication to reduce latency and uses quorum voting to maintain fault-tolerant correctness without centralized logging.
August 05, 2025
Planning Ahead for PostgreSQL 18: What Matters for Your Organization
PostgreSQL 18 is on the way, bringing a set of improvements that many organizations will find useful. It’s not a revolutionary release, but it does move things in a good direction, especially in performance, replication, and simplifying daily operations. For teams already using PostgreSQL, it’s a good time to look into what’s new. For others […]
Analysing Snapshot Isolation
This paper (PODC'2016) presents a clean and declarative treatment of Snapshot Isolation (SI) using dependency graphs. It builds on the foundation laid by prior work, including the SSI paper we reviewed recently, which had already identified that SI permits cycles with two adjacent anti-dependency (RW) edges, the so-called inConflict and outConflict edges. While the SSI work focused on algorithmic results and implementation, this paper focuses more on the theory (this is PODC after all) of defining a declarative dependency-graph-based model for SI. It strips away implementation details such as commit timestamps and lock management, and provides a purely symbolic framework. It also proves a soundness result (Theorem 10), and leverages the model for two practical static analyses: transaction chopping and robustness under isolation-level weakening.
Soundness result and dependency graph model
Let's begin with Theorem 10, which establishes both the soundness and completeness of the dependency graph characterization of SI. The soundness direction states that any dependency graph satisfying the SI condition (i.e., every cycle contains at least two adjacent RW edges) corresponds to a valid SI execution. The completeness direction, which follows from prior work, asserts that every valid SI execution induces such a dependency graph. The proof of soundness is technically involved, requiring the authors to construct valid SI executions from dependency graphs by solving a system of relational constraints that preserve the required visibility and ordering properties.
Building on foundational work by Adya, this model represents executions as graphs whose nodes are transactions and whose edges capture observable transactional dependencies in terms of 3 edge types: write-read (WR), write-write (WW), and the anti-dependency capturing read-write (RW) edges. The SI, Serializability (SER), and Parallel SI (PSI) isolation levels are then defined in terms of structural properties of these graphs, specifically by the presence or absence of certain cycles. This abstraction supports symbolic reasoning about anomalies like write skew or long fork, which manifest as specific, checkable subgraphs. Informally, a WR edge from T to S means that S reads T's write to some object x; a WW edge means that S overwrites T's write; and a RW edge indicates that S overwrites the value of x read by T, introducing an anti-dependency.
Definition 4 and Figure 1 provide an elegant axiomatization of abstract executions. The visibility relation (VIS) must always be a subset of the commit order (CO), and in the case of Serializability, the two are equal. In my mind, this captures the key conceptual divide between SER and SI: Serializability enforces a total order over committed transactions, whereas SI permits partial orders.
Figure 2 illustrates the anomalies that differentiate SER, SI, and PSI. Figure 2(d) captures the classic write skew anomaly, which SI allows but SER prohibits. This scenario arises when two transactions read disjoint keys and then write disjoint values based on those reads, each unaware of the other's effects. SI permits this since it allows partial visibility so long as snapshots are consistent. On the other hand, the long fork anomaly shown in Figure 2(c) is prohibited by SI but allowed by PSI, which weakens the snapshot guarantees further.
Applications of the model
The second half of the paper shows applications of the model for static analyses. The first application is transaction chopping, where large transactions are split into smaller subtransactions to improve performance. The challenge here is to ensure that the interleaving of chopped pieces does not introduce new behaviors/anomalies that the original monolithic transaction would have prevented. This is captured through spliceability: whether an execution of chopped transactions can be "stitched back" into an execution that would have been legal under SI for the unchopped program. Spliceability is formulated through a chopping graph, which augments standard dependencies with session-local ordering among chopped subtransactions. A cycle in the chopping graph DCG(G) is considered critical if (1) it does not contain two occurrences of the same vertex, (2) it includes a sequence of three edges where a conflict edge is followed by a session (predecessor) edge and then another conflict edge, and (3) any two RW (anti-dependency) edges in the cycle are not adjacent. Such critical cycles represent dependency patterns that cannot be reconciled with the atomicity guarantees expected by the original transaction, and thus cannot be realized under SI. Figures 4, 5 and 6 illustrate how small structural differences in the chop can lead to either results that are sound (Figure 6) or unsound (Figure 5 creates a critical cycle). Compared to serializability, SI's more relaxed visibility rules allow for a wider range of safe chops, but care must still be taken to avoid dependency structures that violate snapshot consistency.
The second application of the dependency graph model is in analyzing robustness across isolation levels. The central question is whether a program behaves identically under SI and a weaker or stronger model. An interesting case here is the relation between SI and Parallel SI (PSI). We covered PSI in our earlier review of Walter (SOSP 2011). PSI weakens SI by discarding the prefix requirement on snapshots: it ensures only that visibility is transitive, not that it forms a prefix of the commit order. Thus, PSI admits behaviors that SI prohibits. Theorem 22 formalizes one such divergence. It shows that if a cycle in the dependency graph contains at least two RW edges and no two of them are adjacent, then this cycle is allowed under PSI but not under SI. This captures the long fork anomaly, in which concurrent writers are seen inconsistently by different readers (each reader forming a different branch of the history).
To illustrate the long fork, consider a cycle where T1 and T2 are concurrent writers, and two readers, T3 and T4, observe them inconsistently.
- T1 --WR--> T3
- T2 --WR--> T4
- T3 --RW--> T2
- T4 --RW--> T1
In this scenario, T3 sees T1's write but not T2's, and T4 sees T2's write but not T1's. Both readers construct transitive but incompatible snapshots that fork the timeline. SI prohibits this because it cannot construct prefix-closed snapshots that explain both T3 and T4's observations. But since PSI lacks the prefix constraint, it allows this behavior, while still disallowing anomalies like lost update (through its NOCONFLICT axiom).
Robustness from SI to PSI therefore requires ruling out that specific structural pattern: cycles with multiple RW edges where none are adjacent. If such a cycle appears in the dependency graph, PSI will admit the behavior, while SI will not, and robustness would fail.
Discussion
This does invite comparison to the Seeing is Believing (SiB) paper (PODC'17), one of my favorite papers, and its state-centric formulation of isolation guarantees. In SiB, executions are modeled as sequences of global states and snapshots. Transactions observe one of these states and transition the system to a new one. Isolation models are defined in terms of whether there exists a sequence of global states consistent with the observations and effects of each transaction.
While structurally different, the two models are not in conflict. It appears feasible to translate between the dependency graph and state-centric views. The SI model used in this PODC 2016 paper already adopts a declarative, axiomatic approach centered on visibility and commit order, which is close to SiB.
For static program analysis, the dependency graph model seems to offer advantages. By abstracting away from global states, it allows symbolic reasoning directly over transactional dependencies. This makes it well-suited to analyses like transaction chopping and robustness checking, which rely on detecting structural patterns such as cycles with certain edge configurations. While the SiB model is semantically expressive and well-suited to observational reasoning, it may be less conducive to structural checks like cycle-freedom or anti-dependency adjacency.
Transaction performance 👉🏻 retry with backoff
A benchmark sponsored by EDB, a PostgreSQL company, in 2019 contributed to the myth that MongoDB transactions are slow. Even though the work was done by the reputable OnGres team, the code wasn't properly designed to test MongoDB's scalability. At that time, the feature was new, likely not well-documented, and some demos overlooked the retry logic. In this context, no one is to blame for past publications, but analyzing this benchmark will help prevent the spread of these myths.
MongoDB uses lock-free optimistic concurrency control (OCC) with fail-on-conflict: as soon as a write detects concurrent changes to its MVCC (Multi-Version Concurrency Control) snapshot, it raises an error, requiring applications to manage transient errors differently than a traditional RDBMS with pessimistic locking and wait-on-conflict behavior. The benchmark developers, PostgreSQL experts, likely missed this because they based the benchmark on a MongoDB demo focused on capabilities, not performance, which neglected proper concurrency-control handling.
We should disregard this benchmark today, but this blog post series offers an opportunity to analyze its flaws, debunk myths, and educate readers on effective transaction handling in MongoDB applications.
The problematic code in the MongoDB 4.0 demo from 7 years ago was:
def run_transaction_with_retry(functor, session):
assert (isinstance(functor, Transaction_Functor))
while True:
try:
with session.start_transaction():
result=functor(session) # performs transaction
commit_with_retry(session)
break
except (pymongo.errors.ConnectionFailure, pymongo.errors.OperationFailure) as exc:
# If transient error, retry the whole transaction
if exc.has_error_label("TransientTransactionError"):
print("TransientTransactionError, retrying "
"transaction ...")
continue
else:
raise
return result
It was translated to Java in the benchmark code as:
private void runWithRetry() {
while (true) {
try {
runnable.run();
break;
} catch (RetryUserOperationException ex) {
retryMeter.mark();
continue;
}
}
}
If you are familiar with Optimistic or Fail-on-Conflict Concurrency Control, you may recognize a significant issue: there is no wait (backoff) before retry. With such an infinite loop, high concurrency access acts like a DDoS attack on the database, rather than resolving contention.
A typical retry loop implements exponential backoff, and here is an example:
private void runWithRetry() {
final long initialDelayMillis = 5; // start with 5ms
final long maxDelayMillis = 1000; // max wait of 1s
long delay = initialDelayMillis;
while (true) {
try {
runnable.run();
break;
} catch (RetryUserOperationException ex) {
retryMeter.mark();
try {
// jitter by up to 50% to avoid thundering herd
long jitter = (long) (Math.random() * delay / 2);
long sleep = delay + jitter;
Thread.sleep(sleep);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
// Optionally log or handle interruption here
throw new RuntimeException("Retry loop interrupted", ie);
}
delay = Math.min(delay * 2, maxDelayMillis);
continue;
}
}
}
This code makes the first retry wait 5–7.5 ms, then 10–15 ms, 20–30 ms, and so on, up to 1000–1500 ms. If you use Spring Data, you can simply annotate your @Transactional method with:
@Retryable(
value = RetryUserOperationException.class,
maxAttempts = 10,
backoff = @Backoff(
delay = 5, // first delay in ms
maxDelay = 1000, // max delay in ms
multiplier = 2, // exponential
random = true // adds jitter
)
)
MongoDB employs a similar approach for auto-commit single-document transactions, transparently, so that it appears as if the application is waiting for a lock to be acquired. However, it cannot automatically cancel and retry explicit transactions where the application might perform non-transactional work, such as writing to a file, pushing to a queue, or sending an email. For transactions involving multiple statements, no database can automatically retry the process. The application itself must handle retries.
In PostgreSQL, a conflict can also raise a retriable error: a serialization failure under higher isolation levels, or a deadlock, which can occur even under the Read Committed isolation level. PostgreSQL locks rows while writing during a transaction, using two-phase locking, and typically waits for the lock to be released. In this case, the impact of an inefficient retry loop is minimal.
However, MongoDB is optimized for high concurrency, allowing it to avoid holding locks between database calls. Instead of waiting, it detects write conflicts instantly and raises a retriable error. Therefore, implementing an efficient retry mechanism is essential.
As I mentioned earlier, there's no one to blame for the benchmark's flaws, as it was created when transactions in MongoDB were relatively new and perhaps not well documented. The problem is that people still reference this benchmark without understanding what was wrong. The poor performance was due to unnecessary retries, because no backoff was implemented in the retry loop.
The authors of the benchmark pointed to documentation that they believed explained this behavior, which likely contributed to their decision not to implement backoff in the application, mistakenly thinking it was handled by the database:
Since the probability of collision increases (possibly exponentially) with the effective number of transactions processed, it follows that MongoDB is more eager to retry transactions. This is consistent with the expectation set on MongoDB’s documentation about transactions and locking, which states that “by default, transactions waits up to 5 milliseconds to acquire locks required by the operations in the transaction. If the transaction cannot acquire its required locks within the 5 milliseconds, the transaction aborts”. This behavior can be changed by setting the maxTransactionLockRequestTimeoutMillis parameter.
What is called "lock" here is different from what SQL databases call "lock" with two-phase locking transactions where locks are acquired for the duration of the transaction. MongoDB is lock-free in that sense, using optimistic concurrency control rather than locking. What is called "lock" here is more similar to what SQL databases call "latch" or "lightweight locks", which are short duration and do not span multiple database calls. For such wait, five milliseconds is a good default. But this is not what the benchmark experienced.
Such a timeout would raise the following error: Unable to acquire lock ... within a max lock request timeout of '5ms' milliseconds.
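For reference, that lock timeout is a server parameter and can be changed at runtime; a minimal sketch (the 20 ms value is arbitrary):
// raises the short lock-acquisition timeout from its 5 ms default (value is arbitrary)
db.adminCommand({ setParameter: 1, maxTransactionLockRequestTimeoutMillis: 20 });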
What the benchmark catches in the retry loop is a write conflict, which happens before trying to acquire such a short lock: Command failed with error 112 (WriteConflict): 'Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.'
Such a write conflict has nothing to do with maxTransactionLockRequestTimeoutMillis: it doesn't even try to acquire a lock, because there is nothing it can write without breaking transaction isolation (the 'I' in 'ACID'). When reading, it has detected that the read snapshot, the state as of the beginning of the transaction, has been modified by another transaction. It doesn't wait, because the snapshot would be stale if the other transaction commits, and it immediately returns to the application. The application must compensate or roll back what it did during the transaction, wait a small amount of time (the exponential backoff), and retry.
In PostgreSQL, when operating under the Read Committed isolation level, it can wait because it allows reading a state that may not be consistent with the transaction's start time. If a concurrent transaction commits during this time, PostgreSQL simply continues to read the committed data, mixing data from different states. This is not permitted at higher isolation levels, like Serializable, where a transient error must be raised, as in MongoDB, to guarantee ACID properties. However, PostgreSQL uses a pessimistic locking approach and waits to see whether the other transaction commits, allowing it to retry immediately once the conflict is resolved. This is why retry logic without backoff does not have the same consequences there.
You may wonder why MongoDB doesn't implement waiting like PostgreSQL does. PostgreSQL is designed for a single-writer instance, which cannot scale horizontally, making it simple to use a wait queue in shared memory. However, when sharding PostgreSQL using the Citus extension, this design breaks down, leading to eventual consistency for cross-shard reads. In contrast, MongoDB is built for horizontal scalability and opts for optimistic concurrency control instead of a distributed wait queue, providing consistent cross shard reads across nodes (when the read concern is set to majority).
I prefer not to link the benchmark paper to avoid helping search engines or LLM crawlers find outdated content, but it is easy to find. The benchmark code is available in a repository. I prefer to link to the MongoDB transaction documentation instead. Now you know where the myth about slow transactions comes from: incorrect understanding of MongoDB lock-free ACID transactions.
There's more to say about this benchmark. In the benchmark code, there's a hotspot on the "audit" table which is indexed for the PostgreSQL definition, but not for the MongoDB definition. This is visible as MongoDB logs slow queries by default:
mongo-1 | {"t":{"$date":"2025-08-05T21:31:22.655+00:00"},"s":"I", "c":"WRITE", "id":51803, "ctx":"conn31"
,"msg":"Slow query","attr":{"type":"update","isFromUserConnection":true,"ns":"postgres.audit","collectionType":"normal"
,"command":{"q":{"schedule_id":4778,"day":{"$date":"2025-08-05T00:00:00.000Z"}},"u":{"$set":{"date":{"$date":"2025-08-05T21:31:22.533Z"}},"$inc":{"seats_occupied":1}},"multi":false,"upsert":true}
,"planSummary":"COLLSCAN","planningTimeMicros":255,"keysExamined":0,"docsExamined":5962,"nMatched":1,"nModified":1,"nUpserted":0,"keysInserted":0,"keysDeleted":0,"numYields":0,"planCacheShapeHash":"99470B66","queryHash":"99470B66","planCacheKey":"031BFB16"
,"locks":{"MultiDocumentTransactionsBarrier":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":5}}},"flowControl":{"acquireCount":1},"readConcern":{"level":"snapshot","provenance":"clientSupplied"},"storage":{"data":{"txnBytesDirty":128}},"cpuNanos":32601533,"remote":"172.18.0.6:42292","queues":{"execution":{"admissions":1},"ingress":{"admissions":1}},"workingMillis":121,"durationMillis":121}}
To improve performance and scalability, create indexes and avoid hotspots. If hotspots are unavoidable, fail fast by performing operations subject to write conflict early in the transaction, rather than at the end as done here. The data model should allow critical transactions to be single-document, avoiding the need for normalization across multiple tables, but this benchmark uses the same normalized data model on both databases. Finally, no real application would perform a business transaction like this: reserving a flight seat, recording payment, and incrementing an audit counter all in one database transaction. You don't want to hold database state and locks while waiting for a payment validation that typically depends on an external service.
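For example, here is a minimal sketch of an index that would match the update's filter shown in the slow-query log above (collection and field names are taken from that log entry):
// avoids the COLLSCAN on the audit collection for the {schedule_id, day} update
db.audit.createIndex({ schedule_id: 1, day: 1 });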