JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols
There is an essential difference between MongoDB’s BSON and PostgreSQL’s JSONB. Both are binary JSON formats, but they serve different roles. JSONB is purely an internal storage format for JSON data in PostgreSQL. BSON, on the other hand, is MongoDB’s native data format: it is used by application drivers, over the network, in memory, and on disk.
JSONB: PostgreSQL internal storage format
JSONB is a storage format, as defined by the PostgreSQL documentation:
PostgreSQL offers two types for storing JSON data: json and jsonb
PostgreSQL uses JSONB solely for internal storage, requiring the entire structure to be read to access a field, as observed in JSONB DeTOASTing (read amplification).
BSON: MongoDB storage and exchange format
BSON is used for storage and also as an exchange format between the application and the database, as defined in the BSON specification:
BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more "schema-less" than Protocol Buffers
On the application side, the MongoDB driver converts application objects to BSON, which supports more data types than JSON or JSONB, including datetime and binary. This BSON is sent and received over the network and stored and manipulated on the server as-is, with no extra parsing. Both the driver and the database can efficiently access fields via the binary structure because BSON includes metadata such as field length prefixes and explicit type information, even for large or nested documents.
PostgreSQL protocol is JSON (text), not JSONB
To illustrate this, I've written a small Python program that inserts a document into a PostgreSQL table with a JSONB column, and queries that table to retrieve the document:
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, sessionmaker
Base = declarative_base()
class Item(Base):
__tablename__ = 'items'
id = Column(Integer, primary_key=True)
data = Column(JSONB) # our JSONB column
# Connect to Postgres
engine = create_engine('postgresql+psycopg2://', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
# Create table
Base.metadata.create_all(engine)
# Insert an object into JSONB column
obj = {"name": "widget", "price": 9.99, "tags": ["new", "sale"]}
session.add(Item(data=obj))
session.commit()
# Read back the table
for row in session.query(Item).all():
print(row.id, row.data)
The program uses SQLAlchemy to send and retrieve Python objects to and from PostgreSQL via the Psycopg2 driver. I've stored it in demo.py.
When I run the program, with python demo.py, before it displays the final result, it logs all SQL statements:
2025-12-21 12:50:22,484 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2025-12-21 12:50:22,485 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine select current_schema()
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,487 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine [generated in 0.00015s] {'name': 'items'}
2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine
CREATE TABLE items (
id SERIAL NOT NULL,
data JSONB,
PRIMARY KEY (id)
)
2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine [no key 0.00011s] {}
2025-12-21 12:50:22,491 INFO sqlalchemy.engine.Engine COMMIT
2025-12-21 12:50:22,493 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id
2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine [generated in 0.00018s] {'data': '{"name": "widget", "price": 9.99, "tags": ["new", "sale"]}'}
2025-12-21 12:50:22,495 INFO sqlalchemy.engine.Engine COMMIT
2025-12-21 12:50:22,497 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data
FROM items
2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine [generated in 0.00013s] {}
1 {'name': 'widget', 'tags': ['new', 'sale'], 'price': 9.99}
To see what is sent and received through the network by the PostgreSQL protocol, I run the program with strace, showing the sendto and recv system calls with their arguments: strace -e trace=sendto,recvfrom -yy -s 1000 python demo.py.
Like most SQL database drivers, the protocol is basic: send SQL commands as text, and fetch a tabular result set. In the PostgreSQL protocol's messages, the first letter is the message type (Q for Simple Query Message, followed by the length of the message, and the message in text, X to terminate the session, C for command completion status, T abd D for the resultset).
Here is the output, the lines starting with the timestamp are the logs from SQL Alchemy, those starting with sendto and recv are the network system calls with the message to the database, and the result from it
Where is the trace when inserting one document:
2025-12-21 16:52:20,278 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id
2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine [generated in 0.00029s] {'data': '{"name": "widget", "price": 9.99, "tags": ["new", "sale"]}'}
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\nBEGIN\0", 11, MSG_NOSIGNAL, NULL, 0) = 11
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0vINSERT INTO items (data) VALUES ('{\"name\": \"widget\", \"price\": 9.99, \"tags\": [\"new\", \"sale\"]}') RETURNING items.id\0", 119, MSG_NOSIGNAL, NULL, 0) = 119
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "T\0\0\0\33\0\1id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0011C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 62
2025-12-21 16:52:20,281 INFO sqlalchemy.engine.Engine COMMIT
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\vCOMMIT\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 18
It started a transaction (Q\0\0\0\nBEGIN), received command completion (C\0\0\0\nBEGIN), then sent the full text of the INSERT command, including the JSON payload (Q\0\0\0vINSERT INTO items (data) VALUES ('{\"name\": \"widget\", \"price\": 9.99, \"tags\": [\"new\", \"sale\"]}). It subsequently received command completion (INSERT 0 1) and the returned ID (T\0\0\0\33\0\1id, D\0\0\0\v\0\1\0\0\0\001).
Here is the trace when I query and fetch the document:
2025-12-21 16:52:20,283 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data
FROM items
2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine [generated in 0.00024s] {}
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\nBEGIN\0", 11, MSG_NOSIGNAL, NULL, 0) = 11
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0FSELECT items.id AS items_id, items.data AS items_data \nFROM items\0", 71, MSG_NOSIGNAL, NULL, 0) = 71
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "T\0\0\0>\0\2items_id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0items_data\0\0\0@\310\0\2\0\0\16\332\377\377\377\377\377\377\0\0D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\"name\": \"widget\", \"tags\": [\"new\", \"sale\"], \"price\": 9.99}C\0\0\0\rSELECT 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 157
It started another transaction, sent the SELECT statement as text and received the result as JSON text (D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\"name\": \"widget\", \"tags\": [\"new\", \"sale\"], \"price\": 9.99}).
Finally, the transaction ends, and the sessionis disconnected:
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\rROLLBACK\0", 14, MSG_NOSIGNAL, NULL, 0) = 14
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 20
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) = 5
If you want to dig into the code, the server-side parsing is in jsonb_send and jsonb_recv ("The type is sent as text in binary mode"), and while it tests the version before converting to text, there's only one version. The client-side for Psycopg2 shows that register_default_jsonb is the same as register_default_json
Comparing with MongoDB
To compare with MongoDB, created the following demo-mongodb.py:
from pymongo import MongoClient
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.my_database
insert_result = db.items.insert_one({"name": "widget", "price": 9.99, "tags": ["new", "sale"]})
print("Inserted document ID:", insert_result.inserted_id)
for doc in items_collection.find():
print(doc["_id"], doc)
I used the same strace command, but displaying all characters as hexadecimal to be able to decode them with bsondump:
$ strace -e trace=sendto,recvfrom -xx -yy -s 1000 python demo-mongodb.py 2>&1
Here is the network request for the insert:
sendto(5<TCP:[127.0.0.1:44570->127.0.0.1:27017]>, "\xd6\x00\x00\x00\x51\xdc\xb0\x74\x00\x00\x00\x00\xdd\x07\x00\x00\x00\x00\x00\x00\x00\x5a\x00\x00\x00\x02\x69\x6e\x73\x65\x72\x74\x00\x06\x00\x00\x00\x69\x74\x65\x6d\x73\x00\x08\x6f\x72\x64\x65\x72\x65\x64\x00\x01\x03\x6c\x73\x69\x64\x00\x1e\x00\x00\x00\x05\x69\x64\x00\x10\x00\x00\x00\x04\x31\xb8\x9a\x81\xfd\x35\x42\x1a\x88\x44\xa8\x69\xe8\xba\x6f\x30\x00\x02\x24\x64\x62\x00\x0c\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x00\x00\x01\x66\x00\x00\x00\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x00\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00", 214, 0, NULL, 0) = 214
recvfrom(5<TCP:[127.0.0.1:44570->127.0.0.1:27017]>, "\x2d\x00\x00\x00\x06\x00\x00\x00\x51\xdc\xb0\x74\xdd\x07\x00\x00", 16, 0, NULL, NULL)<... (truncated)
December 20, 2025
IO-bound sysbench vs MySQL on a 48-core server
This has results for an IO-bound sysbench benchmark on a 48-core server for MySQL versions 5.6 through 9.5. Results from a CPU-bound sysbench benchmark on the 48-core server are here.
tl;dr
- the regressions here on read-only tests are smaller than on the CPU bound workload, but when they occur are from new CPU overheads
- the large improvements here on write-heavy tests are similar to the CPU bound workload
The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 5.6.51)
- the improvement for hot-points is similar to the CPU-bound results
- the regressions here for the IO-bound tests are smaller than for the CPU-bound results
- the regression in point-query is from new CPU overhead, see cpu/o here which is 1.37X larger in 9.5.0 vs 5.6.51
- the regression in points-covered-si is from new CPU overhead, see cpu/o here which is 1.24X larger in 9.5.0 vs 5.6.51. This test is CPU-bound, the queries don't do IO because the secondary indexes are cached.
- the regressions here for the IO-bound tests are smaller than for the CPU-bound results, except for the scan test
- the regressions in scan are from new CPU overhead, see cpu/o here, which is 1.38X larger in 9.5.0 vs 5.6.51
- the regressions here for the IO-bound tests are smaller than for the CPU-bound results
- the regressions in read-only-count are from new CPU overhead, see cpu/o here, which is 1.25X larger in 9.5.0 vs 5.6.51
- the improvements here for the IO-bound tests are similar to the CPU-bound results
- the largest improvement, for the update-index test, is from using less CPU, fewer context switches, less read IO and less write IO per operation -- see cpu/o, cs/o, rKB/o and wKB/o here
December 19, 2025
Atlas Search score details (the BM25 calculation)
With @james_blackwoodsewell_58 we were comparing the BM25 text search scores between MongoDB Atlas (Lucene), ElasticSearch (Lucene) and ParadeDB (using Tantivy) which provide the same ordering but MongoDB Atlas shows constantly a lower score by a factor of 2.2:
It was the occasion for me to look at the score details which gives the calculation details for the score.
Test case
I've built the same test case as in my previous blog:
db.articles.drop();
db.articles.deleteMany({});
db.articles.insertMany([
{ description : "🍏 🍌 🍊" }, // short, 1 🍏
{ description : "🍎 🍌 🍊" }, // short, 1 🍎
{ description : "🍎 🍌 🍊 🍎" }, // larger, 2 🍎
{ description : "🍎 🍌 🍊 🍊 🍊" }, // larger, 1 🍎
{ description : "🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰" }, // large, 1 🍎
{ description : "🍎 🍎 🍎 🍎 🍎 🍎" }, // large, 6 🍎
{ description : "🍎 🍌" }, // very short, 1 🍎
{ description : "🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎" }, // large, 1 🍎
{ description : "🍎 🍎 🍌 🍌 🍌" }, // shorter, 2 🍎
]);
db.articles.createSearchIndex("default",
{ mappings: { dynamic: true } }
);
Score with details
I ran the same query, adding scoreDetails: true to the search stage, and scoreDetails: { $meta: "searchScoreDetails" } } to the projection stage:
db.articles.aggregate([
{
$search: {
text: { query: ["🍎", "🍏"], path: "description" },
index: "default",
scoreDetails: true
}
},
{ $project: {
_id: 0, description: 1,
score: { $meta: "searchScore" },
scoreDetails: { $meta: "searchScoreDetails" } } },
{ $sort: { score: -1 } } ,
{ $limit: 1 }
])
Here is the result:
mdb> db.articles.aggregate([
... {
... $search: {
... text: { query: ["🍎", "🍏"], path: "description" },
... index: "default",
... scoreDetails: true
... }
... },
... { $project: { _id: 0, description: 1, score: { $meta: "searchScore" }, scoreDetails: { $meta: "searchScoreDetails" } } },
... { $sort: { score: -1 } } ,
... { $limit: 1 }
... ])
[
{
description: '🍏 🍌 🍊',
score: 1.0242118835449219,
scoreDetails: {
value: 1.0242118835449219,
description: 'sum of:',
details: [
{
value: 1.0242118835449219,
description: '$type:string/description:🍏 [BM25Similarity], result of:',
details: [
{
value: 1.0242118835449219,
description: 'score(freq=1.0), computed as boost * idf * tf from:',
details: [
{
value: 1.8971199989318848,
description: 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:',
details: [
{
value: 1,
description: 'n, number of documents containing term',
details: []
},
{
value: 9,
description: 'N, total number of documents with field',
details: []
}
]
},
{
value: 0.5398772954940796,
description: 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:',
details: [
{
value: 1,
description: 'freq, occurrences of term within document',
details: []
},
{
value: 1.2000000476837158,
description: 'k1, term saturation parameter',
details: []
},
{
value: 0.75,
description: 'b, length normalization parameter',
details: []
},
{
value: 3,
description: 'dl, length of field',
details: []
},
{
value: 4.888888835906982,
description: 'avgdl, average length of field',
details: []
}
]
}
]
}
]
}
]
}
}
]
So all is there. Here is the scoring breakdown for "🍏 🍌 🍊", which produced a score of 1.0242118835449219.
IDF calculation (inverse document frequency)
Search result:
- Number of documents containing the term:
n = 1 - Total number of documents with this field:
N = 9
idf = log(1 + (N - n + 0.5) / (n + 0.5))
= log(1 + (9 - 1 + 0.5) / (1 + 0.5))
= log(6.666666666666667)`
≈ 1.8971199989318848
TF calculation (term frequency)
Parameters are the Lucene defaults:
- Term saturation parameter:
k1 = 1.2000000476837158 - Length normalization parameter:
b = 0.75
Document field statistics:
- Average length of the field:
avgdl = 44 / 9 ≈ 4.888888835906982 - Occurrences of the term in this document:
freq = 1
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
= 1 / (1 + 1.2000000476837158 × (0.25 + 0.75 × (3 / 4.888888835906982)))
≈ 0.5398772954940796
Final score
Parameter:
- Boost:
1.0
score = boost × idf × tf
= 1.0 × 1.8971199989318848 × 0.5398772954940796
≈ 1.0242118835449219
That confirms that Atlas Search uses the same scoring as Lucene https://github.com/apache/lucene/blob/releases/lucene/10.3.2/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L183
What about ElasticSearch and Tantivy
Eight years ago, Lucene removed the (k1 + 1) factor in LUCENE-8563. For k1 = 1.2, this change reduces the score by a factor of 2.2 from that version onward. Tantivy and Elasticsearch apparently still use the old formula, while Atlas Search uses the updated one, which explains the observed differences in scoring.
Conclusion
MongoDB Atlas Search indexes use the same scoring mechanism as Lucene indexes. When you compare them with other text search engines that also use Lucene, you might notice a score difference by a factor of about 2.2. This doesn’t actually matter, because scores are used only for ordering results, and the relative ordering remains the same.
Text search scores can seem magical, but they are fully deterministic and based on open-source formulas. In MongoDB, you can include the score details option in a text search query to inspect all the parameters and formulas behind the score.
Improve Developer Velocity with Kubernetes Databases
LeaseGuard: Raft Leases Done Right!
Many distributed systems have a leader-based consensus protocol at their heart. The protocol elects one server as the "leader" who receives all writes. The other servers are "followers", hot standbys who replicate the leader’s data changes. Paxos and Raft are the most famous leader-based consensus protocols.
These protocols ensure consistent state machine replication, but reads are still tricky. Imagine a new leader L1 is elected, while the previous leader L0 thinks it's still in charge. A client might write to L1, then read stale data from L0, violating Read Your Writes. How can we prevent stale reads? The original Raft paper recommended that the leader communicate with a majority of followers before each read, to confirm it's the real leader. This guarantees Read Your Writes but it's slow and expensive.
A leader lease is an agreement among a majority of servers that one server will be the only leader for a certain time. This means the leader can run queries without communicating with the followers, and still ensure Read Your Writes. The original description of Raft included a lease protocol that was inherited from the earlier Paxos, where followers refuse to vote for a new leader until the old leader's lease expires. This entangles leases and elections, and it delays recovery after a crash. Besides, lease protocols have never been specified in detail, for either Raft or Paxos. For all these reasons, many Raft implementations don't use leases at all, or their leases are buggy.
In the MongoDB Distributed Systems Research Group, we designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash. Here’s a preprint of our SIGMOD'26 paper. This is a joint blog post by A. Jesse Jiryu Davis and Murat Demirbas, published on both of our blogs.
A huge simplification: the log is the lease
LeaseGuard makes leader failovers smoother and faster
| Transitions in the read/write availability of leaders with LeaseGuard. While the new leader waits for a lease, it can serve some consistent reads and accept writes. |
Tests and benchmarks
- Inconsistent: LogCabin running fast and loose, with no guarantee of Read Your Writes.
- Quorum: The default Read Your Writes mechanism, where the leader talks to a majority of followers before running each query, is miserably slow—notice that its Y axis is one tenth as high as the other charts!
- Lease: The “log is the lease” protocol with no optimizations. Its throughput is as high as “inconsistent”, but it has a long time to recovery after the old leader crashes.
- Defer commit: The log is the lease, plus our write optimization—you can see that write throughput spikes off the chart at 1500 ms, because the leader has been processing writes while waiting for the lease. As soon as it gets the lease, it commits all the writes at once.
- Inherit lease: LeaseGuard with all our optimizations. Read throughput recovers as soon as a new leader is elected, without waiting for the old lease to expire.
Conclusion
Sync Stripe data to your Supabase database in one click
December 18, 2025
Optimizing correlated subqueries in Amazon Aurora PostgreSQL
Introducing Percona Load Generator for MongoDB Clusters: The Benchmark Tool That Simulates Your Actual Application
December 17, 2025
Performance regressions in MySQL 8.4 and 9.x with sysbench
I have been claiming that I don't find significant performance regressions in MySQL 8.4 and 9.x when I use sysbench. I need to change that claim. There are regressions for write-heavy tests, they are larger for tests with more concurrency and larger when gtid support is enabled.
By gtid support is enabled I mean that these options are set to ON:
This blog post has results from the write-heavy tests with sysbench for MySQL 8.0, 8.4, 9.4 and 9.5 to explain my claims above.
tl;dr
- Regressions are most likely and larger on the insert test
- There are regressions for write-heavy workloads in MySQL 8.4 and 9.x
- Throughput is typically 15% less in MySQL 9.5 than in 8.0 for tests with 16 clients on the 24-core/2-socket srever
- Throughput is typically 5% less in MySQL 9.5 than 8.0 for tests with 40 clients on the 48-core server
- The regressions are larger when gtid_mode and enforce_gtid_consistency are set to ON
- Throughput is typically 5% to 10% less with the -gtid configs vs the -nogtid configs with 40 clients on the 48-core server. But this is less of an issue on other servers.
- There are significant increases in CPU, context switch rates and KB written to storage for the -gtid configs relative to the same MySQL version using the -nogtid configs
- Regressions might be larger for the insert and update-inlist tests because they have larger transactions relative to other write-heavy tests. Performance regressions are correlated with increases in CPU, context switches and KB written to storage per transaction.
- 8.0.44-nogtid
- MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =OFF
- 8.0.44-gtid
- MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =ON
- 8.4.7-notid
- MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =OFF
- 8.4.7-gtid
- MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =ON
- 9.4.0-nogtid
- MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =OFF
- 9.4.0-gtid
- MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =ON
- 9.5.0-nogtid
- MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =OFF
- 9.5.0-gtid
- MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =ON
- 8-core
- The server is an ASUS ExpertCenter PN53 with and AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.
- my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
- my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
- The benchmark is run with 1 thread, 1 table and 50M rows per table
- 24-core
- The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.
- my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
- my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
- The benchmark is run with 16 threads, 8 tables and 10M rows per table
- 48-core
- The server is ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled and 128G of RAM. Storage is 2 Intel D7-P5520 NVMe devices with RAID 1 (3.8T each) using ext4. The OS is Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic).
- my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
- my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
- The benchmark is run with 40 threads, 8 tables and 10M rows per table
The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds.
(QPS for some version) / (QPS for MySQL 8.0.44)
- For many tests there are small regressions from 8.0 to 8.4 and 8.4 to 9.x
- There are small improvements (~5%) for the -gtid configs vs the -nogtid result for update-index
- There is a small regression (~5%) for the -gtid configs vs the -nogtid result for insert
- There are small regression (~1%) for the -gtid configs vs the -nogtid result for other tests
- CPU per operation (cpu/o) increases by 1.10X with the -gtid config
- Context switches per operation (cs/o) increases by 1.45X with the -gtid config
- KB written to storage per commit (wKB/o) increases by 1.16X with the -gtid config
- CPU per operation (cpu/o) decreases by ~3% with the -gtid config
- Context switches per operation (cs/o) decrease by ~2% with the -gtid config
- KB written to storage per commit (wKB/o) decreases by ~3% with the -gtid config
- This result is odd. I might try to reproduce it in the future
- For many tests there are regressions from 8.0 to 8.4 and 8.4 to 9.x and throughput is typically 15% less in 9.5.0 than 8.0.44
- There are large regressions in 9.4 and 9.5 for update-inlist
- There is usually a small regression (~5%) for the -gtid configs vs the -nogtid result
- Throughput is 1.15X larger in 9.5.0-nogtid
- CPU per operation (cpu/o) is 1.15X larger in 9.5.0-gtid
- Context switches per operation (cs/o) are 1.23X larger in 9.5.0-gtid
- KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid
- The problems here look different than most other tests as the regressions in 9.4 and 9.5 are similar for the -gtid and -nogtid configs. If I have time I will get flamegraphs and PMP output. The server here has two sockets and can suffer more from false-sharing and real contention on cache lines.
- Throughput is 1.43X larger in 8.0.44-nogtid
- CPU per operation (cpu/o) is 1.05X larger in 8.0.44-nogtid
- Context switches per operation (cs/o) are 1.18X larger in 8.0.44-nogtid
- KB written to storage per commit (wKB/o) is ~1.12X larger in 9.5.0
- For many tests there are regressions from 8.0 to 8.4
- For some tests there are regressions from 8.4 to 9.x
- There is usually a large regression for the -gtid configs vs the -nogtid result and the worst case occurs on the insert test
- Throughput is 1.17X larger in 9.5.0-nogtid
- CPU per operation (cpu/o) is 1.13X larger in 9.5.0-gtid
- Context switches per operation (cs/o) are 1.26X larger in 9.5.0-gtid
- KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid
Improve Aurora PostgreSQL throughput by up to 165% and price-performance ratio by up to 120% using Optimized Reads on AWS Graviton4-based R8gd instances
Bridging the Gap: Querying PostgreSQL Tables from an Oracle Database
The $sql aggregation stage in Oracle Database to replace explain("executionStats")
As application-driven data models and document databases—made popular by MongoDB—continue to gain traction, Oracle Database has added MongoDB emulation capabilities on top of its SQL query engine. It's only the logical model and exposed API that resemble it, as physical documents are stored in relational tables and fixed-size blocks. This adds another abstraction on top of SQL, and when performance is different from what you expected, you need to look at the physical execution behind the logical query. The $sql stage of an aggregation pipeline can help troubleshoot. Let's take an example.
I create a one-million-document collection and index it with the ESR (Equality, Sort, Range) Guideline in mind:
// create the collection
db.oneMillion.drop();
db.oneMillion.createIndex({ e: 1, s: 1, r: 1 });
// insert documents
for (let i = 0; i < 1e2; i++) {
void db.oneMillion.insertMany( Array.from(
{ length: 1e4 },
(_, i) => ({ e: i%3, s: new Date(), r: Math.random(), x: UUID() })
) )
}
// check count
db.oneMillion.countDocuments();
I run a query that completes in 1 millisecond on MongoDB but takes seconds in the Oracle Database emulation. It’s a simple pagination query combining:
- An equality filter:
{ e: 2 }, which returns one-third of the collection - A range filter:
{ r: { $gt: 0.5 } }, which returns half of those documents - A sort with pagination:
sort({ s: 1, r: 1 }).limit(10), which returns the last ten documents by date and value
This query runs much faster on MongoDB than in the Oracle emulation. To get execution statistics, I add hint({"$native":"MONITOR"}) so that the underlying SQL query is run with the /*+ MONITOR */ hint:
db.oneMillion.find(
{ e:2, r: { $gt:0.5 } },
{ s:1, r:1, _id:0 }
).sort({ s:1, r:1 }).limit(10).hint({"$native":"MONITOR"});
The query is executed and returns the result:
[
{ s: ISODate('2025-12-17T08:52:50.475Z'), r: 0.5222276191239983 },
{ s: ISODate('2025-12-17T08:52:50.475Z'), r: 0.7565247894880116 },
{ s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.6099160713187135 },
{ s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.765542699487576 },
{ s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.8144790402364248 },
{ s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.8328191789951023 },
{ s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.8356551175440483 },
{ s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.9779607167502489 },
{ s: ISODate('2025-12-17T08:52:50.477Z'), r: 0.5236033088481526 },
{ s: ISODate('2025-12-17T08:52:50.477Z'), r: 0.5290926931399482 }
]
After running it, I can get the SQL Monitor report for the last query in my session by calling the dbms_sqltune function through the $sql aggregation stage:
db.aggregate([{ $sql : `select
dbms_sqltune.report_sql_monitor(report_level=>'all',type=>'text') as "text"
`}]).forEach(row => print(row.text))
The output shows the underlying SQL query generated by the emulation, with its execution plan and execution statistics:
SQL Monitoring Report
SQL Text
------------------------------
select /*+ FIRST_ROWS(10) MONITOR */ json_patch("DATA",:1 project ),rawtohex("RESID"),"ETAG" from "ORA"."oneMillion" where JSON_EXISTS("DATA",'$?( (@.e.numberOnly() == $B0) && (@.r.numberOnly() > $B1) )' passing :2 as "B0", :3 as "B1" type(strict)) order by JSON_QUERY("DATA", '$.s[*].min()') asc nulls first, JSON_QUERY("DATA", '$.r[*].min()') asc nulls first fetch next 10 rows only
Global Information
------------------------------
Status : DONE (ALL ROWS)
Instance ID : 4
Session : ORA (36619:4028)
SQL ID : 420n5y2ytx6zh
SQL Execution ID : 67108867
Execution Started : 12/17/2025 09:04:51
First Refresh Time : 12/17/2025 09:04:51
Last Refresh Time : 12/17/2025 09:04:52
Duration : 1s
Module/Action : ORDS_ADBS_Managed/-
Service : CQWRIAXKGYBKVNX_O23_low.adb.oraclecloud.com
Program : ORDS_ADBS_Managed
Fetch Calls : 1
Binds
========================================================================================================================
| Name | Position | Type | Value |
========================================================================================================================
| :2 | 2 | NUMBER | 2 |
| :3 | 3 | NUMBER | .5 |
========================================================================================================================
Global Stats
=================================================
| Elapsed | Cpu | Other | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls | Gets |
=================================================
| 1.18 | 1.18 | 0.00 | 1 | 98194 |
=================================================
SQL Plan Monitoring Details (Plan Hash Value=2462396944)
==============================================================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Mem | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (Max) | (%) | (# samples) |
==============================================================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +1 | 1 | 10 | . | | |
| 1 | COUNT STOPKEY | | | | 1 | +1 | 1 | 10 | . | | |
| 2 | VIEW | | 52 | 316 | 1 | +1 | 1 | 10 | . | | |
| 3 | SORT ORDER BY STOPKEY | | 52 | 316 | 2 | +1 | 1 | 10 | 4096 | 50.00 | Cpu (1) |
| 4 | FILTER | | | | 1 | +1 | 1 | 167K | . | | |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED | oneMillion | 52 | 38 | 1 | +1 | 1 | 167K | . | 50.00 | Cpu (1) |
| 6 | HASH UNIQUE | | 52 | | 1 | +1 | 1 | 167K | . | | |
| 7 | INDEX RANGE SCAN (MULTI VALUE) | $ora:oneMillion.e_1_s_1_r_1 | 38 | 27 | 1 | +1 | 1 | 167K | . | | |
==============================================================================================================================================================================
Returning those 10 rows used 1.18 seconds of CPU because it used the index only for the filters, returning 166,666 rows that had to be deduplicated (HASH UNIQUE), and sorted (SORT ORDER BY STOPKEY) before returning the result.
Oracle has powerful hints, and you can use them with the hint({"$native": }) (not to be confused with hint({"$natural":1})) of MongoDB. For example, I can try to avoid this HASH UNIQUE that doesn't preserve the ordering from the index:
db.oneMillion.find(
{ e:2, r: { $gt:0.5 } },
{ s:1, r:1, _id:0 }
).sort({ s:1, r:1 }).limit(10).hint(
{"$native":'NO_USE_HASH_AGGREGATION MONITOR'}
);
It uses a SORT UNIQUE but still doesn't preserve the index ordering because the deduplication is on the ROWID, so finally it's just an additional sort:
| 3 | SORT ORDER BY STOPKEY | | 1 | 3 | 1 | +1 | 1 | 10 | 4096 | | |
| 4 | FILTER | | | | 1 | +1 | 1 | 167K | . | | |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED | oneMillion | 1 | 2 | 1 | +1 | 1 | 167K | . | 100.00 | Cpu (1) |
| 6 | SORT UNIQUE | | 1 | | 1 | +1 | 1 | 167K | 10MB | | |
| 7 | INDEX RANGE SCAN (MULTI VALUE) | $ora:oneMillion.e_1_s_1_r_1 | 1 | 2 | 1 | +1 | 1 | 167K | . | | |
If you don't have the license for all options (Enterprise Edition, Diagnostic Pack, and Tuning Pack), don't use SQL Monitor. You can still view the execution plan with DBMS_XPLAN. To obtain execution statistics, use the GATHER_PLAN_STATISTICS hint:
db.oneMillion.find(
{ e:2, r: { $gt:0.5 } },
{ s:1, r:1, _id:0 }
).sort({ s:1, r:1 }).limit(10).hint({"$native":"GATHER_PLAN_STATISTICS"});
The query to get all execution plan sections is:
db.aggregate( [ { $sql : `
select *
from dbms_xplan.display_cursor(null,null,'ADVANCED ALLSTATS LAST')
` } ] ).forEach(row => print(row.PLAN_TABLE_OUTPUT));
SQL_ID 66jyw5hxfx4zh, child number 0
-------------------------------------
select /*+ FIRST_ROWS(10) GATHER_PLAN_STATISTICS */
json_patch("DATA",:1 project ),rawtohex("RESID"),"ETAG" from
"ORA"."oneMillion" where JSON_EXISTS("DATA",'$?( (@.e.numberOnly() ==
$B0) && (@.r.numberOnly() > $B1) )' passing :2 as "B0", :3 as "B1"
type(strict)) order by JSON_QUERY("DATA", '$.s[*].min()') asc nulls
first, JSON_QUERY("DATA", '$.r[*].min()') asc nulls first fetch next 10
rows only
Plan hash value: 2462396944
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 3 (100)| | 10 |00:00:00.98 | 98194 | | | |
|* 1 | COUNT STOPKEY | | 1 | | | | | 10 |00:00:00.98 | 98194 | | | |
| 2 | VIEW | | 1 | 1 | 18314 | 3 (34)| 00:00:01 | 10 |00:00:00.98 | 98194 | | | |
|* 3 | SORT ORDER BY STOPKEY | | 1 | 1 | 179 | 3 (34)| 00:00:01 | 10 |00:00:00.98 | 98194 | 4096 | 4096 | 4096 (0)|
|* 4 | FILTER | | 1 | | | | | 166K|00:00:00.32 | 98194 | | | |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| oneMillion | 1 | 1 | 179 | 2 (0)| 00:00:01 | 166K|00:00:00.30 | 98194 | | | |
| 6 | HASH UNIQUE | | 1 | 1 | 179 | | | 166K|00:00:00.11 | 2048 | 772K| 772K| |
|* 7 | INDEX RANGE SCAN (MULTI VALUE) | $ora:oneMillion.e_1_s_1_r_1 | 1 | 1 | | 2 (0)| 00:00:01 | 166K|00:00:00.08 | 2048 | | | |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$2
2 - SEL$1 / "from$_subquery$_002"@"SEL$2"
3 - SEL$1
5 - SEL$1 / "oneMillion"@"SEL$1"
7 - SEL$1 / "oneMillion"@"SEL$1"
Outline Data
-------------
/*+
BEGIN_OUTLINE_DATA
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('23.1.0')
DB_VERSION('23.1.0')
OPT_PARAM('_fix_control' '20648883:0 26552730:1 27175987:0 29972495:0 22387320:0 30195773:0 31945701:1 32108311:1 33659818:3 34092979:1 35495824:1 33792497:1 36554842:1 36283175:1
31720959:1 36004220:1 36635255:1 36675198:1 36868551:1 37400112:1 37346200:0 37626161:1')
FIRST_ROWS(10)
FORCE_XML_QUERY_REWRITE
FORCE_JSON_TABLE_TRANSFORM
XML_DML_RWT_STMT
XMLINDEX_REWRITE
XMLINDEX_REWRITE_IN_SELECT
NO_COST_XML_QUERY_REWRITE
OUTLINE_LEAF(@"SEL$1")
OUTLINE_LEAF(@"SEL$2")
NO_ACCESS(@"SEL$2" "from$_subquery$_002"@"SEL$2")
INDEX_RS_ASC(@"SEL$1" "oneMillion"@"SEL$1" "$ora:oneMillion.e_1_s_1_r_1")
BATCH_TABLE_ACCESS_BY_ROWID(@"SEL$1" "oneMillion"@"SEL$1")
USE_HASH_AGGREGATION(@"SEL$1" UNIQUE)
END_OUTLINE_DATA
*/
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=10)
3 - filter(ROWNUM<=10)
4 - filter(HEXTORAW('04')>SYS_CONS_ANY_SCALAR(:3, 3))
7 - access("oneMillion"."SYS_NC00005$"=SYS_CONS_ANY_SCALAR(:2, 3) AND "oneMillion"."SYS_NC00007$">SYS_CONS_ANY_SCALAR(:3, 3) AND "oneMillion"."SYS_NC00007$"<HEXTORAW('04'))
filter(("oneMillion"."SYS_NC00007$"<HEXTORAW('04') AND "oneMillion"."SYS_NC00007$">SYS_CONS_ANY_SCALAR(:3, 3)))
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - "from$_subquery$_002"."JSON_PATCH("DATA",:1PROJECT)"[JSON,32600], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]
2 - "from$_subquery$_002"."JSON_PATCH("DATA",:1PROJECT)"[JSON,32600], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]
3 - (#keys=2) JSON_VALUE( /*+ QJSNMD5_TC_JCMP_JV */ JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.s[*].min()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR
TYPE(LAX) ) FORMAT OSON , '$' RETURNING ANY ORA_RAWCOMPARE(32767) ERROR ON ERROR TYPE(LAX) )[32767], JSON_VALUE( /*+ QJSNMD5_TC_JCMP_JV */ JSON_QUERY("DATA" /*+ LOB_BY_VALUE */
FORMAT OSON , '$.r[*].min()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR TYPE(LAX
Building ChatGPT Apps with Supabase Edge Functions and mcp-use
Learn how to build a ChatGPT app that connects to your Supabase database using mcp-use and Edge Functions. Create interactive widgets for schema exploration, data viewing, and SQL queries.
Build a Real-Time E-Commerce Analytics API from Kafka in 15 Minutes
Learn how to build a production-ready real-time analytics API from Kafka streams, including data enrichment with dimension tables and PostgreSQL, materialized views, and multiple API endpoints.
Postgres 18 is now available
Postgres 18 is now available on PlanetScale
December 16, 2025
Managed Database vs. Kubernetes: Taking Back Control of Your Cloud Costs and Agility
Using MotherDuck with PlanetScale
How to Achieve Real-Time Exactly-Once Ingestion from Kafka to ClickHouse
December 15, 2025
Many-to-One: Stronger Relationship Design with MongoDB
In a relational database, a one-to-many relationship is typically implemented with two tables. The "one" side is the parent and has a primary key that is indexed to guarantee uniqueness. The "many" side is the child and references the parent’s primary key with a foreign key. An index is often added to the foreign key as well, so that operations on the parent can efficiently locate its child rows.
While this design seems straightforward and ideal from a database administrator’s point of view—where data integrity is the primary concern—it can surprise developers. This normalized data model does not account for access patterns or cardinalities: the same structure is used whether the "many" side contains millions of rows and keeps growing or only a few items.
MongoDB takes a different approach: its data model is optimized for a specific application, based on known access patterns and cardinalities. In a document model, a one-to-many relationship can be implemented either as multiple documents linked by references or as an embedded array or subdocument within a single document. In both cases, you can choose whether to embed or reference from the parent (the "one" side) or from the child (the "many" side).
An example
I use an HR Dataset with employees that I load in an "employees" collection. It has two million documents:
cd /var/tmp
## Download HR_Dataset
from Kaggle (https://www.kaggle.com/datasets/kadirduran/hr-dataset)
## Unzip and import
curl -L -o hr-data-mnc.zip https://www.kaggle.com/api/v1/datasets/download/rohitgrewal/hr-data-mnc &&
unzip -o hr-data-mnc.zip &&
mongoimport -d mdb -c employees --type=csv --headerline --drop 'HR_Data_MNC_Data Science Lovers.csv'
Once imported, I connect with mongosh and update the performance rating to add random decimal digits, to recognize them better when comparing the results:
use mdb;
db.employees.updateMany( {}, [
{
$set: {
Performance_Rating: {
$add: [ { $toDouble: "$Performance_Rating" }, { $rand: {} } ]
}
}
}
]);
This collection contains employees associated with a department name. I’ll add additional details about each department, such as a description, and then explore alternative models for this one-to-many relationship, where one department has many employees and each employee belongs to one department.
For each model, I'll look at the performance of the following query:
Let's identify the top 10 most outstanding active employees in the IT department and list their names along with their performance ratings.
Embed in the "many" side
The many-to-one relationship here is employee-to-department, as each employee has a department name:
{
_id: ObjectId('693f1d61e235ef0960ae2b53'),
'Unnamed: 0': 22,
Employee_ID: 'EMP0000023',
Full_Name: 'James Valdez',
Department: 'R&D',
Job_Title: 'Research Scientist',
Hire_Date: '2017-10-30',
Location: 'East Scott, Mauritius',
Performance_Rating: 5.882713091486423,
Experience_Years: 7,
Status: 'Retired',
Work_Mode: 'On-site',
Salary_INR: 789283
}
As we generally embed more than one field, I add a description and structure it as a sub-object:
db.employees.aggregate([
{
$addFields: {
Department: {
$switch: {
branches: [
{ case: { $eq: ["$Department", "Finance"] }, then: { Name: "Finance", Description: "Manages the company’s budgets, expenses, and financial planning to ensure fiscal health." } },
{ case: { $eq: ["$Department", "HR"] }, then: { Name: "HR", Description: "Handles recruitment, employee relations, and organizational development initiatives." } },
{ case: { $eq: ["$Department", "IT"] }, then: { Name: "IT", Description: "Maintains technology infrastructure, software systems, and cybersecurity protections." } },
{ case: { $eq: ["$Department", "Marketing"] }, then: { Name: "Marketing", Description: "Promotes the company’s products and services through strategic campaigns and market research." } },
{ case: { $eq: ["$Department", "Operations"] }, then: { Name: "Operations", Description: "Oversees daily business activities, logistics, and process optimization for efficiency." } },
{ case: { $eq: ["$Department", "R&D"] }, then: { Name: "R&D", Description: "Researches and develops innovative products and services to support future growth." } },
{ case: { $eq: ["$Department", "Sales"] }, then: { Name: "Sales", Description: "Builds customer relationships and drives revenue through product and service sales." } }
],
default: { Name: "$Department", Description: "No description available" }
}
}
}
},
{
$merge: {
into: "employees", // same collection
whenMatched: "merge", // update existing docs
whenNotMatched: "fail"
}
}
])
The result for the same employee is:
{
_id: ObjectId('693f1d61e235ef0960ae2b53'),
'Unnamed: 0': 22,
Employee_ID: 'EMP0000023',
Full_Name: 'James Valdez',
Department: {
Name: 'R&D',
Description: 'Researches and develops innovative products and services to support future growth.'
},
Job_Title: 'Research Scientist',
Hire_Date: '2017-10-30',
Location: 'East Scott, Mauritius',
Performance_Rating: 5.882713091486423,
Experience_Years: 7,
Status: 'Retired',
Work_Mode: 'On-site',
Salary_INR: 789283
}
I want to retrieve the top 10 best-performing active employees from the IT department and display their names and performance ratings:
db.employees.find(
{ "Status": "Active", "Department.Name": "IT" },
{ _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10)
// result:
[
{ Full_Name: 'Stuart Lopez', Performance_Rating: 5.999973276392604 },
{ Full_Name: 'Mr. Ethan Morton', Performance_Rating: 5.9999561502903065 },
{ Full_Name: 'Lee White', Performance_Rating: 5.999935393136708 },
{ Full_Name: 'Amber Coleman', Performance_Rating: 5.999919949194189 },
{ Full_Name: 'Eugene Brown', Performance_Rating: 5.999917240114123 },
{ Full_Name: 'Nicole Edwards', Performance_Rating: 5.999914413630196 },
{ Full_Name: 'Erika Stewart', Performance_Rating: 5.999902351452448 },
{ Full_Name: 'Jenna King', Performance_Rating: 5.999896490219257 },
{ Full_Name: 'Douglas Hill', Performance_Rating: 5.999886177014563 },
{ Full_Name: 'Richard Gonzalez', Performance_Rating: 5.999879794558417 }
]
Since I have no index, it reads all documents, which takes 1.3 seconds:
x=db.employees.find(
{ "Status": "Active", "Department.Name": "IT" },
{ _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats");
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 10,
executionTimeMillis: 1367,
totalKeysExamined: 0,
totalDocsExamined: 2000000
}
One benefit of embedding on the "many" side is that you can use all fields to create a compound index. For instance, I can build an index that supports my filter, sort, and projection needs:
db.employees.createIndex({
"Status":1, // for equality predicate on employee
"Department.Name":1, // for equality predicate on department
"Performance_Rating": 1, // for sort and limit (pagination)
"Full_Name": 1, // for projection (covering index)
})
The query now instantly retrieves the top 10 documents from the index:
x=db.employees.find(
{ "Status": "Active", "Department.Name": "IT" },
{ _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats")
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 10,
executionTimeMillis: 0,
totalKeysExamined: 10,
totalDocsExamined: 0
}
Embedding data on the "many" side helps create optimized indexes and improves response times but involves duplicating data from the "one" side into the "many" side. In our example, each employee records the department name and description. This leads to two main effects:
- Increased storage usage, which can be reduced through compression. You might also opt not to embed all fields—such as storing only the department name, which is often queried with the employee, and keeping the description in a separate "departments" collection.
- Any update to the department information must be reflected across all associated employee records. This is generally manageable for infrequent changes, like name updates. Often, a department name change coincides with broader reorganizations, requiring employee record updates anyway.
Reference and lookup from the "many" side
To minimize duplication, I create a separate "departments" collection using the unique department names and descriptions I embedded, ensuring each department's information is stored only once:
db.employees.aggregate([
{
$group: {
_id: "$Department.Name",
Name: { $first: "$Department.Name" },
Description: { $first: "$Department.Description" }
}
},
{
$project: { _id: 0, Name: 1, Description: 1 }
},
{
$merge: {
into: "departments",
whenMatched: "keepExisting",
whenNotMatched: "insert"
}
}
]);
You might be surprised by the speed of this aggregation pipeline. Instead of scanning all documents, MongoDB efficiently retrieved the unique departments by searching for distinct values in the index (a loose index scan).
Then, I can substitute the "Department" sub-object with a reference to the "_id" from the "departments" collection:
db.employees.aggregate([
{
$lookup: {
from: "departments",
localField: "Department.Name",
foreignField: "Name",
as: "deptInfo"
}
},
{
$addFields: {
Department: { $arrayElemAt: ["$deptInfo._id", 0] }
}
},
{ $project: { deptInfo: 0 } },
{
$merge: {
into: "employees",
whenMatched: "merge",
whenNotMatched: "fail"
}
}
]);
Here is the shape of an employee document with a single field for the department:
{
_id: ObjectId('693f1d61e235ef0960ae2b53'),
'Unnamed: 0': 22,
Employee_ID: 'EMP0000023',
Full_Name: 'James Valdez',
Department: ObjectId('693f2e38c2dd5ab4fbfd73b8'),
Job_Title: 'Research Scientist',
Hire_Date: '2017-10-30',
Location: 'East Scott, Mauritius',
Performance_Rating: 5.882713091486423,
Experience_Years: 7,
Status: 'Retired',
Work_Mode: 'On-site',
Salary_INR: 789283
}
To find the top 10 highest-performing active employees in the IT department and display their names and ratings, I will join the department's collection using $lookup. Since $lookup incurs a cost, it's more efficient to filter the data beforehand. Therefore, I first apply $match to filter employees by status, then perform the $lookup with the "departments" collection from the reference, and filter more with the department name fetched from the foreign collection:
x=db.employees.aggregate([
{ $match: { Status: "Active" } },
{
$lookup: {
from: "departments",
localField: "Department",
foreignField: "_id",
as: "deptInfo"
}
},
{ $unwind: "$deptInfo" },
{ $match: { "deptInfo.Name": "IT" } },
{
$project: {
_id: 0,
Full_Name: 1,
Performance_Rating: 1
}
},
{ $sort: { Performance_Rating: -1 } },
{ $limit: 10 }
]).explain("executionStats")
print(x.stages[1])
// Execution plan for the lookup stage:
{
'$lookup': {
from: 'departments',
as: 'deptInfo',
localField: 'Department',
foreignField: '_id',
let: {},
pipeline: [
{ '$match': { Name: { '$eq': 'IT' } } }
],
unwinding: { preserveNullAndEmptyArrays: false }
},
totalDocsExamined: Long('1401558'),
totalKeysExamined: Long('1401558'),
collectionScans: Long('0'),
indexesUsed: [ '_id_' ],
nReturned: Long('421333'),
executionTimeMillisEstimate: Long('94596')
}
I printed the statistics for the $lookup stage, which read 1,401,558 documents—one per active employee. Each access was quick, thanks to the index on "_id," but executing it a million times took over a minute. It returned only 421,333 documents because the department name filter was pushed down by the query planner from the subsequent $unwind and $match stages into the $lookup pipeline. The main issue remains reading a million identical documents.
If you have a many-to-one relationship and numerous documents on the many side, it's better to join from the application rather than using an aggregation pipeline lookup.
I retrieved the "_id" of the departments I am interested in:
var itDeptId = db.departments.findOne({ Name: "IT" })._id;
If the department names are unlikely to change, this can be run once and stored in the application's cache.
Then I can access employees filtered by the reference '_id', so I will create an index for it:
db.employees.createIndex(
{ Status: 1, Department: 1, Performance_Rating: -1, Full_Name: 1 }
)
Since I'm querying just a single collection, I don't require an aggregation pipeline:
x=db.employees.find(
{
Status: "Active",
Department: itDeptId
},
{
_id: 0,
Full_Name: 1,
Performance_Rating: 1
}
).sort({ Performance_Rating: -1 }).limit(10).explain("executionStats");
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 10,
executionTimeMillis: 0,
totalKeysExamined: 10,
totalDocsExamined: 0
}
Finally, when the application accesses the lookup table first, I observe the same performance with a many-to-one reference as with embedding the single item into the many.
Reference from the "many" side, but lookup from the "one" side
I can perform the same operation as above in the application, using an aggregation pipeline that starts with the departments and includes employees via a lookup. I use a lookup pipeline to add the filter for active employees:
x=db.departments.aggregate([
{
$match: { Name: "IT" }
},
{
$lookup: {
from: "employees",
let: { deptId: "$_id" },
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ["$Department", "$$deptId"] },
{ $eq: ["$Status", "Active"] }
]
}
}
},
{
$sort: { Performance_Rating: -1 }
},
{
$limit: 10
},
{
$project: {
_id: 0,
Full_Name: 1,
Performance_Rating: 1
}
}
],
as: "employees"
}
},
{
$project: {
_id: 0,
Name: 1,
employees: 1
}
}
]).explain("executionStats")
print(x.stages[1])
// Execution plan for the lookup stage:
{
'$lookup': {
from: 'employees',
as: 'employees',
let: { deptId: '$_id' },
pipeline: [
{
'$match': { '$expr': { '$and': [ [Object], [Object] ] } <... (truncated)
TLA+ modeling tips
Model minimalistically
Start from a tiny core, and always keep a working model as you extend. Your default should be omission. Add a component only when you can explain why leaving it out would not work. Most models are about a slice of behavior, not the whole system in full glory: E.g., Leader election, repair, reconfiguration. Cut entire layers and components if they do not affect that slice. Abstraction is the art of knowing what to cut. Deleting should spark joy.
Model specification, not implementation
Write declaratively. State what must hold, not how it is achieved. If your spec mirrors control flow, loops, or helper functions, you are simulating code. Cut it out. Every variable must earn its keep. Extra variables multiply the state space (model checking time) and hide bugs. Ask yourself repeatedly: can I derive this instead of storing it? For example, you do not need to maintain a WholeSet variable if you can define it as a state function of existing variables: WholeSet == provisionalItems \union nonProvisionalItems.
Review the model for illegal knowledge
Do a full read-through of your model and check what each process can really see. TLA+ makes it easy to read global state (or another process's state) that no real distributed process could ever observe atomically. This is one of the most common modeling errors. Make a dedicated pass to eliminate illegal global knowledge.
Check atomicity granularity
Push actions to be as fine-grained as correctness allows. Overly large atomic actions hide races and invalidate concurrency arguments. Fine-grained actions expose the real interleavings your protocol must tolerate.
Think in guarded commands, not procedures
Each action should express one logical step in guarded-command style. The guard should ideally define the meaning of the action. Put all enablement conditions in the guard. If the guard holds, the action may fire at any time in true event-driven style. This is why I now prefer writing TLA+ directly over PlusCal: TLA+ forces you to think in guarded-command actions, which is how distributed algorithms are meant to be designed. Yes, PlusCal is easier for developers to read, but it also nudges you toward sequential implementation-shaped thinking. And recently, with tools like Spectacle, sharing and visually exploring TLA+ specs got much easier.
Step back and ask what you forgot to model
There is no substitute for thinking hard about your system. TLA+ modeling is only there to help you think hard about your system, and cannot substitute thinking about it. Check that you incorporated all relevant aspects: failures, message reordering, repair, reconfiguration.
Write TypeOK invariants
TLA+ is not typed, so you should state types explicitly and early by writing TypeOK invariants. A good TypeOK invariant provides an executable documentation for your model. Writing this in seconds can save you many minutes of hunting runtime bugs through TLA+ counterexample logs.
Write as many invariants as you can
If a property matters, make it explicit as an invariant. Write them early. Expand them over time. Try to keep your invariants as tight as possible. Document your learnings about invariants and non-invariants. A TLA+ spec is a communication artifact. Write it for readers, not for the TLC model checker. Be explicit and boring for the sake of clarity.
Write progress properties
Safety invariants alone are not enough. Check that things eventually happen: requests complete, leaders emerge, and goals accomplished. Many "correct" models may quietly do nothing forever. Checking progress properties catch paths that stall.
Be suspicious of success
A successful TLC run proves nothing unless the model explores meaningful behavior. Low coverage or tiny state spaces usually mean the model is over-constrained or wrong. Break the spec on purpose to check that your spec is actually doing some real work, and not giving up in a vacuous/trivial way. Inject bugs on purpose. If your invariants do not fail, they are too weak. Test the spec by sabotaging it.
Optimize model checking efficiency last
Separate the model from the model checker. The spec should stand on its own. Using the cfg file, you can optimize for model checking by using appropriate configuration, constraints, bounds for counters, and symmetry terms.
You can find many examples and walkthroughs of TLA+ specifications on my blog.
There are many more in the TLA+ repo as well.
The Future of Radiation Safety is Lies, I Guess
Here’s a page from AEG Test (archive), a company which sells radiation detectors, talking about the safety of uranium glass. Right from the get-go it feels like LLM slop. “As a passionate collector of uranium glass,” the unattributed author begins, “I’ve often been asked: ‘Does handling these glowing antiques pose a health risk?’” It continues into SEO-friendly short paragraphs, each with a big header and bullet points. Here’s one:
Uranium glass emits low levels of alpha and beta radiation, detectable with a Geiger counter. However, most pieces register less than 10 microsieverts per hour (μSv/h), which is:
- Far below the 1,000 μSv annual limit recommended for public exposure.
- Comparable to natural background radiation from rocks, soil, or even bananas (which contain potassium-40, a mildly radioactive isotope).
First, uranium glass emits gamma rays too, not just alpha and beta particles. More importantly, these numbers are hot nonsense.
First, the Sievert is a measure of dose, not source intensity; saying a piece emits 10 µSv/hour is like saying a microwave oven emits sixty degrees of warming per minute. It depends on whether the food is in the microwave or outside it, how much food there is, whether it was shielded, and so on. The dose from a uranium glass cup depends on whether you’re chipping off bits and eating them, cuddling it every night, or keeping it in a display case.
10 μSv/hour is 87,600 μSv/year. How is that “far below” 1,000 μSv/year? If you’ve got a uranium glass candy dish on your desk that delivers 10 µSv/hour to your body, and you keep that up for eight hours a day, you’re looking at 29,200 µSv (29.2 mSv) per year. That’s over the DHS emergency guidelines for public relocation, and about half of the NRC dose limit for radiation workers.
The other comparisons are also bonkers: 10 μSv/hour is not comparable to typical background radiation: in the US, that’s roughly 3,100 μSv/year, or 0.35 μSv/hour. Nor is it on par with a banana: the banana equivalent dose is very roughly 0.1 μSv. Nobody is eating 100 bananas an hour.
The best source I know of for uranium glass safety (and which, now that we’re drowning in LLM slop, is surprisingly hard to actually track down) is the Nuclear Regulatory Commission’s NUREG-1717. The section on glassware begins on page 499 of the PDF, labeled 3-217. You should read their methods for estimating dosages, as exposure is highly dependent on uranium density, exposure vector, acidity, distance, etc. The NRC estimated a negligible 1.8 × 10-5 mSv/year (0.018 μSv/year) from drinking glass leachate, up to 0.02 mSv/year (20 μSv/year) from external exposure to drinking glasses (e.g. washing dishes, being in the same room, etc.), and 0.002 mSv/year (2 μSv/year) from occasional handling, admiring, and generally being around four pieces of decorative glassware scattered through a home. These exposures are almost certainly fine.
Please stop asking large language models to tell you or others about radiation safety. Ask a physicist or regulator instead.