Databases in 2025: A Year in Review
The world tried to kill Andy off but he had to stay alive to to talk about what happened with databases in 2025.
The world tried to kill Andy off but he had to stay alive to to talk about what happened with databases in 2025.
The patent expired for US7680791B2. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm, it is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.
The patent is for a new in-memory sort algorithm that needs a name. Features include:
How it came to be
From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.
While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.
So I thought about this while falling to sleep and after many nights realized that with a divide and conquer sort, as the algorithm descends into subpartitions of the data, that the common prefixes of the keys in each subpartition were likely to grow:
Hopefully this saves someone else an hour of digging. Philips Hue has a comparison page for their strip lights. This table says that for the Ambiance Gradient lightstrips, “Cut pieces can be reconnected”. Their “Can you cut LED strip lights” page also says “Many strip lights are cuttable, and some even allow the cut parts to be reused and added to the strip light using a connector. Lightstrip V4 and many of the latest models will enable this level of customization.”
Lightstrip V4 has been out for six years, but at least as far as Hue’s official product line goes, the statement “Cut pieces can be reconnected” is not true. Hue’s support representative told me that Hue might someday release a connector which allows reconnecting cut pieces. However, that product does not exist yet, and they can’t say when it might be released.
In the meantime, there’s a company called Litcessory which makes third-party adapters for Hue strips. I haven’t tried them, but I think they might do the trick.
SELECT ... FOR UPDATE SKIP LOCKED is a vendor-specific SQL feature available in several relational databases (e.g., PostgreSQL, Oracle, MySQL). It helps parallel workers avoid waiting on locked rows. MongoDB’s concurrency model uses optimistic concurrency: reads don’t block writes, and writes don’t block reads. To coordinate parallel processing, you can reserve a document by writing a lock field so other workers skip it.
I'll use an example discussed in the Reddit question "ACID read then write – Python":
Client in python, multi process. Each process picks and reads one document, calls some public APIs, and add data to the document and save it. Then next document. What is written can depend on the read data. Question is: in python, how can I create and configure transactions in the code to make sure no other process can read or write its current document from the moment a process starts reading it until done writing its additional data? This means concurrent reads should not happen…
In this example, I'll process messages based on their originating IP address. Multiple threads will enrich them with location data fetched from the public API at https://ip-api.com/.
Here is an example of an initial document:
{
_id: ObjectId('6956e772baea71e37a818e73'),
originatingIp: '1.1.1.1',
location: null
}
Here is the document while it is being processed:
{
_id: ObjectId('6956e772baea71e37a818e73'),
originatingIp: '1.1.1.1',
location: null,
lock: {
by: 'franck',
until: datetime.datetime(2026, 1, 1, 22, 33, 10, 833000)
}
}
Here is the same document after processing:
{
_id: ObjectId('6956e772baea71e37a818e73'),
originatingIp: '1.1.1.1',
location: {
status: 'success',
country: 'Hong Kong',
countryCode: 'HK',
region: 'HCW',
regionName: 'Central and Western District',
city: 'Hong Kong',
zip: '',
lat: 22.3193,
lon: 114.1693,
timezone: 'Asia/Hong_Kong',
isp: 'Cloudflare, Inc',
org: 'APNIC and Cloudflare DNS Resolver project',
as: 'AS13335 Cloudflare, Inc.',
query: '1.1.1.1'
}
}
Storing in-process information avoids long transactions that hide the current status and make troubleshooting difficult when the public API is slow.
This script is designed as a complete, runnable demonstration of how to implement SELECT ... FOR UPDATE SKIP LOCKED-style parallel job claiming in MongoDB. The script will generate everything it needs, process it, and show the end state.
insert_test_docs() inserts test data with random IP addresses in a new collection "message", and creates a partial index to get the message to process ({location: null}).claim_document() updates a message to process, adding lock information so that another thread will not pick the same, and fetches the document. The criteria are that it must be processed ({location: null}) and not locked, or the lock must have expired (with a 1s grace to account for clock skew).fetch_location() is the call to the public API, here getting location information for an IP address.process_document() calls claim_document() to get a message to process, with a lock. It calls fetch_location() and updates the document with the location. It ensures the lock is still in place before the update, then unsets it. Each thread runs in a loop, claiming and processing documents until the timeout.main() calls those in sequence and displays the final documents.This solution avoids explicit transactions, which is preferable because they would include a call to a public API with unpredictable response time. It also avoids using findOneAndUpdate, whose higher overhead comes from storing full pre- and post-images of documents for retryable operations. For large documents—possible in real workloads, even if not shown in this demo—this would lead to significant write amplification. Finally, setting an expiration timestamp allows automatic re-processing if a message fails.
Below is the complete Python program, which you can test using different numbers of documents and threads:
import os
import random
import socket
import threading
import time
from datetime import datetime, timedelta
import requests
from pymongo import MongoClient
# Mongo connection and collection
client = MongoClient("mongodb://127.0.0.1:27017/?directConnection=true")
db = client.test
messages = db.message
# Test settings (the test inserts documents, then runs the processing threads for some duration)
DOCUMENTS = 10 # number of documents created initially
THREADS = 5 # number of threads that loop to claim a document
SECONDS = 15 # thread stops looping on claim
# Worker identity (to identify the thread, and set an expiration on the lock)
WORKER_ID = f"{socket.gethostname()}-{os.getpid()}"
LOCK_DURATION = timedelta(seconds=60) # assumes processing completes within that duration, if not, it will be claimed by another, and this one will not update it
# Get the time
def utcnow(): return datetime.utcnow()
MAX_CLOCK_SKEW=timedelta(seconds=1) # used as a grace period when lock is expired
# --- Prepare test messages (with random generated IP) ---
def insert_test_docs():
# Drop the collection completely (removes data + indexes)
messages.drop()
# Create the partial index for unprocessed docs (they have location = null )
messages.create_index( [("lock.until", 1)], partialFilterExpression={"location": None} )
# Generate random IPs for the test
ips = [
".".join(str(random.randint(1, 255)) for _ in range(4))
for _ in range(DOCUMENTS)
]
# Explicitly set location=None to match the partial index filter
docs = [
{ "originatingIp": ip, "location": None } # A null location is the marker to process it
for ip in ips
]
messages.insert_many(docs)
print(f"[STARTUP] Inserted {DOCUMENTS} test docs into 'message'")
for doc in messages.find({}, {"_id": 0, "originatingIp": 1, "location": 1}):
print(doc)
# --- Claim a message ---
def claim_document():
now = utcnow()
lock_expiry = now + LOCK_DURATION
token = random.randint(1000, 9999) # unique lock token for extra safety
# Atomic lock claim: match unlocked or steal locks expired
result = messages.update_one(
{
"$and": [
# the location is not set
{ "location": None },
# the document is not locked, or locked expired including grace period
{ "$or": [ { "lock": { "$exists": False } }, { "lock.until": { "$lt": now - MAX_CLOCK_SKEW } } ] }
]
},
{ "$set": { "lock": { "by": WORKER_ID, "until": lock_expiry, "token": token } }}
)
if result.modified_count == 0:
return None
# Fetch exactly the doc we locked — match by worker, expiry, AND token
doc = messages.find_one({ "lock.by": WORKER_ID, "lock.until": lock_expiry, "lock.token": token })
if doc:
print(f"[{WORKER_ID}] {threading.current_thread().name} claimed IP {doc['originatingIp']} with token={token}")
else:
print(f"[{WORKER_ID}] {threading.current_thread().name} claim succeeded but fetch failed — possible race?")
return doc
# --- Call the public API ---
def fetch_location(ip):
url = f"http://ip-api.com/json/{ip}"
try:
resp = requests.get(url, timeout=30)
if resp.status_code == 200:
return resp.json()
print(f"[API] Error: HTTP {resp.status_code} for {ip}")
return None
except Exception as e:
print(f"[API] Exception for {ip}: {e}")
return None
# --- Process messages in a loop ---
def process_document():
start_time = time.time()
timeout = SECONDS # seconds
thread_name = threading.current_thread().name
while True:
# Try to claim a doc
doc = claim_document()
if doc:
# We successfully claimed a doc — process it
ip = doc["originatingIp"]
location_data = fetch_location(ip)
if not location_data:
print(f"[{WORKER_ID}] {thread_name} failed to fetch location for {ip}")
return
# Final update only if lock is still valid
now = utcnow()
result = messages.update_one(
{
"_id": doc["_id"],
"lock.by": WORKER_ID,
"lock.until": {"$gte": now},
"lock.token": doc["lock"]["token"]
},
{
"$set": {"location": location_data},
"$unset": {"lock": ""}
}
)
# No doc claimed — check elapsed time before wait and retry
elapsed = time.time() - start_time
if elapsed >= timeout:
print(f"[{WORKER_ID}] {thread_name} exiting after {elapsed:.2f}s")
return
time.sleep(5) # avoid hammering DB and the public API
# --- Initialize and run multiple processing threads ---
def main():
print(f"\nInserting documents")
insert_test_docs()
print(f"\nStarting threads")
threads = []
for i in range(THREADS):
tname = f"T{i}"
t = threading.Thread(target=process_document, name=tname)
t.start()
threads.append(t)
for t in threads:
t.join()
print(f"\n[{WORKER_ID}] Check final documents:")
for doc in messages.find({}, {"originatingIp": 1, "location.query": 1, "location.country": 1, "location.message": 1, "lock.by": 1, "lock.until": 1}):
print(doc)
if __name__ == "__main__":
main()
MongoDB’s storage engine guarantees atomicity for each update_one through its WriteUnitOfWork and RecoveryUnit mechanisms. However, maintaining read consistency across multiple operations requires application-level coordination. In this implementation, that coordination is provided by an atomic claim with conditional criteria, ensuring that only one worker can lock an unprocessed or expired document at a time.
Several safeguards mitigate race conditions. The claim step narrows matches using the worker ID, lock expiry, and a random token. The final update then re-verifies all these fields before committing changes, preventing stale or stolen locks from being applied. Lock expiration enables automatic recovery from failures, and a small grace window accounts for clock skew in distributed systems.
Write conflicts during concurrent updates are automatically resolved at the storage layer via optimistic concurrency control. This ensures correctness without blocking other operations. The result is a robust, non-blocking parallel processing workflow that preserves document-level ACID guarantees while scaling effectively in shared or cloud environments.
In this design, each thread processes one message at a time, in index order. Enforcing strict global message ordering would be more complex. The primary goal here is the scalability of the parallel processing.
When migrating from PostgreSQL to MongoDB—like between any two databases—avoid a direct feature-by-feature mapping, because the systems are fundamentally different. SKIP LOCKED works around blocking FOR UPDATE reads in PostgreSQL, while reads and writes do not block in MongoDB. Instead of replicating another database behavior, clarify the business requirement and design the most appropriate solution. In this example, rather than relying on generic transaction control like SQL, we modeled object states—such as claim acquisition and expiration—and store that state directly in the documents.
TL;DR: If you’re used to traditional SQL databases and synchronous request–response flows—where you read your writes in the same transaction or session—use the "majority" read concern in MongoDB and you will have the highest isolation and durability you can expect from a database. It’s not the default, but it’s safe to change it for your connection. The default is optimized for event-driven, microservice architectures with asynchronous communication, where lower latency is preferred even if it means sometimes reading a state that may later be rolled back.
PostgreSQL users typically expect writes to become visible to other sessions only after they’re acknowledged, either via auto-commit DML or an explicit COMMIT. By contrast, in MongoDB, you must enable the "majority" read concern to achieve similar ACID guarantees, and this is not the default. It may seem surprising that MongoDB offers the strongest consistency option—full ACID semantics in a distributed database—yet doesn’t enable it by default, despite seemingly no significant performance impact. This caught my attention and made me want to understand the reasoning behind it. NoSQL and SQL now address similar use cases, but their origins are fundamentally different. Let’s explore that.
In the SQL standard, isolation levels were first defined by the anomalies (phenomena) that can occur when concurrent sessions read and write the same data. But these definitions were tied to a specific lock-based implementation rather than an abstract model: they assumed that reads and writes use locks and that active transactions share and modify a single current database state.
In reality, many databases chose different designs for scalability:
To understand isolation and durability in MongoDB, we must first consider read and write concerns independently, especially in a replicated, distributed setup where reads and writes can hit different servers. Then we can examine how they interact when we read after writing.
First, let’s distinguish isolation and durability — the I and D in ACID:
Initially, these definitions assumed a single-node database. In modern systems, durability must also handle network and data center failures, so data is persisted across multiple nodes rather than just on a local disk.
A commit, whether in an explicit transaction or implicit in a write operation, typically proceeds as follows:
Durability and isolation each involve multiple operations, and their order can vary. The sequence above matches PostgreSQL with synchronous_commit = on, or MongoDB with w:majority and a majority read concern in other sessions.
Other configurations are possible. For example, Oracle Database uses a different order for durability and isolation, making changes visible before the redo log is flushed (except when paranoid_concurrency_mode is set). With PostgreSQL synchronous_commit = local or MongoDB w:1, acknowledgment occurs before global durability. With MongoDB’s local read concern, data becomes visible before it is durable.
Why isn’t the above sequence—which seems to offer the strongest isolation and durability—the default in MongoDB?
There is another anomaly not described by the SQL standard, which assumes that read and write locks on a single database state are mutually exclusive. With MVCC, a transaction instead works with two states:
Because the read time is earlier than the write time, another anomaly can occur:
In a non-MVCC database with blocking reads, this preserves causality because, in step 2, microservice B requires a share lock and waits on an exclusive lock acquired by A and released at step 3, so B sees the write only after it acquires the share lock, after step 3. Non-MVCC is rare (e.g., DB2 or SQL Server without RCSI isolation level), but SQL isolation levels were defined based on it, and didn't mention causality.
Keep in mind that in this example, the application doesn’t wait for the write acknowledgment before telling the other service to read, yet it still expects the write to be complete when the read occurs. Read-after-write causality was guaranteed with read locks in the non-MVCC database.
However, in an MVCC database, as in most modern systems, microservice B may read a state from before a write is visible, causing a read-after-write anomaly. If the write is acknowledged only locally—for example, PostgreSQL with synchronous_commit = local or MongoDB with w:1—it will likely be visible by the time B receives the notification, because the write usually completes faster than the notification is delivered.
By contrast, PostgreSQL with synchronous_commit = on, or MongoDB with majority read concern, may not see the write yet if it has not been replicated to a majority. Thus, when using w:1, users should select the local read concern to avoid read-after-write anomalies. w:1 is not the default. Still, it can be chosen to reduce latency, at the risk of losing events on failure—something event-driven architectures can often tolerate.
With PostgreSQL synchronous_commit = on or MongoDB w:majority (the default), writes incur extra network latency because they must wait for remote acknowledgment. In this case, the scenario can still show a read-after-write anomaly if the majority has not yet acknowledged microservice A's write when microservice B reads. Using MongoDB local read concern avoids this anomaly, but risks reading data that might later be rolled back on failure.
The default read concern is well-suited to event-driven architectures. As event-driven systems were a primary use case for NoSQL databases like MongoDB, retaining this default makes sense, at least for backward compatibility. Users also often expect reads to return the latest changes, even if those changes have not yet been acknowledged in the thread that performed the write operation.
Today, MongoDB is also used with traditional architectures, where it’s reasonable to prefer durability over fast visibility and use the "majority" read concern. This adds no performance penalty, because you already paid the synchronization latency when waiting for the write acknowledgment. "Majority" read concern sets the read time to the last commit time, while keeping reads local. It can wait in rare cases, such as during instance startup or rollback, until it can obtain a committed timestamp snapshot, or when secondaries are unavailable or lagging. But generally, there's no performance impact.
Unlike SQL databases—which must guarantee consistency for any DML executed by any user, including non-programmers at the command line—MongoDB shifts more responsibility to developers. Instead of relying on a one-size-fits-all default, developers must configure their session or connection by choosing:
w:majority for durability over network or data center failures),majority, or snapshot for stronger consistency in multi-shard transactions), andThis post has results for RocksDB performance using db_bench on 8-core and 48-core servers. I previously shared results for RocksDB performance using gcc and clang and then for RocksDB on a small Arm server.
tl;dr
Software
I used RocksDB versions 9.8 through 10.0.
I compiled each version clang version 18.3.1 with link-time optimization enabled (LTO). The build command line was:
flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )# for clang+LTOAR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \make "${flags[@]}" static_lib db_bench
I used servers with 8 and 48 cores, both run Ubuntu 22.04:
Benchmark
Overviews on how I use db_bench are here and here.
Most benchmark steps were run for 1800 seconds and all used the LRU block cache. I try to use Hyperclock on large servers but forgot that this time.
Tests were run for three workloads:
Relative QPS
Many of the tables below (inlined and via URL) show the relative QPS which is:
(QPS for my version / QPS for RocksDB 9.8)
The base version varies and is listed below. When the relative QPS is > 1.0 then my version is faster than RocksDB 9.8. When it is < 1.0 then there might be a performance regression or there might just be noise.
The spreadsheet with numbers and charts is here. Performance summaries are here.
Results: cached database (byrx)
From 1 client on the 8-core server
From 36 clients on the 48-core server
Results: IO-bound with buffered IO (iobuf)
From 1 client on the 8-core server
From 1 client on the 8-core server
From 36 clients on the 48-core server
This has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.
tl;dr - for Postgres 18.1 relative to 12.22
(QPS for some version) / (QPS for base version)
This paper (HOTNETS'25) re-teaches a familiar systems lesson: caching is not just about reducing latency, it is also about saving CPU! The paper makes this point concrete by focusing on the second-order effect that often dominates in practice: the monetary cost of computation. The paper shows that caching --even after accounting for the cost of DRAM you use for caching-- still yields 3–4x better cost efficiency thanks to the reduction in CPU usage. In today's cloud pricing model, that CPU cost dominates. DRAM is cheap. Well, was cheap... I guess the joke is on them now, since right after this paper got presented, the DRAM prices jumped by 3-4x! Damn Machine Learning ruining everything since 2018!
Anyways, let's ignore that point conveniently to get back to the paper. Ok, so caches do help, but when do they help the most? Many database-centric or storage-side cache designs miss this point. Even when data is cached at the storage/database cache, an application read still needs to travel there, pay for RPCs, query planning, serialization, and coordination checks.
The paper advocates for moving the caches as close to the application as possible to cut costs for CPU. The key argument is that application-level linked caches deliver far better cost savings than storage-layer caches. By caching fully materialized application objects and bypassing the storage/database read path entirely, linked caches eliminate query amplification and coordination overhead. Across production workloads, this yields 3–4x better cost efficiency than storage-layer caching, easily offsetting the additional DRAM cost. Remote caches help, but still burn CPU on RPCs and serialization. Storage-layer caches save disk I/O but leave most of the query and coordination path intact, delivering the weakest cost savings. The results are consistent across different access skews and read intensities, reinforcing that cache placement dominates cache size.
So that is the gist of the paper. The paper makes two adjacent points. Special cases of this observation, if you will. And let's cover them for completeness.
The first point is rich-object workloads, which is where the most striking evaluation results come from. For services where a single logical read expands into many database queries (e.g., metadata services and control planes), caching fully materialized objects at the application level avoids query amplification entirely. And this yields up to an order-of-magnitude cost reduction versus uncached reads and roughly 2x improvement over caching denormalized key-value representations.
The second result, a negative result, is also important. Adding even lightweight freshness or version checks largely erases these gains, because the check itself traverses most of the database stack. The experiments make clear that strong consistency remains fundamentally at odds with the cost benefits of application-level caching. The paper leaves this as an open challenge, saying that we still lack a clean, low-cost way to combine strong consistency with the economic benefits of application-level caching. I think it is possible to employ leases to trade off an increase in update latency with cost efficiency, and alleviate this problem. Or we could just say: Cache coherence is hard, let's go shopping for CXL!
Overall, the paper quantifies something many practitioners intuit but rarely measure. If you care about cost (also monetary cost), move caching up the stack, cache rich objects, and trade memory against CPU burn.
As usual Aleksey and I did a live-reading of the paper. And as usual we had a lot to argue and gripe about. Above is a recording of our discussion, and this links to my annotated paper.
Of course, Aleksey zeroed in on the metastability implications right from the abstract. And yes the metastability implications remained unaddressed in the paper. If you cut costs and operate at lower CPU provisioning (thanks to this cache assist), you are making yourself prone to failure by operating at maximum utilization, without any slack. That means, the moment the cache fails or becomes inaccessible, your application will also get overwhelmed by 2-3x more traffic than it can handle and suffer unavailability or metastability.
I had some reservations about application-level caches. They are undeniably effective, but they lack the reusability and black-box nature of storage-layer caching. Storage-side caching is largely free, transparent, and naturally shared across nodes and applications. Application-level caching, by contrast, requires careful design and nontrivial development effort. It also sacrifices reuse and sharing, since each application must manage its own cache semantics and lifecycle. I wish the paper could discuss these costs and tradeoffs.
Writing, after the introduction section, was repetitive and sub par. Sections 2 and 3 largely repeated the Introduction and wasted space. Then we only had 2 paragraphs of the Theoretical Analysis section, which we actually looked forward to reading. That section is effectively cropped out of the paper, when it makes the core of the arguments for the paper.
The paper's subtitle (see headers on Page 3, 5, 7) is a copy-paste error from the authors' HotNets 2024 paper. There did not seem to be any camera-ready time checks on the paper. To motivate strong consistency, the paper drops several citations in Section 2.3, calling them as recent work. Only 2 out of 6 of these are after 2014. The figures were sloppy as well. Did you notice Figure 6 above? The y-axis are not covering the same ranges, which makes it very hard to compare about the subfigures. The y-axis in Figure 5 uses relative costs, which is also of not much use. It may be that in 2025 most people use LLMs to read papers, but one should still write papers as if humans will read them, past the introduction section, and line by line to understand and check the work.
Finally, here is an interesting question to ponder on. Does this paper conflict with storage disaggregation trend?
At first glance, the paper appears to push against the storage disaggregation trend by arguing for tighter coupling between computation and cached data to meet real-time freshness constraints. In reality, it does not reject disaggregation but warns that disaggregated designs require additional caching above the storage layer. Just storage side caching would not be able to suffice from a latency as well as cost perspective! The paper also points to a hidden cost: freshness guarantees degrade when cache coherence is treated as a best-effort side effect of an eventually consistent pipeline. The paper's message is that disaggregation needs explicit freshness semantics and coordination mechanisms. So maybe a corollary here is that, we should expect disaggregated systems to inevitably grow "stateful edges" over time in order to recover performance and control.
I've realized that chess has been eating my downtime. Because it lives on my phone (Lichess), it is frictionless to start a bullet game, and get a quick dopamine hit.
The problem is that I no longer get bored. That is bad. I need to get bored so I can start to imagine, daydream, think, self-reflect, plan, or even get mentally prepared for things (like the Stoics talked about). I badly need that empty space back.
So bye chess. Nothing personal. I will play only when teaching/playing with my daughters. I may occasionally cheat and play a bullet game on my wife's phone. But no more chess apps on my phone.
While I was at it, I installed the Website Blocker extension for Chrome. I noticed my hands typing reddit or twitter at the first hint of boredom. The blocker is easy to disable, but that is fine. I only need that slight friction to catch myself before opening the site on autopilot.
In 2008, Reddit had a popular thread on Attacking Generals problem with about 300 confident and largely misguided comments. I shared it with two senior distributed researchers, with the subject line "Proof that distributed reasoning is really tricky".
One asked: "Who are these people and where do they come from?" The other mentioned this book: "The Dumbest Generation: How the Digital Age Stupefies Young Americans and Jeopardizes Our Future"
Today, Hacker News is at the same threshold. People with little understanding discuss opinions with other uninformed people back and forth.
I am aware this comes across cranky. I am not angry, and I am not trying to gatekeep or anything. I am glad to see interest in these topics. I just want to see a higher signal to noise ratio in these discussions. Right now, maybe less than 10% of the comments say something useful or substantial.
Marc Brooker has a quadrant model to explain why this discourse happens. As he also admits, we can do better, and the forums can design incentives that reward insight instead of noise.
A Trader Joe's opened nearby. Now we eat simit for breakfast. Trader Joe's simit is spot-on authentic. As a Turk, this is an amazing treat. Simit is deeply nostalgic. Every kid loves it. Adults too. It's simple, inexpensive, and perfect.
They also had frozen baklava. It was very fresh, authentic, delicious, and inexpensive as well. Past tense because, it turns out this was seasonal.
Acquired did an episode on Trader Joe's. Worth a listen. I am always impressed by how helpful the Trader's Joe associates are. Shopping should feel like this. Hey Trader Joe's, after this shout-out, you should sponsor me with unlimited simit. Or at least make baklava year-round. Please.
A Yemeni coffee chain opened nearby. This appears to be the year of pistachio. First Dubai chocolate, now this. We tried pistachio lattes, hot and cold. The cold one was better, and looked better too. Both were very sweet. I had to return the hot one, and ask for a replacement with less Pistaschio paste, so I don't die of sweet Pistaschio overdose on the spot. The caramel tres leches was excellent. Really light, soft, and moist.
Yemeni coffee tastes better. I am done with Starbucks. The Starbucks branch near us is consistently awful. Some locations are better, but overall, I am finished with it. I hear good things about Luckin Coffee. That is next on my list.
Zootopia 1 was better.
What happened to scriptwriting? The industry no longer seems to try. No creativity. Not even basic logic. In most movies, there are plot holes large enough to drive a garbage truck through.
Wake Up Dead Man: A Knives Out Mystery was good, but not as engaging as the first.
I did like The Great Flood, a Korean movie on Netflix. At least it is an engaging machine learning movie.
No good Christmas movie this year? Time to watch Die Hard again.
Baby vs Man: Rowan Atkinson is brilliant. Unfortunately, I have a strong stress reaction to watching stupidity unfold slowly and inevitably. Atkinson is exceptionally good at this, which is exactly why I cannot bring myself to watch it.
For users used to SQL databases, MongoDB’s consistency model can be confusing. Typical SQL systems layer read replicas on top of data recovery, not into the query layer. MongoDB instead includes replication into its read and write paths, extending ACID guarantees across a horizontally scalable cluster.
The SQL world has similar misconceptions, especially the belief that read replicas can safely scale horizontally—until data is lost. This article follows the six myths from Alexander Kukushkin’s PGConf.DE talk, Myths and Truths about Synchronous Replication in PostgreSQL (slides). For each myth, I relate the idea to MongoDB’s write concern, read concern, and majority commit point.
In MongoDB, writeConcern: { w: 1 } acknowledges a write as soon as the primary accepts it, before replication. With writeConcern: { w: "majority" } — the default for most replica set configurations — the write is acknowledged only after it has been replicated to a majority of voting nodes.
PostgreSQL’s synchronous commit is similar to MongoDB’s majority write concern, except that in PostgreSQL, replicas push changes directly from the primary and the primary waits for synchronous network confirmation from each selected standby. MongoDB replicas pull changes from any peer, and send a commit progress report back to the primary.
🐘 In PostgreSQL, commits are always local first: XLogFlush(XactLastRecEnd) runs before waiting for synchronous replication via SyncRepWaitForLSN(XactLastRecEnd, true).
The transaction is not visible until the synchronous standby has received, flushed, or applied the WAL because locks remain held. If that wait is interrupted — by cancellation, connection loss, or restart — locks may be released early. The transaction can then be visible but not yet durable and may be lost on failover.
These are “dirty reads” in terms of durability, not isolation — they read only committed local data, but that data can still be rolled back in case of failure.
🍃 MongoDB behaves similarly with { w: "majority" } (default in most configurations): it waits for acknowledgment after writing locally. But MongoDB does not hold locks. Visibility is controlled entirely by the read concern. With the default readConcern: "local", you see the change before it is committed to the quorum, and it is therefore subject to rollback on failover. 🌱 With readConcern: "majority", you only read from a point in time where all changes are quorum‑durable.
🐘 In PostgreSQL, if synchronous commit wait is cancelled — by query cancel, TCP connection reset, or server restart — the transaction becomes visible immediately. The application receives a warning:
The transaction has already committed locally, but might not have been replicated to the standby.
If the primary fails before standby confirmation, the promoted standby may miss the transaction, resulting in data loss if the application ignored the warning.
If the TCP connection between client and server is interrupted at the wrong moment, the application may not know whether the transaction committed after disconnect — and it will not have received a warning. To verify, you must query data back, or get the transaction ID (txid_current()) before committing, and check it after reconnect using txid_status().
🍃 MongoDB is similar: if the client loses its connection after sending the write but before receiving the acknowledgment for w: "majority", the commit status is unclear. 🌱 However, the driver can handle this with retryable writes — specifying retryWrites: true — for idempotent operations, and for writes with deterministic keys such as a driver‑generated ObjectId. Retrying with the same _id will either match the existing document or insert it once.
MongoDB uses MVCC and reads from a recent state of the database that excludes uncommitted changes. By default, readConcern: "local" returns the most recent node‑local commit, durable to that node’s own disk (fdatasync). A write with { w: "majority"} may be visible before it is acknowledged by the quorum, and can be rolled back in a failover.
To extend ACID beyond a single node, readConcern: "majority" guarantees that what you read has been acknowledged by a majority of voting members and is durable cluster‑wide. It does so by reading from the commit snapshot corresponding to the majority commit point.
🐘 In PostgreSQL, a standby can show a transaction before the primary finishes waiting for other standbys. The ACID properties in PostgreSQL are not automatically extended to read replicas.
🍃 In MongoDB you can read from any replica with readConcern: "majority" and guarantee that it reads the same durable state as the primary. Drivers can automatically distribute reads with readPreference: "secondary" or "secondaryPreferred". 🌱 Adding readConcern: "majority" ensures that all those reads see the majority‑committed snapshot. Replicas can lag a little, but that works in an MVCC database where reads do not acquire locks.
In MongoDB replica sets, the majority commit point is a logical timestamp indicating that all operations up to that point have been replicated and made durable on a majority of members. The primary advances this point when it observes a higher point on a majority of nodes. With majority write concern, MongoDB acknowledges a write only when it is at or before this point, guaranteeing survival across elections and failovers.
🐘 In PostgreSQL, setting a node to synchronous doesn’t synchronise it instantly. The standby must first reach the "streaming" state, with zero lag, before it can serve as a synchronous standby. Until then, there is a catch‑up window during which promotion can result in missing transactions.
🍃 MongoDB uses the majority commit point in elections. 🌱 If writes use w: "majority", the elected node always has those writes acknowledged, avoiding the catch‑up hazard.
🐘 Even in synchronous mode, a PostgreSQL standby can miss certain WAL changes not generated by client transactions — for example, VACUUM — leading to divergence after failover. This cannot be fixed by simple roll‑forward. pg_rewind must identify and copy the differing blocks from the new primary to reinstate the old primary as a standby.
🍃 In MongoDB, rollback when a node rejoins can happen if the oplog has diverged, but normally only with w: 1. 🌱 With writeConcern: "majority" — default in most deployments — writes wait for the commit point to advance and are protected from rollback in the most common cases.
With w: 1, MongoDB only performs a local disk write, so latency is minimal. With w: "majority", it waits for replication to enough nodes and their disk commits (j: true is the default), so each write can incur multiple intra‑cluster and cross‑region RTTs.
For reads, readConcern: "local" avoids extra RTTs by reading the latest local commit. readConcern: "majority" also does not require extra network hops. It reads from the local snapshot corresponding to the majority commit point.
Linearizable reads in MongoDB have the highest cost: they require a no-op write and wait for majority replication before returning, adding a full RTT to the read.
🐘 In PostgreSQL, synchronous replication increases latency proportional to RTT between nodes. The push‑based approach also increases primary CPU and network overhead to transmit WAL to all standbys.
🍃 MongoDB behaves similarly with w: "majority", but the pull‑based replication puts less pressure on the primary. 🌱 Secondaries can fetch changes from other secondaries, reducing primary load.
Synchronous replication in PostgreSQL and majority write concern in MongoDB are aimed at the same goal: protecting committed data against failover. Both commit locally first, then wait for quorum confirmation, but PostgreSQL’s locking model delays visibility, whereas MongoDB lets you choose visibility via read concern.
Warnings in PostgreSQL about cancelled sync waits are critical to avoid silent data loss. In MongoDB, retryable writes (retryWrites: true) and idempotent operations solve similar problems of uncertainty after a disconnect.
Read replicas in PostgreSQL do not automatically carry the same durability as the primary. In MongoDB, majority read concern enforces that. PostgreSQL requires manual lag checks before safe promotion. MongoDB elections ensure the majority commit point is present. PostgreSQL can still diverge on non‑transactional WAL and require pg_rewind after a failover, while MongoDB avoids rollback for majority writes.
Performance costs rise in both systems when ensuring strongest guarantees, with PostgreSQL’s push model loading the primary and MongoDB’s pull model distributing replication work.
Knowing exactly what these guarantees mean—and when they don’t apply—is essential to using them safely. In short, MongoDB by default offers similar protection, along with additional operational benefits from its built-in orchestration and cluster-aware drivers. 🌱 By using a {w: "majority"} write concern, MongoDB can extend ACID properties to a horizontally scalable cluster.