a curated list of database news from authoritative sources

July 02, 2025

Testing ReadySet as a Query Cacher for PostgreSQL (Plus ProxySQL and HAproxy) Part 2: Test Results

In the first post of this series (Testing ReadySet as a Query Cacher for PostgreSQL (Plus ProxySQL and HAproxy) Part 1: How-To), I presented my test environment and methodology and explained how to install ReadySet, ProxySQL, and HAproxy and configure them to work with PostgreSQL. In this final part, I present the different test scenarios […]

Chapter 7: Distributed Recovery (Concurrency Control Book)

Chapter 7 of the Concurrency Control and Recovery in Database Systems book by Bernstein and Hadzilacos (1987) tackles the distributed commit problem: ensuring atomic commit across a set of distributed sites that may fail independently.

The chapter covers these concepts:

  • The challenges of transaction processing in distributed database systems (which weren't really around in 1987)
  • Failure models (site and communication) and timeout-based detection
  • The definition and guarantees of Atomic Commitment Protocols (ACPs)
  • The Two-Phase Commit (2PC) protocol (and its cooperative termination variant)
  • The limitations of 2PC (especially blocking)
  • Introduction and advantages of the Three-Phase Commit (3PC) protocol

Despite its rigor and methodical development, the chapter feels like a suspense movie today. We, the readers, equipped with modern tools like the FLP impossibility result and the Paxos protocol, watch as the authors try to navigate a minefield, unaware of the lurking impossibility results published a couple of years earlier and the robust consensus frameworks (Viewstamped Replication and Paxos) that would emerge just a few years later.

Ok, let's dive in. 


Atomic Commitment Protocol (ACP) problem

The problem is to ensure that in the presence of partial failures (individual site failures), a distributed transaction either commits at all sites or aborts at all sites, and never splits the decision. The authors define the desired properties of ACPs through a formal list of conditions (AC1–AC5).

We know that achieving these in an asynchronous setting with even one faulty process is impossible, as the FLP impossibility result established in 1985. Unfortunately, this impossibility result is entirely absent from the chapter’s framework. The authors implicitly assume bounded (and known) message delays and processing times, effectively assuming a synchronous system. That is an unrealistic portrayal of real-world distributed systems, even today in data centers.

A more realistic framework for distributed systems is the partially asynchronous model. Rather than assuming known, fixed bounds on message delays and processing times, the partially asynchronous model allows for periods of unpredictable latency, with the only guarantee being that bounds exist, not that we know them. This model captures the reality of modern data centers, where systems usually operate efficiently but occasionally experience transient slowdowns or outages, during which any fixed bound would be violated and higher bounds may hold for a while before the system settles back to stable behavior. This also motivates the use of weak failure detectors, which cannot definitively distinguish between a crashed node and a slow one.

This is where Paxos enters the picture. Conceived just a few years after this chapter, Paxos provides a consensus protocol that is safe under all conditions, including arbitrary message delays, losses, and reordering. It guarantees progress only during periods of partial synchrony, when the system behaves reliably enough for long enough, but it never violates safety even when conditions degrade. This doesn't conflict with what the FLP impossibility result of 1985 proves: you cannot simultaneously guarantee both safety and liveness in an asynchronous system with even one crash failure. But that doesn't mean you must give up on safety. In fact, the brilliance of Paxos lies in this separation: it preserves correctness unconditionally and defers liveness until the network cooperates. This resilience is exactly what's missing in the ACP designs of Bernstein and Hadzilacos even when using 3PC protocols.

If you'd like a quick intro to the FLP and the earlier Coordinated Attack impossibility results, these three posts would help.


2PC and 3PC protocols

The authors first present the now-classic Two-Phase Commit (2PC) protocol, where the coordinator collects YES/NO votes from participants (the voting phase) and then broadcasts a COMMIT or ABORT (the decision phase). While 2PC satisfies AC1–AC4 in failure-free cases, it fails AC5 under partial failures. If a participant votes YES and then loses contact with the coordinator, it is stuck in an uncertainty period, unable to decide unilaterally whether to commit or abort. The authors provide a cooperative termination protocol, where uncertain participants consult peers to try to determine the outcome. It reduces, but does not eliminate, blocking.
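To make the two phases concrete, here is a minimal in-memory sketch of the coordinator's logic in JavaScript. This is an illustration, not the book's formal protocol: the participant objects and the vote() and deliver() functions are made up for this example, and timeouts, logging, and recovery are omitted.

// Phase 1 (voting): collect a YES/NO vote from every participant.
// Phase 2 (decision): broadcast COMMIT only if every vote was YES, otherwise ABORT.
function twoPhaseCommit(participants) {
  const votes = participants.map(p => p.vote());                    // voting phase
  const decision = votes.every(v => v === "YES") ? "COMMIT" : "ABORT";
  participants.forEach(p => p.deliver(decision));                   // decision phase
  return decision;
}

// Hypothetical participants, for illustration only.
const participants = [
  { vote: () => "YES", deliver: d => console.log("p1:", d) },
  { vote: () => "NO",  deliver: d => console.log("p2:", d) },
];
console.log("coordinator decided:", twoPhaseCommit(participants));  // -> ABORT

The blocking problem lives exactly in what this sketch glosses over: a participant that has voted YES but never receives the decision message cannot tell whether the outcome was COMMIT or ABORT, so it must block until it can learn the decision.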

Thus comes the Three-Phase Commit (3PC) protocol, which attempts to address 2PC's blocking flaw by introducing an intermediate state: PRE-COMMIT. The idea is that before actually committing, the coordinator ensures all participants are "prepared" and acknowledge that they can commit. Only once everyone has acknowledged this state does the coordinator send the final COMMIT. If a participant times out during this phase, it engages in a distributed election protocol and uses a termination rule to reach a decision. 
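Continuing the sketch above, the structural change in 3PC is the extra round: nobody commits until every participant has acknowledged that it is able to commit. Again, this is only an illustration under the same made-up names; the election and termination rule that run on timeout are not modeled.

function threePhaseCommit(participants) {
  // Phase 1 (voting): same as in 2PC.
  const votes = participants.map(p => p.vote());
  if (!votes.every(v => v === "YES")) {
    participants.forEach(p => p.deliver("ABORT"));
    return "ABORT";
  }
  // Phase 2 (pre-commit): announce the tentative COMMIT; in the real protocol
  // the coordinator now waits for an acknowledgement from every participant
  // (not modeled in this in-memory sketch). No participant commits yet.
  participants.forEach(p => p.deliver("PRE-COMMIT"));
  // Phase 3 (commit): only after all acknowledgements arrive is the final
  // COMMIT sent. A timeout at any step is handled by the election and
  // termination rule, never by a unilateral decision.
  participants.forEach(p => p.deliver("COMMIT"));
  return "COMMIT";
}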

Indeed, in synchronous systems, 3PC is non-blocking and provides an improvement over 2PC. The problem is that 3PC relies critically on timing assumptions, always requiring bounded message and processing delays. The protocol's reliance on perfect timeout detection and a perfect failure detector makes it fragile. As a secondary problem, the 3PC protocol discussed in the book (Skeen 1982) has also been shown to contain subtle bugs, even in the synchronous model.


In retrospect

Reading this chapter today feels like watching a group of mountaineers scale a cliff without realizing they’re missing key gear. I spurted out my tea when I read these lines in the 3PC discussion. "To complete our discussion of this protocol we must address the issue of elections and what to do with blocked processes." Oh, no, don't go up that path without Paxos and distributed consensus formalization!! But the book predates Paxos (1989, though published later), Viewstamped Replication (1988), and the crystallization of the consensus problem. It also seems to be completely unaware of the FLP impossibility result (1985), which should have stopped them in their tracks.

This chapter is an earnest and technically careful work, but it's flying blind without the consensus theory that would soon reframe the problem. The chapter is an important historical artifact. It captures the state of the art before consensus theory illuminated the terrain. The authors could not see that the distributed commit problem contains within it the distributed consensus problem, and that all the impossibility, safety, and liveness tradeoffs that apply to consensus apply here too.

Modern distributed database systems use Paxos-based commit, often by layering 2PC over Paxos/Raft groups of participant sites. See, for example, our discussion and TLA+ modeling of distributed transactions in MongoDB.
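Here is a hedged sketch of that layering, with replicate() standing in for a Paxos/Raft round inside one group; the names are illustrative and not any particular system's API.

function commitAcrossGroups(coordinatorGroup, participantGroups) {
  // Each participant is itself a consensus group: its prepare record and vote
  // are replicated, so they survive the crash of any single replica.
  const votes = participantGroups.map(g => g.replicate("PREPARE"));
  const decision = votes.every(v => v === "YES") ? "COMMIT" : "ABORT";
  // The coordinator is a consensus group too: the decision is replicated before
  // it is announced, so a single coordinator crash no longer blocks the others.
  coordinatorGroup.replicate(decision);
  participantGroups.forEach(g => g.replicate(decision));
  return decision;
}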


Miscellaneous

This is funny. Someone is trolling on Wikipedia, trying to introduce Tupac as an alternative way to refer to 2PC. 






July 01, 2025

Fluent Commerce’s approach to near-zero downtime Amazon Aurora PostgreSQL upgrade at 32 TB scale using snapshots and AWS DMS ongoing replication

Fluent Commerce, an omnichannel commerce platform, offers order management solutions that enable businesses to deliver seamless shopping experiences across various channels. Fluent uses Amazon Aurora PostgreSQL-Compatible Edition as its high-performance OLTP database engine to process their customers’ intricate search queries efficiently. Fluent Commerce strategically combined AWS-based upgrade approaches—including snapshot restores and AWS DMS ongoing replication—to seamlessly upgrade their 32 TB Aurora PostgreSQL databases with minimal downtime. In this post, we explore a pragmatic and cost-effective approach to achieving near-zero downtime during database upgrades: a snapshot restore followed by continuous replication using AWS DMS.

Accelerate SQL Server to Amazon Aurora migrations with a customizable solution

Migrating from SQL Server to Amazon Aurora can significantly reduce database licensing costs and modernize your data infrastructure. To accelerate your migration journey, we have developed a migration solution that offers ease and flexibility. You can use this migration accelerator to achieve fast data migration and minimum downtime while customizing it to meet your specific business requirements. In this post, we showcase the core features of the migration accelerator, demonstrated through a complex use case of consolidating 32 SQL Server databases into a single Amazon Aurora instance with near-zero downtime, while addressing technical debt through refactoring.

Testing ReadySet as a Query Cacher for PostgreSQL (Plus ProxySQL and HAproxy) Part 1: How-To

A couple of weeks ago, I attended a PGDay event in Blumenau, a city not far away from where I live in Brazil. Opening the day were former Percona colleagues Marcelo Altmann and Wagner Bianchi, showcasing ReadySet’s support for PostgreSQL. Readyset is a source-available database cache service that differs from other solutions by not relying […]

Benchmarking Postgres

Benchmarking Postgres in a transparent, standardized, and fair way is challenging. Here, we take an in-depth look at how we did it.

June 30, 2025

Strong consistency 👉🏻 MongoDB highly available durable writes

In the previous post, I used strace to display all calls to write and sync to disk from any MongoDB server thread:

strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0

Adding replicas for High Availability

I did this with a single server, started with the Atlas CLI. Let's do the same on a replica set with three servers. I start it with the following Docker Compose file:

services:  

  mongo-1:  
    image: mongo:8.0.10  
    ports:  
      - "27017:27017"  
    volumes:  
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro  
      - mongo-data-1:/data/db  
    command: mongod --bind_ip_all --replSet rs0  
    networks:  
      - mongoha

  mongo-2:  
    image: mongo:8.0.10  
    ports:  
      - "27018:27017"  
    volumes:  
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro  
      - mongo-data-2:/data/db  
    command: mongod --bind_ip_all --replSet rs0 
    networks:  
      - mongoha

  mongo-3:  
    image: mongodb/mongodb-community-server:latest  
    ports:  
      - "27019:27017"  
    volumes:  
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro  
      - mongo-data-3:/data/db  
    command: mongod --bind_ip_all --replSet rs0   
    networks:  
      - mongoha

  init-replica-set:  
    image: mongodb/mongodb-community-server:latest  
    depends_on:  
      - mongo-1  
      - mongo-2  
      - mongo-3  
    entrypoint: |  
      bash -xc '  
        sleep 10  
        mongosh --host mongo-1 --eval "  
         rs.initiate( {_id: \"rs0\", members: [  
          {_id: 0, priority: 3, host: \"mongo-1:27017\"},  
          {_id: 1, priority: 2, host: \"mongo-2:27017\"},  
          {_id: 2, priority: 1, host: \"mongo-3:27017\"}]  
         });  
        "  
      '     
    networks:  
      - mongoha

volumes:  
  mongo-data-1:  
  mongo-data-2:  
  mongo-data-3:  

networks:  
  mongoha:  
    driver: bridge  

I started this with docker compose up -d ; sleep 10, then ran the strace command. I connected to the primary node with docker compose exec -it mongo-1 mongosh.

Run some transactions

I've executed the same as in the previous post, with ten writes to a collection:

db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });

for (let i = 1; i <= 10; i++) {
 print(` ${i} ${new Date()}`)
 db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
 print(` ${i} ${new Date()}`)
}

 1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)

Here is the strace output during this:

[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61184) = 512 <0.000086>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55808) = 384 <0.000097>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000656>
[pid  8786] 10:05:38 <... fdatasync resumed>) = 0 <0.002739>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54528) = 384 <0.000129>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61696) = 512 <0.000094>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001070>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56192) = 384 <0.000118>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000927>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54912) = 384 <0.000112>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000687>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62208) = 512 <0.000066>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000717>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56576) = 384 <0.000095>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000745>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55296) = 384 <0.000063>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000782>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62720) = 512 <0.000084>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56960) = 384 <0.000080>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000814>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55680) = 384 <0.000365>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000747>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63232) = 512 <0.000096>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000724>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57344) = 384 <0.000108>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001432>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56064) = 384 <0.000118>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000737>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63744) = 512 <0.000061>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000636>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57728) = 384 <0.000070>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000944>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56448) = 384 <0.000105>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64256) = 512 <0.000092>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000742>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58112) = 384 <0.000067>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000704>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56832) = 384 <0.000152>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000732>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64768) = 512 <0.000061>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58496) = 384 <0.000062>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000653>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57216) = 384 <0.000102>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001502>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65280) = 512 <0.000072>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58880) = 384 <0.000123>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid  8786] 10:05:38 <... fdatasync resumed>) = 0 <0.001538>
[pid  8736] 10:05:38 <... fdatasync resumed>) = 0 <0.000625>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57600) = 384 <0.000084>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000847>
[pid  8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65792) = 512 <0.000060>
[pid  8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000661>
[pid  8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 59264) = 384 <0.000074>
[pid  8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000779>
[pid  8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57984) = 384 <0.000077>
[pid  8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000816>

I can see writes and syncs from three processes. Let's check which process belongs to which container:

for pid in 8736 8786 8889; do  
  cid=$(grep -ao 'docker[-/][0-9a-f]\{64\}' /proc/$pid/cgroup | head -1 | grep -o '[0-9a-f]\{64\}')  
    svc=$(docker inspect --format '{{ index .Config.Labels "com.docker.compose.service"}}' "$cid" 2>/dev/null)  
    echo "PID: $pid -> Container ID: $cid -> Compose Service: ${svc:-<not-found>}"  
done  

PID: 8736 -> Container ID: 93e3ebd715867f1cd885d4c6191064ba0eb93b02c0884a549eec66026c459ac2 -> Compose Service: mongo-3
PID: 8786 -> Container ID: cf52ad45d25801ef1f66a7905fa0fb4e83f23376e4478b99dbdad03456cead9e -> Compose Service: mongo-1
PID: 8889 -> Container ID: c28f835a1e7dc121f9a91c25af1adfb1d823b667c8cca237a33697b4683ca883 -> Compose Service: mongo-2

This confirms that by default, the WAL is synced to disk at commit on each replica and not only on the primary.
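As a side note, these requirements can be spelled out per operation with an explicit write concern. Here is a minimal mongosh sketch; the wtimeout value is an arbitrary illustration, and the options should match the default behavior observed above rather than change it:

// w: "majority" waits for a majority of members to acknowledge the write;
// j: true additionally waits for the journal to be synced to disk.
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);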

Simulate one node failure

[pid 8786] is mongo-1 and it is my primary:

rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
... 
mongo-1:27017

I stop one replica:

docker compose pause mongo-3

[+] Pausing 1/0
 ✔ Container pgbench-mongo-mongo-3-1  Paused

I run my updates again; they are not impacted by one replica being down:

rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
mongo-1:27017

rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
...  print(` ${i} ${new Date()}`)
...  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
...  print(` ${i} ${new Date()}`)
... }
...
 1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
 10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)

Simulate two nodes failure

I stopped another replica:

docker compose pause mongo-2

[+] Pausing 1/0
 ✔ Container demo-mongo-2-1  Paused    

As there is no quorum anymore, with only one member remaining in a replica set of three, the primary stepped down and can no longer serve reads or updates:

rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
...  print(` ${i} ${new Date()}`)
...  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
...  print(` ${i} ${new Date()}`)
... }
 1 Mon Jun 30 2025 09:28:36 GMT+0000 (Coordinated Universal Time)
MongoServerError[NotWritablePrimary]: not primary

Reads from secondary

The node that remains is now a secondary and exposes the last writes acknowledged by the majority:

rs0 [direct: secondary] test> db.mycollection.find()

[ { _id: 1, num: 20 } ]

rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")  

[ { _id: 1, num: 20 } ]

If the other nodes restart but are isolated from this secondary, the secondary still shows the same timeline-consistent, but possibly stale, reads.

I simulate that by disconnecting this node and restarting the others:

docker network disconnect demo_mongoha demo-mongo-1-1
docker unpause demo-mongo-2-1
docker unpause demo-mongo-3-1

As the two others form a quorum, one of them becomes primary and accepts the writes:

-bash-4.2# docker compose exec -it mongo-2 mongosh
Current Mongosh Log ID: 686264bd3e0326801369e327
Connecting to:          mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB:          8.0.10
Using Mongosh:          2.5.2

rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
...  print(` ${i} ${new Date()}`)
...  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
...  print(` ${i} ${new Date()}`)
... }
 1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
 4 Mon Jun 30 2025

The PG_TDE Extension Is Now Ready for Production

Lately, it feels like every time I go to a technical conference, someone is talking about how great PostgreSQL is. I’d think it’s just me noticing, but the rankings and surveys say otherwise. PostgreSQL is simply very popular. From old-school bare metal setups to VMs, containers, and fully managed cloud databases, PostgreSQL keeps gaining ground. And […]

June 28, 2025

Flush to disk on commit 👉🏻 MongoDB durable writes

A Twitter (𝕏) thread was filled with misconceptions about MongoDB, spreading fear, uncertainty, and doubt (FUD). This led one user to question whether MongoDB acknowledges writes before they are actually flushed to disk:

Doesn't MongoDB acknowledge writes before it's actually flushed them to disk?

MongoDB, like many databases, employs journaling—also known as write-ahead logging (WAL)—to ensure durability (the D in ACID) with high performance. This involves safely recording write operations in the journal, and ensuring they are flushed to disk before the commit is acknowledged. Further details can be found in the documentation under Write Concern and Journaling

Here is how you can test it, in a lab, with Linux STRACE and GDB, to debunk the myths.

Start the lab

I created a local MongoDB server. I use a single-node local Atlas cluster here, but you can do the same with replicas:

atlas deployments setup  atlas --type local --port 27017 --force

Start it if it was stopped, and connect with MongoDB Shell:

atlas deployments start atlas
mongosh

Trace the system calls with strace

In another terminal, I used strace to display the system calls (-e trace) to write (pwrite64) and sync (fdatasync) the files, with the file names (-yy), by the MongoDB server process (-p $(pgrep -d, mongod)) and its threads (-f), with the execution time and timestamp (-tT):

strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0

Some writes and syncs happen in the background

[pid 2625869] 08:26:13 fdatasync(11</data/db/WiredTiger.wt>) = 0 <0.000022>                                                                                    
[pid 2625869] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19072) = 384 <0.000024>                                             
[pid 2625869] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002123>                                                                 
[pid 2625868] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 128, 19456) = 128 <0.000057>                                             
[pid 2625868] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002192>                                                                 
[pid 2625868] 08:26:23 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19584) = 384 <0.000057>                                             
[pid 2625868] 08:26:23 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002068>                                                                 
[pid 2625868] 08:26:33 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19968) = 384 <0.000061>                                             
[pid 2625868] 08:26:33 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002747>                                                                 
[pid 2625868] 08:26:43 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20352) = 384 <0.000065>                                             
[pid 2625868] 08:26:43 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.003008>                                                                 
[pid 2625868] 08:26:53 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20736) = 384 <0.000075>                                             
[pid 2625868] 08:26:53 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002092>                                                                 
[pid 2625868] 08:27:03 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 21120) = 384 <0.000061>                                             
[pid 2625868] 08:27:03 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002527>                                                                 
[pid 2625869] 08:27:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.000033>                                                                 

Write to the collection

In the MongoDB shell, I created a collection and ran ten updates:

db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });

for (let i = 1; i <= 10; i++) {
 print(` ${i} ${new Date()}`)
 db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
 print(` ${i} ${new Date()}`)
}

strace produced the following output while running the loop of ten updates:

[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76288) = 512 <0.000066>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001865>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76800) = 512 <0.000072>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77312) = 512 <0.000056>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001641>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77824) = 512 <0.000043>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78336) = 512 <0.000175>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001944>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78848) = 512 <0.000043>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001829>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79360) = 512 <0.000043>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001917>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79872) = 512 <0.000050>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002260>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80384) = 512 <0.000035>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001940>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80896) = 512 <0.000054>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001984>                                                                 

Each write (pwrite64) to the journal files was followed by a sync to disk (fdatasync). This system call is well documented:

FSYNC(2)                                                         Linux Programmer's Manual                                                         FSYNC(2)

NAME
       fsync, fdatasync - synchronize a file's in-core state with storage device

DESCRIPTION
       fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to
       the disk device (or other permanent storage device) so that all changed information can be retrieved even if the  system  crashes  or  is  rebooted.
       This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.
...
       fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to  be  correctly  handled.   For  example,  changes  to  st_atime or st_mtime (respectively, time of last access and time of last modification
...
       The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.

Since I display both the committed time and the system call trace times, you can see that they match. The output related to the traces above demonstrates this alignment:

 1 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 2 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 3 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 4 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 5 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 6 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 7 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 8 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
 9 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
 10 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)

Multi-document transactions

The previous example ran ten autocommit updates, each calling a synchronisation to disk.
In general, with good document data modeling, a document should match the business transaction. However, it is possible to use multi-document transactions, and they are ACID (atomic, consistent, isolated, and durable). Using multi-document transactions also reduces the sync latency, as the sync is required only once per transaction, at commit.

I've run the following with five transactions, each running one update and one insert:


const session = db.getMongo().startSession();
for (let i = 1; i <= 5; i++) {
 session.startTransaction();
  const sessionDb = session.getDatabase(db.getName());
  sessionDb.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
  print(` ${i} updated ${new Date()}`)
  sessionDb.mycollection.insertOne( { answer:42 });
  print(` ${i} inserted ${new Date()}`)
 session.commitTransaction();
 print(` ${i} committed ${new Date()}`)
}

Strace still shows ten calls to pwrite64 and fdatasync. I used this multi-document transaction example to go further and prove that the commit not only triggers a sync to disk, but also waits for its acknowledgement before returning success to the application.

Inject some latency with gdb

To show that the commit waits for the acknowledgment of fdatasync, I set a GDB breakpoint on the fdatasync call.

I stopped strace, and started GDB with a script that adds a latency of five seconds to fdatasync:

cat > gdb_slow_fdatasync.gdb <<GDB

break fdatasync
commands
  shell sleep 5
  continue
end
continue

GDB

gdb --batch -x gdb_slow_fdatasync.gdb -p $(pgrep mongod)

I ran the five transactions, each with two writes. GDB shows when it hits the breakpoint:

Thread 31 "JournalFlusher" hit Breakpoint 1, 0x0000ffffa6096eec in fdatasync () from target:/lib64/libc.so.6 

My GDB script automatically waits five seconds and continues the program, until the next call to fdatasync.

Here was the output from my loop with five transactions:

 1 updated Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
 1 inserted Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
 1 committed Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
 2 updated Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
 2 inserted Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
 2 committed Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
 3 updated Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
 3 inserted Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
 3 committed Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
 4 updated Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
 4 inserted Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
 4 committed Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
 5 updated Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
 5 inserted Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)

The insert and update operations occur immediately, but the commit itself waits five seconds, because of the latency I injected with GDB. This demonstrates that the commit waits for fdatasync, guaranteeing the flush to persistent storage. For this demo, I used all default settings in MongoDB 8.0, but this behavior can still be tuned through write concern and journaling configurations.
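For example, the journal-sync requirement can be relaxed or enforced per operation through the write concern's j flag. A small mongosh illustration (weakening durability like this is a deliberate trade-off, not a recommendation):

// j: true  -> acknowledge only after the journal is synced (the behavior shown above).
// j: false -> the server may acknowledge before fdatasync, trading durability for latency.
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: 1, j: false } }
);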

I used GDB to examine the call stack. Alternatively, you can inject a delay with strace by adding this option: -e inject=fdatasync:delay_enter=5000000.

Look at the open source code

When calling fdatasync, errors can occur, and this may compromise durability if operations on the file descriptor continue (remember the PostgreSQL fsyncgate). MongoDB uses the open-source WiredTiger storage engine, which implemented the same solution as PostgreSQL to avoid that: panic instead of retry. You can review the os_fs.c code to verify this.

The fdatasync call is in the JournalFlusher thread and here is the backtrace:

#0  0x0000ffffa0b5ceec in fdatasync () from target:/lib64/libc.so.6
#1  0x0000aaaadf5312c0 in __posix_file_sync ()
#2  0x0000aaaadf4f53c8 in __log_fsync_file ()
#3  0x0000aaaadf4f58d4 in __wt_log_force_sync ()
#4  0x0000aaaadf4fb8b8 in __wt_log_flush ()
#5  0x0000aaaadf588348 in __session_log_flush ()
#6  0x0000aaaadf41b878 in mongo::WiredTigerSessionCache::waitUntilDurable(mongo::OperationContext*, mongo::WiredTigerSessionCache::Fsync, mongo::WiredTigerSessionCache::UseJournalListener) ()
#7  0x0000aaaadf412358 in mongo::WiredTigerRecoveryUnit::waitUntilDurable(mongo::OperationContext*) ()
#8  0x0000aaaadfbe855c in mongo::JournalFlusher::run() ()

Here are some entrypoints if you want to look at the code behind this:

Base your opinions on facts, not myths.

MongoDB began as a NoSQL database that prioritized availability and low latency over strong consistency. However, that was over ten years ago. As technology evolves, experts who refuse to constantly learn risk their knowledge becoming outdated, their skills diminishing, and their credibility suffering.
Today, MongoDB is a general-purpose database that supports transaction atomicity, consistency, isolation, and durability—whether the transaction involves a single document, or multiple documents.

Next time you encounter claims from the ill-informed or from detractors suggesting that MongoDB is not consistent or fails to flush committed changes to disk, you can confidently debunk these myths by referring to the official documentation and the open source code, and by conducting your own experiments. MongoDB is similar to PostgreSQL: buffered writes and WAL sync to disk on commit.

Want to meet people, try charging them for it?

I have been blogging consistently since 2017. And one of my goals in speaking publicly was always to connect with like-minded people. I always left my email and hoped people would get in touch. Even while my blog and twitter became popular, passing 1M views and 20k followers, I basically never had people get in touch to chat or meet up.

So it felt kind of ridiculous when last November I started charging people $100 to chat. I mean, who am I? But people started showing up almost immediately. Now, granted, the money did not go to me. It went to an education non-profit, and I merely received the receipt.

And at this point I've met a number of interesting people, from VCs to business professors to undergraduate students to founders and everyone in between. People wanting to talk about trends in databases, about how to succeed as a programmer, about marketing for developers, and so on. Women and men throughout North America, Europe, Africa, New Zealand, India, Nepal, and so on. And I've raised nearly $6,000 for educational non-profits.

How is it that you go from giving away your time for free and getting no hits to charging and almost immediately getting results? For one, every person responded very positively to it being a fundraiser. It also helps me be entirely shameless about sharing on social media every single time someone donates, because it's such a positive thing.

But also I think that in "charging" for my time it helps people feel more comfortable about actually taking my time, especially when we have never met. It gives you a reasonable excuse to take time from an internet rando.

On the other hand, a lot of people come for advice, and I think giving advice is pretty dangerous, especially since my background is not super conventional. I try to frame everything as just my opinion and my perspective, and to make clear that they should talk with many others and not take my suggestions without consideration.

And there's also the problem that by charging everyone for my time now, I'm no longer available to people who could maybe use it the most. I do mention on my page that I will still take calls from people who don't donate, as my schedule allows. But to be honest I feel less incentivized to spend time when people do not donate. So I guess this is an issue with the program.

But I mitigated even this slightly, and significantly jump-started the program, on my 30th birthday, when I took calls with anyone who donated at least $30.

Anyway, I picked this path because I have wanted to get involved with helping students figure out their lives and careers. But without a degree I am literally unqualified for many volunteering programs. And I always found the time commitments for non-profits painful.

So until starting this I figured it wouldn't be until I retired that I'd find some way to make a difference. But ultimately I kept meeting people who were starting their own non-profits now or donating significantly to help students. Peer pressure. I wanted to do my part now. And 30 minutes of my time in return for a donation receipt has been an easy trade.

While it has raised only a humble $6,000 to date, the Chat for Education program has been more successful than I imagined. I've met many amazing people through it. And it's something that should be easy to keep up indefinitely.

I hope to meet you through it too!

June 27, 2025

Supercharging AWS database development with AWS MCP servers

Amazon Aurora, Amazon DynamoDB, and Amazon ElastiCache are popular choices for developers powering critical workloads, including global commerce platforms, financial systems, and real-time analytics applications. To enhance productivity, developers are supplementing everyday tasks with AI-assisted tools that understand context, suggest improvements, and help reason through system configurations. Model Context Protocol (MCP) is at the helm of this revolution, rapidly transforming how developers integrate AI assistants into their development pipelines. In this post, we explore the core concepts behind MCP and demonstrate how new AWS MCP servers can accelerate your database development through natural language prompts.

Managing PostgreSQL on Kubernetes with Percona Everest’s REST API

I’ve been working with Kubernetes and databases for the past few months, and I’m enjoying learning and exploring more about Percona Everest’s features. Percona Everest is a free, open source tool that makes it easier for teams to manage databases in the cloud. In a Cloud Native world, everything is programmable, including databases. Percona Everest […]

June 26, 2025

Percona XtraDB Cluster: Our Commitment to Open Source High Availability

At Percona, we’ve always been dedicated to providing robust, open source solutions that meet our users’ evolving needs. Percona XtraDB Cluster (PXC) stands as a testament to this commitment, offering a highly available and scalable solution for your MySQL and Percona Server for MySQL deployments. We understand that database high availability is critical for your […]

Scaling Smarter: What You Have Missed in MongoDB 8.0

MongoDB has always made it relatively easy to scale horizontally, but with version 8.0, the database takes a significant step forward. If you’re working with large datasets or high-throughput applications, some of the changes in this release will make your life a lot easier — and your architecture cleaner. Let’s take a look at some […]

June 25, 2025

Building a job search engine with PostgreSQL’s advanced search features

In today’s employment landscape, job search platforms play a crucial role in connecting employers with potential candidates. Behind these platforms lie complex search engines that must process and analyze vast amounts of structured and unstructured data to deliver relevant results. This post explores how to use PostgreSQL’s search features to build an effective job search engine. We examine each search capability in detail, discuss how they can be combined in PostgreSQL, and offer strategies for optimizing performance as your search engine scales.

Using Percona Everest Operator CRDs to Manage Databases in Kubernetes

Percona Everest is a free and open source tool for running and managing databases like PostgreSQL, MySQL, and MongoDB inside Kubernetes. It simplifies things by providing three ways to work with your databases: a web interface (UI), a set of commands (API), and direct access through Kubernetes itself using built-in tools like kubectl. > Note: […]

Build a Personalized AI Assistant with Postgres

Learn how to build a Supabase-powered AI assistant that combines PostgreSQL with scheduling and external tools for long-term memory, structured data management, and autonomous actions.

Use CedarDB to search the CedarDB docs and blogs

Motivation

Not so long ago, I shared that I have an interest in finding things, and in that case the question was about where something could be found. Another common requirement is, given some expression of an interest, finding the set of documents that best answers it. For example, coupled with the geospatial question, we might add that we're looking for Indian restaurants within the specified geographic area.

For this article, though, we’ll restrict the focus to the problem of finding the most relevant documents within some collection, where that collection just happens to be the CedarDB documentation. To that end, I’ll assert up front that my query “Does the CedarDB ‘asof join’ use an index?” should return a helpful response, while the query “Does pickled watermelon belong on a taco?” should ideally return an empty result.