September 11, 2025
September 09, 2025
Beyond EOL: The Real Benefits of Upgrading to MySQL 8.4
OrioleDB Patent: now freely available to the Postgres community
September 08, 2025
Resilience of MongoDB's WiredTiger Storage Engine to Disk Failure Compared to PostgreSQL and Oracle
There have been jokes that have contributed to persistent myths about MongoDB's durability. The authors of those myths ignore that MongoDB's storage engine is among the most robust in the industry. MongoDB uses WiredTiger (created by the same author as Berkeley DB), which provides block corruption protection stronger than that of many other databases. In this article, I'll show how to reproduce a simple write loss, in a lab, and see how the database detects it to avoid returning corrupt data.
I like comparing the trade-offs and implementations of various databases, but I stick to discussing those I'm familiar with. Let's deliberately corrupt PostgreSQL, Oracle, and MongoDB at the storage level to see what happens. This experiment can be reproduced in a lab and your comments are welcome.
PostgreSQL
To expose the issue when a database doesn't detect lost writes, I chose PostgreSQL for this demonstration. As of version 18, PostgreSQL enables checksums by default. I'm testing it with the Release Candidate in a Docker lab:
docker run --rm -it --cap-add=SYS_PTRACE postgres:18rc1 bash
# Install some utilities
apt update -y && apt install -y strace
# Start PostgreSQL
POSTGRES_PASSWORD=x \
strace -fy -e trace=pread64,pwrite64 \
docker-entrypoint.sh postgres &
# Connect to PostgreSQL
psql -U postgres
I've started PostgreSQL, tracing the read and write calls with strace.
I check that data checksums are enabled:
postgres=# show data_checksums;
data_checksums
----------------
on
(1 row)
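As a side note, clusters created before version 18 (or initialized without checksums) can have checksums turned on explicitly. A minimal sketch, assuming a cleanly stopped cluster and $PGDATA pointing at its data directory:
# enable data checksums at cluster creation time
initdb --data-checksums -D "$PGDATA"
# or enable them later, offline, on an existing cluster (PostgreSQL 12 and later)
pg_checksums --enable -D "$PGDATA"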
I create a demo table and insert random data:
create table demo (k int primary key, v text);
copy demo from program $$
cat /dev/urandom |
base64 |
head -10000 |
awk '{print NR"\t"$0}'
$$ ;
vacuum demo;
checkpoint;
create extension pg_buffercache;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
I triggered a checkpoint to write the pages to disk and evicted the shared buffers, so that I can see the read I/O for the next query:
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+------------------------------------------------------------------------------
999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)
This has read two pages from the index, and then the page that contains the row I'm querying. This page is in base/5/16388 at offset 114688.
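As a quick cross-check (a sketch run in the same psql session), the block number can also be derived from the row's ctid: 114688 / 8192 = block 14.
-- pg_relation_filepath returns the file seen in the strace output,
-- and the first component of ctid is the block number (file offset / 8192)
select pg_relation_filepath('demo') as file,
       (ctid::text::point)[0]       as block_number
from demo where k = 999;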
I use dd to save the content of this page, then update row 999, trigger a checkpoint, and flush the buffer cache:
\! dd if=/var/lib/postgresql/18/docker/base/5/16388 of=block1.tmp bs=1 skip=114688 count=8192
update demo set v='xxxxxx' where k=999;
checkpoint;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
I query the row again, and it shows the updated value:
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10m\261\1K~\0\08\1\200\1\0 \4 \2\3\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+--------
999 | xxxxxx
(1 row)
It reads the same index pages, and the table page at offset 114688 now contains the new value.
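Optionally, the pageinspect contrib extension can list the line pointers of block 14; a small sketch, which should show both the old and the new version of row 999 in the same page:
create extension if not exists pageinspect;
-- dump the line pointers and tuple headers of block 14 of the demo table
select lp, lp_flags, t_xmin, t_xmax, t_ctid
from heap_page_items(get_raw_page('demo', 14));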
To simulate disk corruption, I copy the previously saved block back over this location, and query again:
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=114688 conv=notrunc
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+------------------------------------------------------------------------------
999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)
There's no error because the restored block has a correct checksum: it is a valid, well-formed block from the same table, just not the current version of the page. However, it shows a row value that should no longer be there. This is the kind of error that can happen when the storage misses a write, leaving a stale version of a block in place, or writes a valid block to the wrong location.
Checksums are still useful when the corruption is not aligned with well-formatted blocks. For example, I overwrite the second half of the page with the first half of the previously saved block:
checkpoint;
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=118784 count=4096 conv=notrunc
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
2025-09-08 17:58:41.876 UTC [161] LOG: page verification failed, calculated checksum 20176 but expected 45250
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] LOG: invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] ERROR: invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
ERROR: invalid page in block 14 of relation "base/5/16388"
Here, the calculated checksum does not match the one stored in the page, and an error is raised. PostgreSQL checksums can detect some block corruption, but a bug, or a malicious user with access to the filesystem, can still change the data without being detected.
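As another side note, checksums can also be verified offline with the pg_checksums tool, which scans every block of a cleanly stopped cluster. A minimal sketch, assuming the data directory seen in the traces above:
# the server must be stopped cleanly before running this scan
pg_checksums --check -D /var/lib/postgresql/18/docker
Like the read-time verification, this scan catches blocks whose checksum no longer matches their content, but it would not flag the relocated-but-valid block from the first experiment.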
Oracle Database
To detect lost writes like the one I simulated above, Oracle Database compares the block checksum with the standby databases, as there is a low chance that the corruption happened in both environments. I've demonstrated this in the past with a similar demo: 18c new Lost Write Protection
WiredTiger (MongoDB storage engine)
MongoDB employs the WiredTiger storage engine, which is designed to prevent lost writes and detect disk failures that might return the wrong page. To achieve this, WiredTiger stores a checksum alongside each page address within the pointers between the B-tree pages, in an address cookie:
An address cookie is an opaque set of bytes returned by the block manager to reference a block in a B-tree file. It includes an offset, size, checksum, and object id.
In my lab, I first start a MongoDB container and compile wt, a command-line utility that allows direct interaction with WiredTiger files. This tool enables me to examine the storage engine without relying on the MongoDB query layer, and I'll use it in this series of blog posts:
docker run --rm -it --cap-add=SYS_PTRACE mongo bash
# install required packages
apt-get update && apt-get install -y git xxd strace curl jq python3 python3-dev python3-pip python3-venv python3-bson build-essential cmake gcc g++ libstdc++-12-dev libtool autoconf automake swig liblz4-dev zlib1g-dev libmemkind-dev libsnappy-dev libsodium-dev libzstd-dev
# get latest WiredTiger
curl -L $(curl -s https://api.github.com/repos/wiredtiger/wiredtiger/releases/latest | jq -r '.tarball_url') -o wiredtiger.tar.gz
# Compile
mkdir /wiredtiger && tar -xzf wiredtiger.tar.gz --strip-components=1 -C /wiredtiger ; cd /wiredtiger
mkdir build && cmake -S /wiredtiger -B /wiredtiger/build \
-DHAVE_BUILTIN_EXTENSION_SNAPPY=1 \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_WERROR=0 \
-DENABLE_QPL=0 \
-DCMAKE_C_FLAGS="-O0 -Wno-error -Wno-format-overflow -Wno-error=array-bounds -Wno-error=format-overflow -Wno-error=nonnull" \
-DPYTHON_EXECUTABLE=$(which python3)
cmake --build /wiredtiger/build
export PATH=$PATH:/wiredtiger/build:/wiredtiger/tools
It takes some time, but the wt command line utility will make the investigation easier. That's an advantage of MongoDB pluggable storage—you can examine it in layers.
I create a demo table and insert 10,000 records:
root@7d6d105a1663:/tmp# wt create table:demo
root@7d6d105a1663:/tmp# ls -alrt
total 68
drwxr-xr-x. 1 root root 4096 Sep 8 19:21 ..
-rw-r--r--. 1 root root 21 Sep 8 19:35 WiredTiger.lock
-rw-r--r--. 1 root root 50 Sep 8 19:35 WiredTiger
-rw-r--r--. 1 root root 299 Sep 8 19:35 WiredTiger.basecfg
-rw-r--r--. 1 root root 4096 Sep 8 19:35 demo.wt
-rw-r--r--. 1 root root 4096 Sep 8 19:35 WiredTigerHS.wt
-rw-r--r--. 1 root root 1475 Sep 8 19:35 WiredTiger.turtle
-rw-r--r--. 1 root root 32768 Sep 8 19:35 WiredTiger.wt
drwxrwxrwt. 1 root root 4096 Sep 8 19:35 .
root@7d6d105a1663:/tmp# wt list
colgroup:demo
file:demo.wt
table:demo
cat /dev/urandom |
base64 |
head -10000 |
awk '{print "i",NR,$0}' |
wt dump -e table:demo
...
Inserted key '9997' and value 'ILZeUq/u/ErLB/i7LOUb4nwYP6D535trb8Mt3vcJXXRAqLeAiYIHn5bEWs1buflmiZMYd3rMMvhh'.
Inserted key '9998' and value 'y+b0eTV/4Ao12qRqtHhgP2xGUr+C9ZOfvOG3ZwbdDNXvpnbM1/laoJ9Yzyt6cbLJOR6jdQktpgFM'.
Inserted key '9999' and value 'cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c'.
Inserted key '10000' and value 'QYwyjaRxa9Q+5dvzwQtvv2QE/uS/vhRPCVsQ6p7re/L2yDrVRxyqkvSyMHeRCzMIsIovrCUJpPXI'.
I read record 9999 with wt and use strace to see the read calls:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 806912) = 28672
9999
cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c
...
This record is stored at offset 806912, in a 28672-byte block. I save this block with dd:
dd if=demo.wt of=block1.tmp bs=1 skip=806912 count=28672
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0680832 s, 421 kB/s
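Out of curiosity, the header of the saved block can be inspected right away; the four bytes starting at offset 32 should be the block checksum, 1f 63 77 41, which is also visible at the end of the 36-byte read trace above:
# show the first 36 bytes of the saved block, ending with the checksum field
xxd -l 36 block1.tmp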
I update this record to "xxxxxx" and trace the write calls:
strace -yy -e trace=pwrite64 -xs 36 wt dump -e table:demo <<<"u 9999 xxxxxx"
...
Updated key '9999' to value 'xxxxxx'.
pwrite64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
pwrite64(6</tmp/demo.wt>, "\x00...\x4d\x19\x14\x4e"..., 4096, 876544) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x4f\x74\xdb\x1e"..., 4096, 880640) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x21\xcc\x25\x06"..., 4096, 884736) = 4096
...
This writes a new 28672-byte block (WiredTiger does not write in place, which helps avoid corruption) and updates the B-tree branch pages.
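The traces show that the new version of the leaf went to offset 847872 rather than over the old block at offset 806912. As a quick sanity check (a sketch, assuming the verify command is available in this wt build):
# the data file grows instead of being rewritten in place
ls -l demo.wt
# read every block of the table and validate checksums and structure
wt verify table:demo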
I can read this new value:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
9999
xxxxxx
...
To simulate disk corruption, I do the same as I did on PostgreSQL: replace the current block with the old one. I save the current block before overwriting it:
dd if=demo.wt of=block2.tmp bs=1 skip=847872 count=28672
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0688249 s, 417 kB/s
dd of=demo.wt if=block1.tmp bs=1 seek=847872 conv=notrunc
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0666375 s, 430 kB/s
If I try to read the record in this block, the corruption is detected:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 847872) = 28672
[1757361305:392519][8246:0x7fe9a087e740], wt, file:demo.wt, WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __wti_block_read_off, 279: demo.wt: potential hardware corruption, read checksum error for 28672B block at offset 847872: block header checksum of 0x4177631f doesn't match expected checksum of 0xf51fb102
[1757361305:392904][8246:0x7fe9a087e740], wt, file:demo.wt, WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __bm_corrupt_dump, 86: {0: 847872, 28672, 0xf51fb102}: (chunk 1 of 28): 00 00 00 00 00 00 00 00 3a 00 00 00 00 00 00 00 40 6e 00 00 a8 02 00 00 07 04 00 01 00 70 00 00 1f 63 77 41 01 00 00 00 11 39 36 39 33 80 8c 43 38 30 58 51 66 64
...
Even though the checksum is correct for the block itself, WiredTiger detected that the block's header checksum, 0x4177631f (visible as 1f 63 77 41 in the hexadecimal dump, or \x1f\x63\x77\x41 in the read trace), differs from the expected checksum, 0xf51fb102, stored in the address cookie.
The expected 0xf51fb102 was visible as \x02\xb1\x1f\xf5 in the write call of the update, and appears as 02 b1 1f f5 in the block I saved before overwriting it:
root@7d6d105a1663:/tmp# xxd -l 36 block2.tmp
00000000: 0000 0000 0000 0000 5c00 0000 0000 0000 ........\.......
00000010: f96d 0000 a802 0000 0704 0001 0070 0000 .m...........p..
00000020: 02b1 1ff5
Even with access to the files, it would be extremely difficult to corrupt the data in an undetected way, because any change must update the block's checksum, and that checksum is itself referenced by the address cookies stored in other blocks. Undetected corruption is highly unlikely: blocks are not updated in place, and a failure to write a block would break the pointers.
WiredTiger is open source, and you can check the WT_BLOCK_HEADER definition. In this structure, the block size (disk_size) field appears just before the checksum field: for example, 00 70 00 00 = 0x00007000 = 28,672 bytes, followed by the checksum 02 b1 1f f5 = 0xf51fb102. One advantage of WiredTiger is that B-tree leaf blocks can have flexible sizes, which MongoDB uses to keep each document as one chunk on disk and improve data locality.
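Both fields can be read back from the saved block with xxd in little-endian mode; a small sketch, assuming the block header starts at byte 28, right after the page header, as in the dump above:
# two little-endian 32-bit words starting at byte 28 of the saved block:
# disk_size (00007000 = 28672), followed by the checksum (f51fb102)
xxd -e -g 4 -s 28 -l 8 block2.tmp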
Checksum verification is implemented in block_read.c and performs two validations:
- It checks that the checksum stored in the block header matches the expected checksum from the address cookie (the B-tree pointer created when the block was written).
- It zeroes out the checksum field in the header and recomputes the checksum over the block content, verifying it also matches the expected checksum. This ensures both the block’s integrity and its identity.
Conclusion
PostgreSQL requires you to enable checksums to detect data corruption. This detects when a page’s checksum does not match its content. However, if the system erroneously writes a different, but valid, block from the same table in place of the intended one, or misses a write and the previous version of the block remains, PostgreSQL cannot identify this issue. As a result, some disk failures may escape detection and return wrong results.
Oracle Database stores a checksum in each block and can be configured to verify it on read. With a Data Guard standby and some network overhead, the database can transmit checksums over the network to verify data integrity when reading.
MongoDB WiredTiger enables checksums by default and can detect the wrong blocks without the need to contact replicas. It embeds the expected checksum inside the B-tree address cookie so that every internal B-tree pointer to a leaf page includes the checksum for the referenced page. If an obsolete or different page is swapped in, any mismatch will be detected because the pointer's checksum won’t match. WiredTiger uses copy-on-write, not in-place overwrites, further reducing the risk of corruption.
Here is a description of WiredTiger by Keith Bostic:
And an article by Bob Liles about how MongoDB Atlas deals with that: Managing Data Corruption in the Cloud
Disaggregation: A New Architecture for Cloud Databases
This short VLDB'25 paper surveys disaggregation for cloud databases. It has several insightful points, and I found it worth summarizing.
The key advantage of the cloud over on-prem is elastic scalability: users can scale resources up and down and pay only for what they use. Traditional database architectures, like shared-nothing, do not fully exploit this. Thus, cloud-native databases increasingly adopt disaggregated designs.
Disaggregation is primarily motivated by the asymmetry between compute and storage:
- Compute is far more expensive than storage in the cloud.
- Compute demand fluctuates quickly; storage grows slowly.
- Compute can be stateless and easier to scale, while storage is inherently stateful.
Decoupling them lets compute scale elastically while storage remains relatively stable and cheap.
Review of Disaggregation in the Clouds
Early cloud-native systems like Snowflake and Amazon Aurora separate compute and storage into independent clusters. Modern systems push disaggregation further. Socrates splits storage into three services: Logging service (small footprint, strict latency), Page cache, and Durable page store. This way each service can be tuned for its performance/cost tradeoffs. The logging service, for example, can use faster storage hardware.
Other disaggregation examples include computation pushdown (Redshift Spectrum, S3 Select), intermediate caching (Snowflake), a metadata service (Lakehouse), and memory disaggregation (PolarDB). Many other functions (indexing, concurrency control, query optimization) remain underexplored. There is room for a unified middleware layer between compute and storage that can consolidate these.
The paper doesn't mention it, but this discussion mirrors the microservices trend in systems design. Breaking monoliths into smaller, independently scalable services improves modularity and resource efficiency, and also enables better sharing and pooling of resources across workloads. We may see disaggregated databases evolve the way microservices did: first a simple split, then a bunch of fine-grained services, and eventually a need for orchestration layers, observability, and service meshes. Today it is compute and storage; tomorrow it could be dozens of database microservices (maybe even including concurrency control), stitched together by a middleware layer that looks suspiciously like Kubernetes.
Tradeoffs in disaggregated design
The main tradeoff is performance. Since disaggregated components are physically separate, the communication overhead can be high. A 2019 study shows a 10x throughput hit compared to a tuned shared-nothing system. Optimizations can help narrow the gap, but disaggregation should be applied only when its benefits outweigh the network cost. This tradeoff also motivates research into techniques that reduce communication overhead. Hello, distributed systems research!
Rethinking Core Protocols
Many distributed database protocols assume shared-nothing architecture, so with disaggregation, some of these assumptions no longer hold. This creates new opportunities, and not just more problems.
For example, 2PC normally faces a blocking problem because a failed node's log is inaccessible. With disaggregated storage, logs live in a shared, reliable service. Cornus 2PC protocol (2022) leverages this: active nodes can vote NO on behalf of failed nodes by writing directly to their logs. A compare-and-swap API ensures only one decision is recorded.
Disaggregating the Query Engine
Pushdown reduces data movement by executing operators closer to storage. This idea has been studied in database machines, Smart SSDs, and PIM, but fits especially well in the cloud. Leveraging this, PushdownDB uses S3 Select to push down both basic and advanced operators. It cuts query time 6.7x and cost 30%. FlexPushdownDB combines pushdown with caching so that operators like filters or hash probes can run locally on cached data and remotely via pushdown, with results merged. This hybrid mode outperforms either technique alone by 2.2x.
Enabling New Capabilities and Embracing New Hardware
Modern applications want queries to reflect the latest transactions, not data from hours ago. HTAP systems support this but require migration to new engines. Disaggregated architectures offer an opportunity here, and Hermes (VLDB'25) exploits this by placing itself between the compute and storage. Hermes intercepts transactional logs and analytical reads, merging recent updates into queries on the fly. Updates are later batched into stable storage.
Disaggregation also motivates and eases the adoption of new hardware. Different components can use specialized hardware such as GPUs, RDMA, or CXL to achieve the best cost-performance tradeoff. The paper cites a GPU-based DuckDB engine (VLDB'25) that achieves large speedups by exploiting parallelism.
Discussion
So this paper mentions several new research directions. What should systems researchers in academia work on?
Here is what I think would be an impactful work. Take a monolithic database (like Postgres, RocksDB, or MySQL) and transform it to a disaggregated database. But not just for shits and giggles. Study the efficiency tradeoffs in alternative designs. Also study the software engineering tradeoffs, cost-to-production tradeoffs, resilience tradeoffs and metastability risks. Compare and contrast different transformation paths. This would provide a good roadmap (and paths to avoid minefields) for many databases that consider a similar redesign/implementation. Last year I had reviewed this paper which made a small crack at this, but I believe there is a lot of research and work here to be tackled.
For distributed protocol designers, the rethinking core protocols section provides a good blueprint. Pick other protocols, such as consensus, leader election, replication, caching, and revisit them in the disaggregated setting, and consider the new opportunities that open up alongside the challenges introduced.
Swimming with Sharks: Analyzing Encrypted Database Traffic Using Wireshark
Directly query the underlying ClickHouse database in Tinybird via the native HTTP interface
Our Myrtle Beach vacation
This year was a bastard. Not from work. God, no, I find work relaxing. Reading papers, inventing algorithms, ripping apart distributed systems with TLA+ models -- that's how I have fun. I can do that all day with a grin on my face. But the minute I need to do adulting (like simply calling the cable company to ask why they keep increasing our bill when I'm not looking), I stress and procrastinate for weeks. And this year, I had a lot of adulting to do to put our house on the market and plan a move to California, all while juggling three kids' school and EC activities. I was pretty stressed most of the time, and I've been grinding my teeth at night like a mule chewing rocks.
Anywho, we botched our great escape to California. The house didn't sell as quickly as we hoped it would, and we are stuck in Buffalo for another year. Summer disappeared in a cloud of errands and disappointment, and suddenly it was late August, with our kids twitching with pre-school nerves. There was still some time left to salvage the wreck. We needed a beach, any beach. Myrtle Beach! We had heard good things about it. One of my friends called it the Redneck Riviera. Good enough for me, and far enough from Buffalo, so we decided to give it a try.
Planning is not my strong suit. My wife took the reins. She scoured hotel reviews like a CIA interrogator, picking through tales of bedbugs, mold, and broken elevators. She has a radar for doom. Without her, I'd have booked us straight into some cockroach casino on the boardwalk. But she nailed it. Ten days before departure, she locked down an Airbnb room inside a proper resort hotel, facing the ocean. We chose Airbnb for the better customer service and because the photos showed the exact floor and view we would get. Had we booked with the resort directly, there would have been no telling which floor or room we would end up with.
The best thing about the vacation is anticipation and the wait. We counted down the days, giddy with excitement.
And then, the drive. Always the drive. That is how the Demirbas family rolls: no planes if at all possible. Planes are a scam. For five people, it's bankruptcy on wings. You waste a whole day shuffling through TSA lines, just to pray nervously that your plane doesn't get canceled on you, and, if you are lucky, to sit in a recycled-air canister for hours. We once drove from Buffalo to Seattle, and back. And another time to Florida and back. For us, seven hours on asphalt, Buffalo to Boston, is a warm-up lap. Myrtle Beach was thirteen. Still doable. Just load the audiobooks, clamp the Bose headphones on my head, and hit the highway. Driving is my meditation: the road pours itself under my car, like some childhood arcade game where the scenery and the other cars scroll past for hours as I nudge the steering wheel left and right to keep up.
We left Buffalo at 8:30 in our tightly packed Highlander. By noon, the youngest announced that she hadn't hit the bathroom that morning, and we stopped at a McDonald's a little south of Pittsburgh. We mostly pass as a normal American family, but at this stop we stood out like a sore thumb. We received a lot of cold blond stares. I understood why later, after we drove another 30 minutes: the barns started shouting TRUMP in dripping paint, and we entered West Virginia. God's Country, they call it. Heaven on earth. But it was just some green hills, and the scenery didn't impress me much.
Our next stop was a rest area in North Carolina, which turned out to be the cleanest, most immaculate rest area I'd ever seen. Somebody in Raleigh must be laundering money through landscaping contracts, but damn if it didn't impress us. Even the butterflies were impressive!
Then Myrtle Beach: 85-degree weather, ocean air, a great view from our flat, and a nice waterpark at the resort. Southern hospitality is real; everyone was smiling. Compared to the winter-scarred faces in Buffalo, it felt like stepping onto another planet. The Carolinas had already shoved their kids back into classrooms, so we owned the pools and ruled the lazy river. The kids tore through the slides. I soaked in the jacuzzi like a tired warrior. At night we binge-watched Quantum Leap. We would have watched during the day too, but my wife dragged us out for beach walks and waterpark raids. Sometimes we need the push.
By the third day, the spell had taken hold. I started to relax. Sleep came easy, deeper than at home. The wave sounds and the sea view worked their magic. Staying right on the beach was great. No hauling gear, no logistics. Step out the door, fall into the ocean, and crawl back to the flat when you're cooked. The flat was clean, spacious, and blessed with a kitchen, so we could gorge on comfort food without shame.
We wondered if we had made a mistake by picking a resort 4-5 miles north of the Boardwalk. When we visited the Boardwalk on the third day, we realized it was overrated anyway. It was full of tourist-trap shops and neon lights, and featured a SkyWheel, which we didn't bother to try. We didn't need the Boardwalk. Myrtle Beach itself is the show: the waves, the horizon, and the beach.
Of course, I had to ruin myself. The kids used sunscreen like sensible citizens, and I, an idiot heir to Turkish tanning lore, slathered on olive oil (which I swiped from our kitchen). If it fries an egg, it'll bronze a body, right? Well... I roasted into a lobster, alright... But I ended up slowly shedding skin like a reptile for days afterwards.
The drive back was clean. Salem's Grill in Pittsburgh was our mandatory detour. You go to great food, if great food doesn't get to you. We hit it before 7pm and dined like kings until closing at 8pm. We were back home before midnight. Uneventful driving, the way I like it.
But vacations are lies, sweet lies. Within days the teeth grinding returned. The adult machinery reloaded with forms to sign, kids to shuttle, bills to pay. Adulting feels like having to deal with a constant deluge of junk mail and random chores from the universe.
And, then the saddest part... we will be shipping Ahmet to college. He leaves for Caltech soon (must be his mother's genes). I am proud, of course, but I will miss him a lot. I bought the plane ticket yesterday after weeks of pretending I didn't have to. Kids grow fast. Too fast... It isn't fair.
Building a DOOM-like multiplayer shooter in pure SQL
DOOMQL: A DOOM-like multiplayer shooter in pure SQL
I recently stumbled across Patrick’s excellent DOOM clone running in a browser powered by DuckDB-WASM. Ever since I read that, I have wanted to push his awesome idea to the logical extreme: build a multiplayer DOOM-like shooter entirely in SQL, with CedarDB doing all the heavy lifting. During a month of parental leave (i.e., a lot of sleepless nights), I tried exactly that.
Here’s a sneak peek at DOOMQL:
September 06, 2025
Oldest recorded transaction
September 05, 2025
Automating vector embedding generation in Amazon Aurora PostgreSQL with Amazon Bedrock
Group database tables under AWS Database Migration Service tasks for PostgreSQL source engine
September 03, 2025
Recent Reads (September 25)
Small Gods (1992)
I absolutely loved Small Gods. Pratchett takes on religion, faith, and power. The story follows Om, a god trapped in the body of a tortoise, who has only one true believer left: a novice named Brutha. The central premise is that gods and mythical beings exist because people believe in them, and their power fades as belief fades.
There’s no point in believing in things that exist.
The book is funny, clever, and surprisingly philosophical. Pratchett skewers organized religion, but he also asks bigger questions: What is faith? What is belief? How do institutions shape people, and how do people shape institutions? It's satire, but not heavy-handed. Like Vonnegut, he writes with a wink, yet there's real depth under the jokes.
Gods don't like people not doing much work. People who aren't busy all the time might start to think.
Why not? If enough people believe, you can be god of anything….
The figures looked more or less human. And they were engaged in religion. You could tell by the knives (it's not murder if you do it for a god).
The trouble with being a god is that you've got no one to pray to.
I came across Small Gods after reading Gaiman's American Gods (2001), which credits it as inspiration. Both explore gods and belief, but Pratchett's is lighter, sharper, and full of characters you actually care about. The audiobook is narrated by Andy Serkis, and he's brilliant. My precious!
Some more quotes from the book:
"But is all this true?" said Brutha. / Didactylos shrugged. "Could be. Could be. We are here and it is now. The way I see it is, after that, everything tends towards guesswork." / "You mean you don't KNOW it's true?" said Brutha. / "I THINK it might be," said Didactylos. "I could be wrong. Not being certain is what being a philosopher is all about."
What have I always believed? That on the whole, and by and large, if a man lived properly, not according to what any priests said, but according to what seemed decent and honest inside, then it would, at the end, more or less, turn out all right.
"Sometimes the crime follows the punishment, which only serves to prove the foresight of the Great God." / "That's what my grandmother used to say," said Brutha automatically. / "Indeed? I would like to know more about this formidable lady." / "She used to give me a thrashing every morning because I would certainly do something to deserve it during the day," said Brutha. / "A most complete understanding of the nature of mankind."
Probably the last man who knew how it worked had been tortured to death years before. Or as soon as it was installed. Killing the creator was a traditional method of patent protection.
Last night there seemed to be a chance. Anything was possible last night. That was the trouble with last nights. They were always followed by this mornings.
The Turtle Moves!
I Shall Wear Midnight (2010)
I had read The Shepherd's Crown earlier, and it made me a witch lover. I loved this book too. Pratchett's witches are sharp, strong, and unforgettable. Tiffany Aching, the protagonist, is brave, clever, and endlessly practical: “I make it my business. I'm a witch. It's what we do. When it's nobody else's business, it's my business.”
Pratchett's prose is excellent and witty. He makes you laugh and think at the same time. Lines like “You've taken the first step.” / “There's a second step?” / “No; there's another first step. Every step is a first step if it's a step in the right direction” stayed with me. There’s so much care in how he builds characters, their choices, and their world. The book is about people, not the fantasy world, and that is Pratchett's genius. The witches, the townsfolk, even the Nac Mac Feegles ... they all feel alive.
I also loved how the story quietly mirrors what researchers like us do: “We look to the edges. There’s a lot of edges, more than people know. Between life and death, this world and the next, night and day, right and wrong … an’ they need watchin’. We watch ’em, we guard the sum of things.” That is the witches' charter, but also a motto for formal methods researchers, or anyone keeping an eye on the boundaries of a complex system.
Another great line: “Well, as a lawyer I can tell you that something that looks very simple indeed can be incredibly complicated, especially if I'm being paid by the hour. The sun is simple. A sword is simple. A storm is simple. Behind everything simple is a huge tail of complicated.”
Quantum Leap (TV series 2022)
As a kid growing up in Turkey, I watched the original Quantum Leap, and it felt magical. We were filled with wildly incorrect optimism about science and technology, but it felt inspiring. Some lies are useful. The reboot was fun to watch with my kids too. Raymond Lee as Ben Song is a strong lead: he's the physicist who ends up stuck leaping into the bodies of other people. Caitlin Bassett guides him as his fiancée and observer.
The show is well-paced and fun for family viewing. Many episodes lean into socially conscious themes, which I appreciated, but at times it becomes unintentionally offensive: portraying “dumb Americans” with exaggerated Southern accents or mocking Salem settlers in ways that feel worse than silly. The writers clearly aimed for humor or commentary, but the execution backfired. I won't spoil the full story, but the second season tries a pivot/twist, only to throw it out again. What were they thinking? Moments like this make me want to be a TV writer.
September 02, 2025
Automating Amazon RDS and Amazon Aurora recommendations via notification with AWS Lambda, Amazon EventBridge, and Amazon SES
Postgres 18 beta3, large server, sysbench
This post has performance results for Postgres 18 beta3, beta2, beta1, 17.5, and 17.4 using the sysbench benchmark and a large server. The working set is cached and the benchmark is run with high concurrency (40 connections). The goal is to search for CPU and mutex regressions. This work was done by Small Datum LLC and was not sponsored.
tl;dr
- There might be small regressions (~2%) for several range queries that don't do aggregation. This is similar to what I reported for 18 beta3 on a small server, but here it only occurs for 3 of the 4 microbenchmarks and on the small server it occurs on all 4. I am still uncertain about whether this really is a regression.
- x10b_c32r128 is functionally the same as x10a_c32r128 but adds io_method=sync
- x10d_c32r128 starts with x10a_c32r128 and adds io_method=io_uring
Benchmark
The tests are run using 8 tables with 10M rows per table. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.5)
Asymmetric Linearizable Local Reads
People want data fast. They also want it consistent. Those two wants pull in opposite directions. This VLDB'25 paper offers another take on this conundrum. Rather than assuming a symmetric network environment where all replicas face similar latencies, the paper emphasizes that in practice some replicas sit close to the leader while others are stranded halfway across the globe. By embracing this asymmetry, the authors propose two new algorithms: Pairwise-Leader (PL) and Pairwise-All (PA). Both cut read latency compared to prior approaches. PL could even achieve 50x latency improvements in some cases.
Aleksey and I did our usual thing. We recorded our first blind read of the paper. You can watch it here (link to come soon), if you like seeing two people puzzle through a paper in real time. I also annotated a copy while reading which you can access here.
We liked the ideas, even though the protocols themselves didn't thrill us particularly. I particularly liked finding another good example of the use of synchronized time in distributed database systems. Another study to add to my survey.
Background
At the heart of the problem is linearizability (see my explanation for a primer), a strong consistency condition that ensures operations on a distributed object appear to occur atomically in a single total order consistent with real time. If one operation finishes before another begins, then all replicas must reflect this order. This strong model spares developers from reasoning about concurrency anomalies: if you read after a write, you are guaranteed to observe it. That means there are no stale reads.
While linearizability makes life simple for the developers, it makes it harder for the system, which has to make sure every replica behaves like one machine. This is typically enforced by state machine replication (SMR) protocols such as Paxos or Raft, which order all operations through a leader. This works fine, except for reads. Reads dominate workloads in practice, and forcing every read to consult the leader or coordinate with a quorum introduces unnecessary WAN round trips. To improve efficiency, many systems have proposed linearizable local read algorithms, which allow followers to serve reads locally under certain conditions. However, this is tricky ground. Local reads introduce the risk of staleness, because that replica/follower has not yet applied the latest writes.
And here's the rub: WANs exacerbate the problem. Some replicas are close. Others are hopelessly far. The close ones see fresh data. The far ones lag. Prior work noticed this and tried various tricks to smooth out the unfairness. This paper doesn’t smooth it out. It embraces it as we will see in the coming sections.
The key problem addressed by the paper is: How can linearizable local reads be achieved in a system where replicas are asymmetric in their ability to keep up with the updates?
The Blocking Problem
Over the years, many protocols have been proposed for enabling linearizable local reads. The paper reviews these under three broad categories.
Invalidation-based algorithms (e.g., Megastore, PQL, Hermes): Replicas mark themselves invalid when they receive prepares and regain validity only after commits. But this can lead to unbounded blocking: If the leader issues prepares faster than it gathers acknowledgments in a write-heavy workload, the replicas stay perpetually invalid.
Eager Stamping algorithms (e.g., CHT): Reads are stamped with the latest prepare index, then block until the corresponding commits arrive. This avoids perpetual invalidation thanks to instance-based tracking of prepares, but it still results in blocking proportional to twice the leader's eccentricity.
Delayed Stamping algorithms (e.g., CockroachDB Global Tables): These use synchronized clocks to assign visibility windows. Here, blocking time depends on the clock skew bound Δ, which in the worst case is theoretically bounded by the relative network diameter. This worst-case bound does not apply if GPS clock synchronization (e.g., Google's Truetime or AWS Timesync) is available, so the paper, in order to make its case, assumes GPS-based synchronization is not available (which is actually pretty widely available; see https://muratbuffalo.blogspot.com/2024/12/utilizing-highly-synchronized-clocks-in.html).
OK, I owe you an explanation of the delayed stamping approach. But first, let me set this up using the paper's unifying stop/go events framework. I like this framing: when you name something, you own it. In this model, each index i on each replica has two events:
- A stop event at which the replica stops assigning reads to indices less than i.
- A go event at which the replica can safely serve reads at i.
This abstraction ultimately guarantees linearizability by ensuring that, across all replicas, all go events for i occur at or after all stop events for i. The framework is clean and clarifies why blocking occurs.
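To make the abstraction concrete, here is a minimal sketch of how a replica could track stop/go events per index. This is my own illustration, not code from the paper, and all names are made up.

class StopGoReplica:
    """Minimal sketch of the stop/go framework (my illustration, not the paper's code)."""
    def __init__(self):
        self.stop_index = 0   # highest i whose stop event fired: new reads are stamped at least this high
        self.go_index = 0     # highest i whose go event fired: stamps up to this may be served

    def on_stop(self, i):
        # stop event for i: stop assigning reads to indices less than i
        self.stop_index = max(self.stop_index, i)

    def on_go(self, i):
        # go event for i: it is now safe to serve reads stamped with i
        self.go_index = max(self.go_index, i)

    def stamp_read(self):
        # a read picks up the current stop index as its stamp ...
        return self.stop_index

    def can_serve(self, stamp):
        # ... and blocks until the go event for that stamp has fired
        return self.go_index >= stamp

The algorithms below differ only in what fires on_stop and on_go: message arrivals (Eager Stamping), clock readings (Delayed Stamping), or scheduled times (PL and PA). Linearizability holds as long as, across replicas, every go for index i happens at or after every stop for i.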
Now let's walk through Fig. 2 and explain the two approaches using the stop/go framework.
Eager Stamping (Fig. 2a). A follower stamps a read with the highest index i it has seen a prepare for (the stop event). It can only apply the read after it has received all commits up to i (the go event). The leader is special here: since it is always up to date, its stop and go events collapse into one.
Delayed Stamping (Fig. 2b). This approach uses synchronized clocks to decouple stop and go events from message arrival. When the leader accepts an update, it assigns it a future visibility time t+α, where the visibility delay α is derived from the estimated commit time of the update. Followers stop stamping reads with indices less than i once that visibility time has passed on their clocks. They then apply the read after an additional Δ (the clock skew/uncertainty) has elapsed. Unlike Eager Stamping, the leader does not have special stop and go events; it also follows this visibility rule. I talked about this earlier when explaining CockroachDB's global transactions and the future timestamping in Aurora Limitless and DSQL.
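Here is a hedged sketch of Delayed Stamping under my reading of Fig. 2b; the class, parameter names, and clock handling are mine, and Δ and α are treated as plain numbers.

class DelayedStampingReplica:
    """Sketch of Delayed Stamping (my reading of Fig. 2b); names are illustrative."""
    def __init__(self, delta):
        self.delta = delta        # Δ: assumed clock skew / uncertainty bound
        self.visibility = {}      # index i -> visibility time (t + α) chosen by the leader

    def on_prepare(self, i, visibility_time):
        # the leader assigned visibility_time = accept_time + α, where α is derived
        # from the estimated commit time of the update
        self.visibility[i] = visibility_time

    def stamp_read(self, now):
        # stop event for i: the local clock has passed i's visibility time, so reads
        # may no longer be stamped below i
        return max((i for i, t in self.visibility.items() if now >= t), default=0)

    def go_time(self, stamp):
        # go event for i: an extra Δ past the visibility time absorbs clock skew
        return self.visibility.get(stamp, 0) + self.delta

A read stamped with index i is served once the local clock reaches go_time(i) and the update at i has been applied locally.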
Table 2 summarizes the worst-case blocking time at followers for the three categories of algorithms mentioned above and the two new algorithms introduced in this work (which we explain next).
I would be remiss (as is the paper) if I did not emphasize a common flaw shared across all these algorithms: the leader in these algorithms requires acknowledgments from all nodes (rather than just a quorum) before it can commit a write! If you want a local linearizable read from a single node, the write protocol is forced into a write-all model to ensure that the one-node read quorum intersects with the write quorum. This design hurts both availability and tail latency for the algorithms in Table 2. In contrast, in our Paxos Quorum Reads (PQR 2019) work we avoided the write-all model: PQR used only LSN tracking and quorum acknowledgments, and needed no clocks. I discuss PQR at the end of this post.
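A quick back-of-the-envelope check of why this happens (my arithmetic, not the paper's): with the usual quorum intersection condition R + W > N, a read quorum of size one forces the write quorum to include every node.

# Quorum intersection check (illustrative numbers, not from the paper).
N = 5            # cluster size
R = 1            # local read: the read "quorum" is a single replica
W = N - R + 1    # smallest write quorum that still intersects every read quorum
assert W == N    # -> write-all: the leader must hear from all N nodes to commit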
Pairwise-Leader (PL)
The central idea of PL is to tailor blocking time to each replica's distance from the leader: replicas close to the leader see near-zero latency, while distant ones may fare worse than before. To get there, the paper first introduces a stepping-stone algorithm (Figure 3), which then leads to the full PL algorithm in Figure 4. The trick in the stepping stone is to deliberately delay prepare messages so that acknowledgments from all replicas reach the leader at the same time, which reduces blocking for followers closer to the leader. Specifically, this ensures that for any replica, the gap between its prepare and commit messages is just its round-trip distance to the leader. That alone already improves over older algorithms that tied blocking to the full network diameter.
PL then builds on this stepping stone by further decoupling stop and go events, borrowing the spirit of Delayed Stamping but applying it pairwise. Instead of relying on synchronized clocks, PL introduces a new event scheduling primitive that ensures a replica's stop event happens just before a visibility time, and its go event just after. Figure 4 illustrates this: each replica's worst-case blocking time becomes exactly twice the relative message delay between the leader and itself (see Table 1 for notation). In other words, nearby replicas get fast, almost instant reads, while the cost for distant ones reflects only their distance from the leader.
PL introduces a new pairwise event scheduling/synchronization primitive: instead of requiring global clock synchronization, the leader coordinates stop and go events directly with each follower. This scheduling ensures stop/go events happen at predictable real-time offsets relative to the leader's visibility time, while exploiting known lower bounds on delays to followers to maintain correctness. Yes, unfortunately, the drawback is that the delays to followers need to be reliable/predictable for correctness to hold. I discuss this problem at the end of the post.
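To make the stepping-stone idea concrete, here is a hedged sketch of the prepare-delay computation (my reading of Fig. 3; the function name and the millisecond delays are mine, and it assumes the one-way delays from the leader are known and stable, which is exactly the assumption I question below).

def schedule_prepares_pl(one_way_delay_ms):
    """Sketch of PL's stepping stone (my reading of Fig. 3).

    one_way_delay_ms maps each follower to the assumed, stable one-way delay
    from the leader. The leader holds back each prepare so that every
    acknowledgment lands back at the leader at the same time; each follower's
    prepare-to-commit gap then shrinks to its own round trip to the leader.
    """
    slowest_rtt = 2 * max(one_way_delay_ms.values())
    return {f: slowest_rtt - 2 * d for f, d in one_way_delay_ms.items()}

print(schedule_prepares_pl({"f_near": 5, "f_far": 50}))
# -> {'f_near': 90, 'f_far': 0}: the far follower gets its prepare immediately, the
#    near one 90 ms later, so both acks reach the leader together at t = 100 ms.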
Pairwise-All (PA)
PL optimizes aggressively for leader-adjacent replicas, but it penalizes distant ones. PA extends the pairwise trick to all replicas using all-to-all communication as shown in Figure 6.
PA's stepping stone in Figure 6 works like PL's but shifts the synchronization target: instead of aligning acknowledgments at the leader, it delays prepare messages so they all arrive at every replica at the same time. Each process also sends acknowledgments to all others, not just the leader. The effect is that every replica can commit once its own eccentricity time has passed since receiving a prepare. As a result, the worst-case read blocking time for each replica is its relative eccentricity.
To further reduce blocking, PA applies the same decoupling (delayed stamping) idea as PL but with all replicas aligned at the visibility time (as shown in Figure 7). The leader schedules stop events near this visibility point, and each process waits for stopped events from others before issuing its own go. Since a go event is the maximum of all stopped events, correctness holds regardless of scheduling accuracy. This ensures that every replica's worst-case blocking time is bounded by its eccentricity rather than its leader-distance. In practice, that means nearby replicas don't get PL's extreme gains, but distant ones aren't punished.
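And a matching hedged sketch for PA's stepping stone (my reading of Fig. 6; the names and millisecond delays are mine): prepares are delayed so they land at every replica at the same instant, and with all-to-all acknowledgments a replica's go is just the maximum of the stopped events it hears about.

def schedule_prepares_pa(one_way_delay_ms):
    """Sketch of PA's stepping stone (my reading of Fig. 6): delay each prepare
    so that it arrives at every replica at the same instant."""
    farthest = max(one_way_delay_ms.values())
    return {r: farthest - d for r, d in one_way_delay_ms.items()}

def go_event(stopped_events):
    """With all-to-all acknowledgments, a replica's go is the max of the stopped
    events it has heard, so correctness does not hinge on scheduling accuracy."""
    return max(stopped_events)

print(schedule_prepares_pa({"leader": 0, "f_near": 5, "f_far": 50}))
# -> {'leader': 50, 'f_near': 45, 'f_far': 0}: every replica sees the prepare at t = 50 ms.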
That's it, that's the story. Asymmetry is real. PL exploits it ruthlessly. PA makes it fair. If you're close to the leader, PL is your friend. If you're far, PA keeps you from suffering.
Discussion
Both PL and PA assume stable latencies and non-faulty processes. The paper sketches how to tolerate failures, but variance and reconfiguration remain open issues. The funny thing is that, to justify predictable network latencies, the authors cite Aleksey's paper: "Cloudy Forecast: How Predictable is Communication Latency in the Cloud?" Ironically, that paper shows the opposite: the variance can be massive, up to 10x the median in the WAN setup, and 3000x in the same-AZ setup! So either they didn't read it carefully, or they cited it for sport. Cloud networks aren't that tame and predictable for tenants.
The treatment of clock skew Δ is also odd. The paper insists Δ must be proportional to the network diameter, but that's a theoretical result, and I don't know how much it applies even to NTP-based synchronization. Moreover, in practice, GPS clocks exist, and AWS Timesync provides 50 microsecond clock uncertainty. Why not use these? The paper explicitly disallows GPS clocks to make the results from PL and PA look more favorable. A comparison against synchronized clocks would have been valuable. With clocks, blocking could be under a millisecond (as we designed in AWS DSQL) with just delayed timestamping; that would not only be a lot simpler, but would also beat anything from PL and PA significantly.
Our PQR work (2019) also tackled linearizable non-leader reads. You can also frame it as local reads, though PQR used multiple nodes. The key idea in PQR is to involve the client: The client contacts a quorum of nodes, usually gets a linearizable read in one shot, and in the rare case of an ongoing update, waits briefly and completes with a callback. PQR required no synchronized clocks and worked in a fully asynchronous model using only LSNs. It fits naturally in this space.
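To contrast with PL/PA, here is a heavily simplified, hedged sketch of the PQR read path; the node API (get returning value, LSN, and a pending flag) and the polling-based wait are invented for illustration, and the real protocol completes the rare slow path with a callback rather than polling.

def pqr_read(nodes, key, read_quorum_size):
    """Heavily simplified sketch of a Paxos Quorum Read (PQR, 2019).

    Assumes each node exposes get(key) -> (value, lsn, pending); that API is
    invented for illustration. The client reads from a quorum and returns the
    value with the highest LSN; in the rare case that value is still pending
    (an update is in flight), it waits briefly for the commit.
    """
    replies = [node.get(key) for node in nodes[:read_quorum_size]]
    value, lsn, pending = max(replies, key=lambda r: r[1])  # highest LSN wins
    if pending:
        value = wait_for_commit(nodes, key, lsn)
    return value

def wait_for_commit(nodes, key, lsn):
    # Invented helper: poll until some node reports the value at this LSN as
    # committed. The real protocol uses a callback instead of polling.
    while True:
        for node in nodes:
            value, node_lsn, pending = node.get(key)
            if node_lsn >= lsn and not pending:
                return value

Note that nothing here needs synchronized clocks: the client only compares LSNs from a quorum of replies.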
Postgres 18 beta3, small server, sysbench
This has performance results for Postgres 18 beta3, beta2, beta1 and 17.6 using the sysbench benchmark and a small server. The working set is cached and the benchmark is run with low concurrency (1 connection). The goal is to search for CPU regressions. This work was done by Small Datum LLC and was not sponsored.
tl;dr
- There might be small regressions (~2%) for several range queries that don't do aggregation. This is similar to what I reported for 18 beta1.
- Vacuum continues to be a problem for me and I had to repeat the benchmark a few times to get a stable result. It appears to be a big source of non-deterministic behavior leading to false alarms for CPU regressions in read-heavy tests that run after vacuum. In some ways, RocksDB compaction causes similar problems. Fortunately, InnoDB MVCC GC (purge) does not cause such problems.
- x10b_c8r32 is functionally the same as x10a_c8r32 but adds io_method=sync
- x10b1_c8r32 starts with x10b_c8r32 and adds vacuum_max_eager_freeze_failure_rate=0
- x10b2_c8r32 starts with x10b_c8r32 and adds vacuum_max_eager_freeze_failure_rate=0.99
Benchmark
The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
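For example (numbers made up for illustration, not taken from the post's results):

# Illustrative only, not results from the post.
qps_18_beta3 = 9800.0
qps_17_6 = 10000.0
relative_qps = qps_18_beta3 / qps_17_6   # 0.98 -> about a 2% regression vs 17.6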
The numbers highlighted in yellow below might be from a small regression for range queries that don't do aggregation. But note that this does reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.
- the mapping from microbenchmark name to Lua script is here
- the range query without aggregation microbenchmarks use oltp_range_covered.lua with various flags set and the SQL statements it uses are here. All of these return 100 rows.
- the scan microbenchmark uses oltp_scan.lua which is a SELECT with a WHERE clause that filters all rows (empty result set)
col-1 col-2 col-3 col-4 col-5 -> writes