a curated list of database news from authoritative sources

September 11, 2025

Postgres 18rc1 vs sysbench

This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and large server.

tl;dr

  • Postgres 18 looks great
  • I continue to see small CPU regressions in Postgres 18 for range queries that don't do aggregation on low-concurrency workloads. I have yet to explain that. 
  • The throughput for the scan microbenchmark has more variance with Postgres 18. I assume this is related to more or less work getting done by vacuum but I have yet to debug the root cause.

Builds, configuration and hardware

I compiled Postgres from source for versions 17.6, 18 beta3 and 18 rc1.

The servers are:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • large32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32 Cores and AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • large48
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic)
All configurations use synchronous IO, which is the only option prior to Postgres 18; for Postgres 18 the config file sets io_method=sync.

Configuration files:

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks 
and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

For all servers the read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The number of tables and rows per table was:
  • small server - 1 table, 50M rows
  • large servers - 8 tables, 10M rows per table
The number of clients (amount of concurrency) was:
  • small server - 1
  • large32 server - 24
  • large48 server - 40
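
For reference, a rough sketch of what one small-server run would look like with stock sysbench is below. The actual microbenchmarks are driven by my sysbench setup (explained via the link above), so the stock script name, connection details and values here are only illustrative placeholders.

# Hypothetical example: 1 table, 50M rows, 1 client, 600 seconds, using the stock oltp_point_select script
sysbench oltp_point_select --db-driver=pgsql \
  --pgsql-host=127.0.0.1 --pgsql-user=postgres --pgsql-db=sbtest \
  --tables=1 --table-size=50000000 --threads=1 prepare
sysbench oltp_point_select --db-driver=pgsql \
  --pgsql-host=127.0.0.1 --pgsql-user=postgres --pgsql-db=sbtest \
  --tables=1 --table-size=50000000 --threads=1 --time=600 run
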
Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
When the relative QPS is > 1 then some version is faster than PG 17.6. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.
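
As a made-up example to make the metric concrete: if Postgres 17.6 sustains 10,000 QPS on some microbenchmark and 18rc1 sustains 9,700 QPS, then
(9700 QPS for 18rc1) / (10000 QPS for 17.6) = 0.97
which reads as roughly a 3% regression relative to 17.6.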

The numbers highlighted in yellow below might be from a small regression for range queries that don't do aggregation. But note that this does not reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.

Results: small server

I continue to see small (~3%) regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. But I have yet to debug this and am not certain it is a regression. I am also skeptical about the great results for scan. I suspect that I have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

Relative to: Postgres 17.6
col-1 : 18beta3
col-2 : 18rc1

col-1   col-2   point queries
1.01    0.98    hot-points_range=100
1.01    1.00    point-query_range=100
1.02    1.02    points-covered-pk_range=100
0.99    1.01    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
1.00    0.99    points-notcovered-si_range=100
1.01    1.00    random-points_range=1000
1.01    0.99    random-points_range=100
1.01    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.97    0.96    range-covered-pk_range=100
0.97    0.97    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
0.99    0.99    range-notcovered-si_range=100
1.35    1.36    scan_range=100

col-1   col-2   range queries with aggregation
1.02    1.03    read-only-count_range=1000
1.00    1.00    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
1.00    0.99    read-only_range=100
0.99    0.98    read-only_range=10
1.01    1.01    read-only-simple_range=1000
1.02    1.00    read-only-sum_range=1000

col-1   col-2   writes
0.99    0.99    delete_range=100
0.99    1.01    insert_range=100
0.99    0.99    read-write_range=100
0.99    0.99    read-write_range=10
0.98    0.98    update-index_range=100
1.00    0.99    update-inlist_range=100
0.98    0.98    update-nonindex_range=100
0.98    0.97    update-one_range=100
0.98    0.97    update-zipf_range=100
0.99    0.98    write-only_range=10000

Results: large32 server

I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.

The improvements on the scan microbenchmark come from using less CPU. But I am skeptical about the improvements. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

Relative to: Postgres 17.6
col-1 : Postgres 18rc1

col-1   point queries
1.01    hot-points_range=100
1.01    point-query_range=100
1.01    points-covered-pk_range=100
1.01    points-covered-si_range=100
1.00    points-notcovered-pk_range=100
1.00    points-notcovered-si_range=100
1.01    random-points_range=1000
1.00    random-points_range=100
1.01    random-points_range=10

col-1   range queries without aggregation
0.99    range-covered-pk_range=100
0.99    range-covered-si_range=100
0.99    range-notcovered-pk_range=100
0.99    range-notcovered-si_range=100
1.12    scan_range=100

col-1   range queries with aggregation
1.00    read-only-count_range=1000
1.02    read-only-distinct_range=1000
1.01    read-only-order_range=1000
1.03    read-only_range=10000
1.00    read-only_range=100
1.00    read-only_range=10
1.00    read-only-simple_range=1000
1.00    read-only-sum_range=1000

col-1   writes
1.01    delete_range=100
1.00    insert_range=100
1.00    read-write_range=100
1.00    read-write_range=10
1.00    update-index_range=100
1.00    update-inlist_range=100
1.00    update-nonindex_range=100
0.99    update-one_range=100
1.00    update-zipf_range=100
1.00    write-only_range=10000

Results: large48 server

I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.

I am skeptical about the regression I see here for scan. That comes from using ~10% more CPU per query. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).

I have not seen the large improvements for the insert and delete microbenchmarks in previous tests on this large server. I assume this is another case where I need to figure out how to reduce variance when I run the benchmark.

Relative to: Postgres 17.6
col-1 : Postgres 18beta3
col-2 : Postgres 18rc1

col-1   col-2   point queries
0.99    0.99    hot-points_range=100
0.99    0.99    point-query_range=100
1.00    0.99    points-covered-pk_range=100
0.99    1.02    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
0.99    1.01    points-notcovered-si_range=100
1.00    0.99    random-points_range=1000
1.00    0.99    random-points_range=100
1.00    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.99    0.99    range-covered-pk_range=100
0.98    0.99    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
1.01    1.01    range-notcovered-si_range=100
0.91    0.91    scan_range=100

col-1   col-2   range queries with aggregation
1.04    1.03    read-only-count_range=1000
1.02    1.01    read-only-distinct_range=1000
1.01    1.00    read-only-order_range=1000
1.06    1.06    read-only_range=10000
0.98    0.97    read-only_range=100
0.99    0.99    read-only_range=10
1.02    1.02    read-only-simple_range=1000
1.03    1.03    read-only-sum_range=1000

col-1   col-2   writes
1.46    1.49    delete_range=100
1.32    1.32    insert_range=100
0.99    1.00    read-write_range=100
0.98    1.00    read-write_range=10
0.99    1.00    update-index_range=100
0.95    1.03    update-inlist_range=100
1.00    1.02    update-nonindex_range=100
0.96    1.04    update-one_range=100
1.00    1.01    update-zipf_range=100
1.00    1.00    write-only_range=10000




September 09, 2025

Beyond EOL: The Real Benefits of Upgrading to MySQL 8.4

Right now, you’re probably hoping someone else will deal with this MySQL 8.0 end-of-life situation. Maybe your team can squeeze another few months out of it. Maybe Oracle will extend support. Maybe it won’t be as bad as everyone says. We get it.  You’ve got enough things going on without adding “major database upgrade” to […]

September 08, 2025

Resilience of MongoDB's WiredTiger Storage Engine to Disk Failure Compared to PostgreSQL and Oracle

There have been jokes that have contributed to persistent myths about MongoDB's durability. The authors of those myths ignore that MongoDB's storage engine is among the most robust in the industry. MongoDB uses WiredTiger (created by the same author as Berkeley DB), which provides block corruption protection stronger than that of many other databases. In this article, I'll show how to reproduce a simple write loss, in a lab, and see how the database detects it to avoid returning corrupt data.

I like comparing the trade-offs and implementations of various databases, but I stick to discussing those I'm familiar with. Let's deliberately corrupt PostgreSQL, Oracle, and MongoDB at the storage level to see what happens. This experiment can be reproduced in a lab and your comments are welcome.

PostgreSQL

To expose the issue when a database doesn't detect lost writes, I chose PostgreSQL for this demonstration. As of version 18, PostgreSQL enables checksums by default. I'm testing it with the Release Candidate in a Docker lab:

docker run --rm -it --cap-add=SYS_PTRACE postgres:18rc1 bash

# Install some utilities
apt update -y && apt install -y strace

# Start PostgreSQL
POSTGRES_PASSWORD=x \
strace -fy -e trace=pread64,pwrite64 \
docker-entrypoint.sh postgres &

# Connect to PostgreSQL
psql -U postgres

I've started PostgreSQL, tracing the read and write calls with strace.

I check that block checksum is enabled:

postgres=# show data_checksums;
 data_checksums
----------------
 on
(1 row)

I create a demo table and insert random data:

create table demo (k int primary key, v text);
copy demo from program $$
cat /dev/urandom |
 base64 |
 head -10000 |
 awk '{print NR"\t"$0}' 
$$ ;
vacuum demo;
checkpoint;
create extension pg_buffercache;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;

I triggered a checkpoint to write to disk and flushed the shared buffers, so that I can see the read IO for the next query:

set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;

[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid   161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192

  k  |                                      v
-----+------------------------------------------------------------------------------
 999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)

This has read two pages from the index, and then the page that contains the row I'm querying. This page is in base/5/16388 at offset 114688.
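
As a cross-check (not part of the trace-based approach above), the same page can be located from SQL: pg_relation_filepath() returns the relation's file, and the block number in a row's ctid times the 8kB page size gives the offset (block 14 * 8192 = 114688 here):

select pg_relation_filepath('demo');    -- base/5/16388
select ctid from demo where k=999;      -- (14,...) i.e. block 14 of that file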

I use dd to save the content of this page, then update the row 999, trigger a checkpoint, and flush the buffer cache:

\! dd if=/var/lib/postgresql/18/docker/base/5/16388 of=block1.tmp bs=1 skip=114688 count=8192

update demo set v='xxxxxx' where k=999;
checkpoint;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;

I query the row again, and it shows the updated value:

set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;

[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid   161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10m\261\1K~\0\08\1\200\1\0 \4 \2\3\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192

  k  |   v
-----+--------
 999 | xxxxxx
(1 row)

It reads the same index pages, and the table page at offset 114688 now holds the new value.

To simulate disk corruption, I copy the previously saved block back to this location, and query again:

\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=114688 conv=notrunc

set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;

[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid   161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
  k  |                                      v
-----+------------------------------------------------------------------------------
 999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)

There's no error because this new block has a correct checksum: it is a valid block, just not in its right place. It also has the right structure, as it comes from a block of the same table. However, it shows a row that should not be there. This is the kind of error that can happen when the storage fails to write a block to its right place.

Checksums are still useful if the corruption is not aligned with well-formed blocks. For example, I overwrite the second half of the page with the first half of the previously saved block:

checkpoint;
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=118784 count=4096 conv=notrunc

set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;

[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid   161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid   161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
2025-09-08 17:58:41.876 UTC [161] LOG:  page verification failed, calculated checksum 20176 but expected 45250
2025-09-08 17:58:41.876 UTC [161] STATEMENT:  select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] LOG:  invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT:  select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] ERROR:  invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT:  select * from demo where k=999;
ERROR:  invalid page in block 14 of relation "base/5/16388"

Here, the calculated checksum does not match and an error is raised. PostgreSQL checksums can detect some block corruption, but it is still possible that a bug, or a malicious user with access to the filesystem, can change the data without being detected.
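
Postgres 18 is the first release where checksums are on by default; on older clusters they have to be enabled at initdb time (initdb --data-checksums) or converted offline with pg_checksums. A minimal sketch, assuming a cleanly stopped pre-18 cluster with its data directory in $PGDATA:

pg_checksums --check  -D "$PGDATA"    # verify checksums on a cluster that already has them
pg_checksums --enable -D "$PGDATA"    # rewrite every page to add checksums (offline operation)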

Oracle Database

To detect lost writes like the one I simulated above, Oracle Database compares the block checksum with the standby databases, as there is a low chance that the corruption happened in both environments. I've demonstrated this in the past with a similar demo: 18c new Lost Write Protection

WiredTiger (MongoDB storage engine)

MongoDB employs the WiredTiger storage engine, which is designed to prevent lost writes and detect disk failures that might return the wrong page. To achieve this, WiredTiger stores a checksum alongside each page address within the pointers between the B-tree pages, in an address cookie:

An address cookie is an opaque set of bytes returned by the block manager to reference a block in a B-tree file. It includes an offset, size, checksum, and object id.

In my lab, I first start a MongoDB container and compile wt, a command-line utility that allows direct interaction with WiredTiger files. This tool enables me to examine the storage engine without relying on the MongoDB query layer, and I'll use it in this series of blog posts:

docker run --rm -it --cap-add=SYS_PTRACE mongo bash
# install required packages
apt-get update && apt-get install -y git xxd strace curl jq python3 python3-dev python3-pip python3-venv python3-bson build-essential cmake gcc g++ libstdc++-12-dev libtool autoconf automake swig liblz4-dev zlib1g-dev libmemkind-dev libsnappy-dev libsodium-dev libzstd-dev
# get latest WiredTiger
curl -L $(curl -s https://api.github.com/repos/wiredtiger/wiredtiger/releases/latest | jq -r '.tarball_url') -o wiredtiger.tar.gz
# Compile
mkdir /wiredtiger && tar -xzf wiredtiger.tar.gz --strip-components=1 -C /wiredtiger ; cd /wiredtiger
mkdir build && cmake -S /wiredtiger -B /wiredtiger/build \
        -DHAVE_BUILTIN_EXTENSION_SNAPPY=1 \
        -DCMAKE_BUILD_TYPE=Release \
        -DENABLE_WERROR=0 \
        -DENABLE_QPL=0 \
        -DCMAKE_C_FLAGS="-O0 -Wno-error -Wno-format-overflow -Wno-error=array-bounds -Wno-error=format-overflow -Wno-error=nonnull" \
        -DPYTHON_EXECUTABLE=$(which python3)
cmake --build /wiredtiger/build
export PATH=$PATH:/wiredtiger/build:/wiredtiger/tools

It takes some time, but the wt command line utility will make the investigation easier. That's an advantage of MongoDB pluggable storage—you can examine it in layers.

I create a demo table and insert 10,000 records:

root@7d6d105a1663:/tmp# wt create table:demo

root@7d6d105a1663:/tmp# ls -alrt

total 68
drwxr-xr-x. 1 root root  4096 Sep  8 19:21 ..
-rw-r--r--. 1 root root    21 Sep  8 19:35 WiredTiger.lock
-rw-r--r--. 1 root root    50 Sep  8 19:35 WiredTiger
-rw-r--r--. 1 root root   299 Sep  8 19:35 WiredTiger.basecfg
-rw-r--r--. 1 root root  4096 Sep  8 19:35 demo.wt
-rw-r--r--. 1 root root  4096 Sep  8 19:35 WiredTigerHS.wt
-rw-r--r--. 1 root root  1475 Sep  8 19:35 WiredTiger.turtle
-rw-r--r--. 1 root root 32768 Sep  8 19:35 WiredTiger.wt
drwxrwxrwt. 1 root root  4096 Sep  8 19:35 .

root@7d6d105a1663:/tmp# wt list

colgroup:demo
file:demo.wt
table:demo

cat /dev/urandom |
 base64 |
 head -10000 |
 awk '{print "i",NR,$0}' |
wt dump -e table:demo

...
Inserted key '9997' and value 'ILZeUq/u/ErLB/i7LOUb4nwYP6D535trb8Mt3vcJXXRAqLeAiYIHn5bEWs1buflmiZMYd3rMMvhh'.
Inserted key '9998' and value 'y+b0eTV/4Ao12qRqtHhgP2xGUr+C9ZOfvOG3ZwbdDNXvpnbM1/laoJ9Yzyt6cbLJOR6jdQktpgFM'.
Inserted key '9999' and value 'cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c'.
Inserted key '10000' and value 'QYwyjaRxa9Q+5dvzwQtvv2QE/uS/vhRPCVsQ6p7re/L2yDrVRxyqkvSyMHeRCzMIsIovrCUJpPXI'.

I read record 9999 with wt and use strace to see the read calls:

strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"

...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 806912) = 28672
9999
cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c
...

This record is at offset 806912 in a 28672-byte block. I “save” this block with dd:

dd if=demo.wt of=block1.tmp bs=1 skip=806912 count=28672

28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0680832 s, 421 kB/s

I update this record to "xxxxxx" and trace the write calls:

strace -yy -e trace=pwrite64 -xs 36 wt dump -e table:demo <<<"u 9999 xxxxxx"

...
Updated key '9999' to value 'xxxxxx'.
pwrite64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
pwrite64(6</tmp/demo.wt>, "\x00...\x4d\x19\x14\x4e"..., 4096, 876544) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x4f\x74\xdb\x1e"..., 4096, 880640) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x21\xcc\x25\x06"..., 4096, 884736) = 4096
...

This writes a new 28672-byte block (WiredTiger does not write in place, which helps to avoid corruption) and updates the B-tree branches.

I can read this new value:

strace -yy -e trace=pread64  -xs 36 wt dump -e table:demo <<<"s 9999"

...
pread64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
9999
xxxxxx
...

To simulate disk corruption, I do the same as I did on PostgreSQL: replace the current block with the old one. I save the current block before overwriting it:

dd if=demo.wt of=block2.tmp bs=1 skip=847872 count=28672

28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0688249 s, 417 kB/s

dd of=demo.wt if=block1.tmp bs=1 seek=847872 conv=notrunc

28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0666375 s, 430 kB/s

If I try to read the record in this block, the corruption is detected:

strace -yy -e trace=pread64  -xs 36 wt dump -e table:demo <<<"s 9999"

...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 847872) = 28672
[1757361305:392519][8246:0x7fe9a087e740], wt, file:demo.wt, WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __wti_block_read_off, 279: demo.wt: potential hardware corruption,
read checksum error for 28672B block at offset 847872: block header
checksum of 0x4177631f doesn't match expected checksum of 0xf51fb102
[1757361305:392904][8246:0x7fe9a087e740], wt, file:demo.wt,
WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __bm_corrupt_dump, 86: {0: 847872, 28672, 0xf51fb102}: (chunk 1 of 28): 00 00 00 00 00
00 00 00 3a 00 00 00 00 00 00 00 40 6e 00 00 a8 02 00 00 07 04 00 01 00 70 00 00 1f 63 77 41 01 00 00 00 11 39 36 39 33 80 8c 43 38 30 58 51 66 64 
...

Even though the checksum is internally consistent for the block itself, WiredTiger detected that the block's checksum, 0x4177631f (visible as 1f 63 77 41 in the hexadecimal dump, or \x1f\x63\x77\x41 in the read trace), differs from the expected checksum 0xf51fb102 from the address cookie.

0xf51fb102 was visible as \x02\xb1\x1f\xf5 in the write call of the update, and is visible as 02 b1 1f f5 in the block that I've saved before overwriting it:

root@7d6d105a1663:/tmp# xxd -l 36 block2.tmp
00000000: 0000 0000 0000 0000 5c00 0000 0000 0000  ........\.......
00000010: f96d 0000 a802 0000 0704 0001 0070 0000  .m...........p..
00000020: 02b1 1ff5

Even with access to the files, it would be extremely difficult to corrupt the data in an undetected way because any change must update the checksum, and the checksum is referenced in all address cookies in other blocks. Block corruption is highly unlikely as the blocks are not updated in place, and failure to write blocks would break the pointers.

WiredTiger is open-source and you can check the WT_BLOCK_HEADER definition. In this structure, the block size (disk_size) field appears before the checksum field: for example, 00 70 00 00 = 0x00007000 = 28,672 bytes, followed by the checksum 02 b1 1f f5 = 0xf51fb102. One advantage of WiredTiger is that B-tree leaf blocks can have flexible sizes, which MongoDB uses to keep documents as one chunk on disk and improve data locality.
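
To make that layout concrete, the two fields can be pulled out of the saved block with xxd in little-endian mode (a small sketch, assuming the block2.tmp saved above and the field offsets implied by the dump: disk_size at byte 28, checksum at byte 32):

xxd -e -s 28 -l 4 block2.tmp    # 00007000 = 28672, the block's disk_size
xxd -e -s 32 -l 4 block2.tmp    # f51fb102, the checksum referenced by the address cookie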

Checksum verification is implemented in block_read.c and performs two validations:

  • It checks that the checksum stored in the block header matches the expected checksum from the address cookie (the B-tree pointer created when the block was written).
  • It zeroes out the checksum field in the header and recomputes the checksum over the block content, verifying it also matches the expected checksum. This ensures both the block’s integrity and its identity.

Conclusion

PostgreSQL requires you to enable checksums to detect data corruption. This detects when a page’s checksum does not match its content. However, if the system erroneously writes a different, but valid, block from the same table in place of the intended one, or misses a write and the previous version of the block remains, PostgreSQL cannot identify this issue. As a result, some disk failures may escape detection and return wrong results.

Oracle Database stores blocks with checksums and can enable checking them on read. With a Data Guard standby and some network overhead, the database can transmit checksums over the network to verify data integrity when reading.

MongoDB WiredTiger enables checksums by default and can detect the wrong blocks without the need to contact replicas. It embeds the expected checksum inside the B-tree address cookie so that every internal B-tree pointer to a leaf page includes the checksum for the referenced page. If an obsolete or different page is swapped in, any mismatch will be detected because the pointer's checksum won’t match. WiredTiger uses copy-on-write, not in-place overwrites, further reducing the risk of corruption.

Here is a description of WiredTiger by Keith Bostic:

And an article by Bob Liles about how MongoDB Atlas deals with that: Managing Data Corruption in the Cloud

Disaggregation: A New Architecture for Cloud Databases

This short VLDB'25 paper surveys disaggregation for cloud databases. It has several insightful points, and I found it worth summarizing. 

The key advantage of the cloud over on-prem is elastic scalability: users can scale resources up and down and pay only for what they use. Traditional database architectures, like shared-nothing, do not fully exploit this. Thus, cloud-native databases increasingly adopt disaggregated designs.

Disaggregation is primarily motivated by the asymmetry between compute and storage:

  • Compute is far more expensive than storage in the cloud.
  • Compute demand fluctuates quickly; storage grows slowly.
  • Compute can be stateless and easier to scale, while storage is inherently stateful.

Decoupling them lets compute scale elastically while storage remains relatively stable and cheap.


Review of Disaggregation in the Clouds

Early cloud-native systems like Snowflake and Amazon Aurora separate compute and storage into independent clusters. Modern systems push disaggregation further. Socrates splits storage into three services: a logging service (small footprint, strict latency), a page cache, and a durable page store. This way each service can be tuned for its own performance/cost tradeoffs. The logging service, for example, can use faster storage hardware.

Other disaggregation examples include computation pushdown (Redshift Spectrum, S3 Select), intermediate caching (Snowflake), a metadata service (Lakehouse), and memory disaggregation (PolarDB). Many other functions (indexing, concurrency control, query optimization) remain underexplored. There is room for a unified middleware layer between compute and storage that can consolidate these.

The paper doesn't mention it, but this discussion mirrors the microservices trend in systems design. Breaking monoliths into smaller, independently scalable services improves modularity and resource efficiency, and also enables better sharing and pooling of resources across workloads. We may see disaggregated databases evolve the way microservices did: first a simple split, then a bunch of fine-grained services, and eventually a need for orchestration layers, observability, and service meshes. Today it is compute and storage; tomorrow it could be dozens of database microservices (maybe even including concurrency control), stitched together by a middleware layer that looks suspiciously like Kubernetes.


Tradeoffs in disaggregated design

The main tradeoff is performance. Since disaggregated components are physically separate, the communication overhead can be high. A 2019 study shows a 10x throughput hit compared to a tuned shared-nothing system. Optimizations can help narrow the gap, but disaggregation should be applied only when its benefits outweigh the network cost. This tradeoff also motivates research into techniques that reduce communication overhead. Hello, distributed systems research!


Rethinking Core Protocols

Many distributed database protocols assume shared-nothing architecture, so with disaggregation, some of these assumptions no longer hold. This creates new opportunities, and not just more problems.

For example, 2PC normally faces a blocking problem because a failed node's log is inaccessible. With disaggregated storage, logs live in a shared, reliable service. Cornus 2PC protocol (2022) leverages this: active nodes can vote NO on behalf of failed nodes by writing directly to their logs. A compare-and-swap API ensures only one decision is recorded.
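
A loose illustration of the compare-and-swap idea (my own sketch, not code from the paper): if the shared log exposes an atomic conditional write, a peer's NO vote on behalf of a failed participant only lands when no decision has been recorded yet. In SQL terms, against a hypothetical decisions table:

create table decisions (txn_id bigint primary key, decision text);
-- whichever write arrives first for a txn_id wins; later attempts are no-ops
insert into decisions (txn_id, decision) values (42, 'NO')
on conflict (txn_id) do nothing;
select decision from decisions where txn_id = 42;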


Disaggregating the Query Engine

Pushdown reduces data movement by executing operators closer to storage. This idea has been studied in database machines, Smart SSDs, and PIM, but fits especially well in the cloud. Leveraging this, PushdownDB uses S3 Select to push down both basic and advanced operators. It cuts query time 6.7x and cost 30%. FlexPushdownDB combines pushdown with caching so that operators like filters or hash probes can run locally on cached data and remotely via pushdown, with results merged. This hybrid mode outperforms either technique alone by 2.2x.


Enabling New Capabilities and Embracing New Hardware

Modern applications want queries to reflect the latest transactions, not data from hours ago. HTAP systems support this but require migration to new engines. Disaggregated architectures offer an opportunity here, and Hermes (VLDB'25) exploits this by placing itself between the compute and storage. Hermes intercepts transactional logs and analytical reads, merging recent updates into queries on the fly. Updates are later batched into stable storage.

Disaggregation also motivates and eases adoption of new hardware. Different components can use specialized GPUs, RDMA, CXL in order to achieve the best cost-performance tradeoff. The paper cites a GPU-based DuckDB engine (VLDB'25) that achieves large speedups by exploiting parallelism.


Discussion

So this paper mentions several new research directions. What should systems researchers in academia work on?

Here is what I think would be an impactful work. Take a monolithic database (like Postgres, RocksDB, or MySQL) and transform it to a disaggregated database. But not just for shits and giggles. Study the efficiency tradeoffs in alternative designs. Also study the software engineering tradeoffs, cost-to-production tradeoffs, resilience tradeoffs and metastability risks. Compare and contrast different transformation paths. This would provide a good roadmap (and paths to avoid minefields) for many databases that consider a similar redesign/implementation. Last year I had reviewed this paper which made a small crack at this, but I believe there is a lot of research and work here to be tackled. 

For distributed protocol designers, the rethinking core protocols section provides a good blueprint. Pick other protocols, such as consensus, leader election, replication, and caching; revisit them in the disaggregated setting; and consider the new opportunities that open up alongside the challenges introduced.



Swimming with Sharks: Analyzing Encrypted Database Traffic Using Wireshark

Percona has a great set of tools known as the Percona Toolkit, one of which is pt-upgrade. The idea behind this tool is to replay a captured sequence of queries that were executed on a different database server. This is very useful to validate if a new version of the database server works as expected […]

Our Myrtle Beach vacation

This year was a bastard. Not from work. God, no, I find work relaxing. Reading papers, inventing algorithms, ripping apart distributed systems with TLA+ models -- that's how I have fun. I can do that all day with a grin on my face. But the minute I need to do adulting (like simply calling the cable company to ask why they keep increasing our bill when I'm not looking), I will stress and procrastinate for weeks. And this year, I had a lot of adulting to do to put our house on the market, and plan a move to California, all the while juggling to help three kids with school and EC activities. I was pretty stressed most of the time, and I've been grinding my teeth at night like a mule chewing rocks.

Anywho, we botched our great escape to California. House didn't sell quickly, as we hoped it would, and we are stuck in Buffalo for another year. Summer disappeared in a cloud of errands and disappointment, and suddenly it was late August with our kids twitching with pre-school nerves. There was still some time left to salvage the wreck. We needed a beach, any beach. Myrtle Beach! We heard good things about it. One of my friends called it the Redneck Riviera. Good enough for me, and far enough from Buffalo, so we decided to give it a try. 

Planning is not my strong suit. My wife took the reins. She scoured hotel reviews like a CIA interrogator, picking through tales of bedbugs, mold, broken elevators. She has a radar for doom. Without her, I'd have booked us straight into some cockroach casino on the boardwalk. But she nailed it. Ten days before departure, she locked down an Airbnb room inside a proper resort hotel, facing the ocean. We chose Airbnb for better customer service and because the photos showed the exact floor and view we would get. There was no guessing which floor or room we would get if we went with the resort directly.

The best thing about the vacation is the anticipation and the wait. We counted down the days, giddy with excitement.

And then, the drive. Always the drive. That is how the Demirbas family rolls: No planes if at all possible. Planes are a scam. For five people, it's bankruptcy on wings. You waste a whole day shuffling through TSA lines, just to pray nervously that your plane doesn't cancel on you and, if you are lucky, to sit in a recycled air canister for hours. We once drove from Buffalo to Seattle, and back. And another time to Florida and back. For us, seven hours on asphalt is a warm-up lap from Buffalo to Boston. Myrtle Beach was thirteen. Still doable. Just load the audiobooks, clamp the Bose headphones on my head, and hit the highway. Driving is my meditation: the road pours itself under my car, like some childhood arcade game where the scenery and other cars on the road scroll past you for hours as I nudge the steering wheel left and right to accommodate.

We left Buffalo at 8:30 in our tightly packed Highlander. By noon the youngest announced that she hadn't hit the bathroom that morning and we stopped at a McDonald's a little south of Pittsburgh. We mostly pass as a normal American family, but at this stop we stood out like a sore thumb. We received a lot of cold blond stares. I later understood why: after another 30 minutes of driving, the barns started shouting TRUMP in dripping paint, and we entered West Virginia. God's Country, they call it. Heaven on earth. But it was just some green hills, and the scenery didn't impress me much.

Our next stop was at a rest area in North Carolina, which turned out to be the cleanest, most immaculate rest area I'd ever seen. Somebody in Raleigh must be laundering money through landscaping contracts, but damn if it didn't impress us. Even the butterflies were impressive!

Then Myrtle Beach: 85-degree weather, ocean air, great view from our flat, and a nice waterpark at the resort. Southern hospitality is real; everyone was smiling. Compared to the winter-scarred faces in Buffalo, it felt like stepping onto another planet. The Carolinas had already shoved their kids back into classrooms, so we owned the pools and ruled the lazy river. The kids tore through the slides. I soaked in the jacuzzi like a tired warrior. At night we binge-watched Quantum Leap. We would have also watched during the day, but my wife dragged us out for beach walks and waterpark raids. Sometimes we need the push.

By the third day, the spell had taken hold. I started to relax. Sleep came easy, deeper than at home. The wave sounds and the sea view worked their magic. Staying right on the beach was great. No hauling gear, no logistics. Step out the door, fall into the ocean, and crawl back to the flat when you're cooked. The flat was clean, spacious, and blessed with a kitchen so we could gorge on comfort food without shame.

We were wondering if we made a mistake by getting the resort 4-5 miles north of the Boardwalk. When we visited the boardwalk on the third day, we realized that it was overrated anyways. It was full of tourist-trap shops, neon lights, and featured a SkyWheel, which we didn't bother to try. We didn't need the Boardwalk. Myrtle Beach itself is the show: the waves, the horizon, and the beach.

Of course, I had to ruin myself. The kids used sunscreen like sensible citizens, and I, an idiot heir to Turkish tanning lore, slathered on olive oil (which I swiped from our kitchen). If it fries an egg, it'll bronze a body, right? Well... I roasted into a lobster, alright... But I ended up slowly shedding skin like a reptile for days afterwards.

The drive back was clean. Salem's Grill in Pittsburgh was our mandatory detour. You go to great food, if great food doesn't get to you. We hit it before 7pm and dined like kings until closing at 8pm. We were back home before midnight. Uneventful driving, the way I like it.

But vacations are lies, sweet lies. Within days the teeth grinding returned. The adult machinery reloaded with forms to sign, kids to shuttle, bills to pay. Adulting feels like having to deal with a constant deluge of junk mail and random chores from the universe.

And, then the saddest part... we will be shipping Ahmet to college. He leaves for Caltech soon (must be his mother's genes). I am proud, of course, but I will miss him a lot. I bought the plane ticket yesterday after weeks of pretending I didn't have to. Kids grow fast. Too fast... It isn't fair. 

Building a DOOM-like multiplayer shooter in pure SQL

DOOMQL: A DOOM-like multiplayer shooter in pure SQL

I recently stumbled across Patrick’s excellent DOOM clone running in a browser powered by DuckDB-WASM. Ever since I’ve read that, I wanted to push his awesome idea to the logical extreme: Build a multiplayer DOOM-like shooter entirely in SQL with CedarDB doing all the heavy lifting. During a month of parental leave (i.e., a lot of sleepless nights), I tried exactly that.

Here’s a sneak peek at DOOMQL:

September 05, 2025

Automating vector embedding generation in Amazon Aurora PostgreSQL with Amazon Bedrock

In this post, we explore several approaches for automating the generation of vector embedding in Amazon Aurora PostgreSQL-Compatible Edition when data is inserted or modified in the database. Each approach offers different trade-offs in terms of complexity, latency, reliability, and scalability, allowing you to choose the best fit for your specific application needs.

Group database tables under AWS Database Migration Service tasks for PostgreSQL source engine

AWS DMS accommodates a broad range of source and target data repositories, such as relational databases, data warehouses, and NoSQL databases. Proper preparation and design are vital for a successful migration process, especially when it comes to optimizing performance and addressing potential delay issues. In this blog post, we offer guidance about recognizing potential root causes of complete load and CDC delays early in the process and provide suggestions for optimally clustering tables to achieve the best performance for an AWS DMS task.

September 03, 2025

Recent Reads (September 25)

Small Gods (1992)

I absolutely loved Small Gods. Pratchett takes on religion, faith, and power. The story follows Om, a god trapped in the body of a tortoise, who has only one true believer left: a novice named Brutha. The central premise is that gods and mythical beings exist because people believe in them, and their power fades as belief fades.

    There’s no point in believing in things that exist.

The book is funny, clever, and surprisingly philosophical. Pratchett skewers organized religion, but he also asks bigger questions: What is faith? What is belief? How do institutions shape people, and how do people shape institutions? It's satire, but not heavy-handed. Like Vonnegut, he writes with a wink, yet there's real depth under the jokes.

    Gods don't like people not doing much work. People who aren't busy all the time might start to think.

    Why not? If enough people believe, you can be god of anything….

    The figures looked more or less human. And they were engaged in religion. You could tell by the knives (it's not murder if you do it for a god).

    The trouble with being a god is that you've got no one to pray to.

I came across Small Gods after reading Gaiman's American Gods (2001), which credits it as inspiration. Both explore gods and belief, but Pratchett's is lighter, sharper, and full of characters you actually care about. The audiobook is narrated by Andy Serkis, and he's brilliant. My precious!

Some more quotes from the book:

    "But is all this true?" said Brutha. / Didactylos shrugged. "Could be. Could be. We are here and it is now. The way I see it is, after that, everything tends towards guesswork." / "You mean you don't KNOW it's true?" said Brutha. / "I THINK it might be," said Didactylos. "I could be wrong. Not being certain is what being a philosopher is all about."

    What have I always believed? That on the whole, and by and large, if a man lived properly, not according to what any priests said, but according to what seemed decent and honest inside, then it would, at the end, more or less, turn out all right.

    "Sometimes the crime follows the punishment, which only serves to prove the foresight of the Great God." / "That's what my grandmother used to say," said Brutha automatically. / "Indeed? I would like to know more about this formidable lady." / "She used to give me a thrashing every morning because I would certainly do something to deserve it during the day," said Brutha. / "A most complete understanding of the nature of mankind."

    Probably the last man who knew how it worked had been tortured to death years before. Or as soon as it was installed. Killing the creator was a traditional method of patent protection.

    Last night there seemed to be a chance. Anything was possible last night. That was the trouble with last nights. They were always followed by this mornings.

    The Turtle Moves!


I Shall Wear Midnight (2010)

I had read The Shepherd's Crown earlier, and it made me a witch lover. I loved this book too. Pratchett's witches are sharp, strong, and unforgettable. Tiffany Aching, the protagonist, is brave, clever, and endlessly practical: “I make it my business. I'm a witch. It's what we do. When it's nobody else's business, it's my business.”

Pratchett's prose is excellent and witty. He makes you laugh and think at the same time. Lines like “You've taken the first step.” / “There's a second step?” / “No; there's another first step. Every step is a first step if it's a step in the right direction” stayed with me. There’s so much care in how he builds characters, their choices, and their world. The book is about people, not the fantasy world, and that is Pratchett's genius. The witches, the townsfolk, even the Nac Mac Feegles ... they all feel alive.

I also loved how the story quietly mirrors what researchers like us do: “We look to the edges. There’s a lot of edges, more than people know. Between life and death, this world and the next, night and day, right and wrong … an’ they need watchin’. We watch ’em, we guard the sum of things.” That is the witches' charter, but also a motto for formal methods researchers, or anyone keeping an eye on the boundaries of a complex system.

Another great line: “Well, as a lawyer I can tell you that something that looks very simple indeed can be incredibly complicated, especially if I'm being paid by the hour. The sun is simple. A sword is simple. A storm is simple. Behind everything simple is a huge tail of complicated.”


Quantum Leap (TV series 2022)

As a kid growing up in Turkey, I watched the original Quantum Leap, and it felt magical. We were filled with wildly incorrect optimism about science and technology, but it felt inspiring. Some lies are useful. The reboot was fun to watch with my kids too. Raymond Lee as Ben Song is a strong lead: he's the physicist who ends up stuck leaping into the bodies of other people. Caitlin Bassett guides him as his fiancée and observer.

The show is well-paced and fun for family viewing. Many episodes lean into socially conscious themes, which I appreciated, but at times it becomes unintentionally offensive: portraying “dumb Americans” with exaggerated Southern accents or mocking Salem settlers in ways that feel worse than silly. The writers clearly aimed for humor or commentary, but the execution backfired. I won't spoil the full story, but the second season tries a pivot/twist, only to throw it out again. What were they thinking? Moments like this make me want to be a TV writer.