September 15, 2025
In response to a developer asking about systems
Sometimes I get asked questions that would be more fun to answer in public. All letters are treated as anonymous unless permission is otherwise granted.
Hey [Redacted]! It's great to hear from you. I'm very glad you joined the coffee club and met some good folks. :)
You asked how to learn about systems. A great question! I think I need to start first with what I mean when I say systems.
My definition of systems is all of the underlying software we developers use but are taught not to think about because it is so solid: our compilers and interpreters, our databases, our operating system, our browser, and so on. We think of them as basically not having bugs; we just count on them to be correct and fast enough so that we can build the applications that really matter to users.
But 1) some developers do actually have to work on these fundamental blocks (compilers, databases, operating systems, browsers, etc.) and 2) it's not thaaaat hard to get into this development professionally and 3) even if you don't get into it professionally, having a better understanding of these fundamental blocks will make you a better application developer. At least I think so.
To get into systems I think it starts by you just questioning how each layer you build on works. Try building that layer yourself. For example you've probably used a web framework like Rails or Next.js. But you can just go and write that layer yourself too (for education).
And you've probably used Postgres or SQLite or DynamoDB. But you can also just go and write that layer yourself (for education). It's this habit of thinking and digging into the next lower layer that will get you into systems. Basically, not being satisfied with the black box.
I do not think there are many good books on programming in general, and very very few must-read ones, but one that I recommend to everybody is Designing Data Intensive Applications. I think it's best if you read it with a group of people. (My book club will read it in December when the 2nd edition comes out, you should join.) But this book is obviously specific to data, and it doesn't cover the fundamentals of other systems topics like compilers or operating systems or browsers and so on.
Also, I see getting into this as a long-term thing. Throughout my whole career (almost 11 years now) I definitely always tried to dig into compilers and interpreters; I wrote and blogged about toy implementations a lot. And then 5 years ago I started digging into databases and saw that there was more career potential there. But it still took 4 years until I got my first job as a developer working on a database (the job I currently have).
Things take time to learn and that's ok! You have a long career to look forward to. And if you end up not wanting to dig into this stuff that's totally fine too. I think very few developers actually do. And they still have fine careers.
Anyway, I hope this is at least mildly useful. I hope you join nycsystems.xyz as well and look forward to seeing you at future coffee clubs!
Cheers,
Phil
September 14, 2025
MongoDB Internals: How Collections and Indexes Are Stored in WiredTiger
WiredTiger is MongoDB’s default storage engine, but what really occurs behind the scenes when collections and indexes are saved to disk? In this short deep dive, we’ll explore the internals of WiredTiger data files, covering everything from _mdb_catalog metadata and B-Tree page layouts to BSON storage, primary and secondary indexes, and multi-key array handling. The goal is to introduce useful low-level tools like wt and other utilities.
I ran this experiment in a Docker container, set up as described in a previous blog post:
docker run --rm -it --cap-add=SYS_PTRACE mongo bash
# install required packages
apt-get update && apt-get install -y git xxd strace curl jq python3 python3-dev python3-pip python3-venv python3-pymongo python3-bson build-essential cmake gcc g++ libstdc++-12-dev libtool autoconf automake swig liblz4-dev zlib1g-dev libmemkind-dev libsnappy-dev libsodium-dev libzstd-dev
# get WiredTiger main branch
curl -L $(curl -s https://api.github.com/repos/wiredtiger/wiredtiger/releases/latest | jq -r '.tarball_url') -o wiredtiger.tar.gz
git clone https://github.com/wiredtiger/wiredtiger.git
cd wiredtiger
# Compile
mkdir build && cmake -S /wiredtiger -B /wiredtiger/build \
-DCMAKE_C_FLAGS="-O0 -Wno-error -Wno-format-overflow -Wno-error=array-bounds -Wno-error=format-overflow -Wno-error=nonnull" \
-DHAVE_BUILTIN_EXTENSION_SNAPPY=1 \
-DCMAKE_BUILD_TYPE=Release
cmake --build /wiredtiger/build
# add `wt` binaries and other tools in the PATH
export PATH=$PATH:/wiredtiger/build:/wiredtiger/tools
# Start mongodb
mongod &
I use the mongo image, add the WiredTiger sources from the main branch, compile it to get wt, and start mongod.
I create a small collection with three documents and an index, and stop mongod:
mongosh <<'JS'
db.franck.insertMany([
{_id:"aaa",val1:"xxx",val2:"yyy",val3:"zzz",msg:"hello world"},
{_id:"bbb",val1:"xxx",val2:"yyy",val3:"zzz",msg:["hello","world"]},
{_id:"ccc",val1:"xxx",val2:"yyy",val3:"zzz",msg:["hello","world","hello","again"]}
]);
db.franck.createIndex({_id:1,val1:1,val2:1,val3:1,msg:1});
db.franck.find().showRecordId();
use admin;
db.shutdownServer();
JS
I stop MongoDB so that I can access the WiredTiger files with wt without them being opened and locked by another program. Before stopping, I displayed the documents:
[
{
_id: 'aaa',
val1: 'xxx',
val2: 'yyy',
val3: 'zzz',
msg: 'hello world',
'$recordId': Long('1')
},
{
_id: 'bbb',
val1: 'xxx',
val2: 'yyy',
val3: 'zzz',
msg: [ 'hello', 'world' ],
'$recordId': Long('2')
},
{
_id: 'ccc',
val1: 'xxx',
val2: 'yyy',
val3: 'zzz',
msg: [ 'hello', 'world', 'hello', 'again' ],
'$recordId': Long('3')
}
]
The files are stored in the default WiredTiger directory, /data/db. The MongoDB catalog, which maps the MongoDB collections to their storage attributes, is stored in the WiredTiger table _mdb_catalog:
root@72cf410c04cb:/wiredtiger# ls -altU /data/db
drwxr-xr-x. 4 root root 32 Sep 1 23:10 ..
-rw-------. 1 root root 0 Sep 13 20:33 mongod.lock
drwx------. 2 root root 74 Sep 13 20:29 journal
-rw-------. 1 root root 21 Sep 12 22:47 WiredTiger.lock
-rw-------. 1 root root 50 Sep 12 22:47 WiredTiger
-rw-------. 1 root root 73728 Sep 13 20:33 WiredTiger.wt
-rw-r--r--. 1 root root 1504 Sep 13 20:33 WiredTiger.turtle
-rw-------. 1 root root 4096 Sep 13 20:33 WiredTigerHS.wt
-rw-------. 1 root root 36864 Sep 13 20:33 sizeStorer.wt
-rw-------. 1 root root 36864 Sep 13 20:33 _mdb_catalog.wt
-rw-------. 1 root root 114 Sep 12 22:47 storage.bson
-rw-------. 1 root root 20480 Sep 13 20:33 collection-0-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:33 index-1-3767590060964183367.wt
-rw-------. 1 root root 36864 Sep 13 20:33 collection-2-3767590060964183367.wt
-rw-------. 1 root root 36864 Sep 13 20:33 index-3-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:20 collection-4-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:20 index-5-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:33 index-6-3767590060964183367.wt
drwx------. 2 root root 4096 Sep 13 20:33 diagnostic.data
drwx------. 3 root root 21 Sep 13 20:17 .mongodb
-rw-------. 1 root root 20480 Sep 13 20:33 collection-0-6917019827977430149.wt
-rw-------. 1 root root 20480 Sep 13 20:23 index-1-6917019827977430149.wt
-rw-------. 1 root root 20480 Sep 13 20:25 index-2-6917019827977430149.wt
Catalog
_mdb_catalog maps MongoDB names to WiredTiger table names. wt lists the key (recordId) and value (BSON):
root@72cf410c04cb:~# wt -h /data/db dump table:_mdb_catalog
WiredTiger Dump (WiredTiger Version 12.0.0)
Format=print
Header
table:_mdb_catalog
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=snappy,block_manager=default,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,disaggregated=(page_log=),encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(compare_timestamp=oldest_timestamp,enabled=false,file_metadata=,metadata_file=,panic_corrupt=true,repair=false),in_memory=false,ingest=,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=,bloom=,bloom_bit_count=,bloom_config=,bloom_hash_count=,bloom_oldest=,chunk_count_limit=,chunk_max=,chunk_size=,merge_max=,merge_min=),memory_page_image_max=0,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,source="file:_mdb_catalog.wt",split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,stable=,tiered_storage=(auth_token=,bucket=,bucket_prefix=,cache_directory=,local_retention=300,name=,object_target_size=0),type=file,value_format=u,verbose=[],write_timestamp_usage=none
Data
\81
r\01\00\00\03md\00\eb\00\00\00\02ns\00\15\00\00\00admin.system.version\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04\ba\fc\c2\a9;EC\94\9d\a1\df(\c9\87\eaW\00\04indexes\00\97\00\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00+\00\00\00\02_id_\00\1c\00\00\00index-1-3767590060964183367\00\00\02ns\00\15\00\00\00admin.system.version\00\02ident\00!\00\00\00collection-0-3767590060964183367\00\00
\82
\7f\01\00\00\03md\00\fb\00\00\00\02ns\00\12\00\00\00local.startup_log\00\03options\003\00\00\00\05uuid\00\10\00\00\00\042}_\a9\16,L\13\aa*\09\b5<\ea\aa\d6\08capped\00\01\10size\00\00\00\a0\00\00\04indexes\00\97\00\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00+\00\00\00\02_id_\00\1c\00\00\00index-3-3767590060964183367\00\00\02ns\00\12\00\00\00local.startup_log\00\02ident\00!\00\00\00collection-2-3767590060964183367\00\00
\83
^\02\00\00\03md\00\a7\01\00\00\02ns\00\17\00\00\00config.system.sessions\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04D\09],\c6\15FG\b6\e2m!\ba\c4j<\00\04indexes\00Q\01\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\031\00\b7\00\00\00\03spec\00R\00\00\00\10v\00\02\00\00\00\03key\00\12\00\00\00\10lastUse\00\01\00\00\00\00\02name\00\0d\00\00\00lsidTTLIndex\00\10expireAfterSeconds\00\08\07\00\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\14\00\00\00\05lastUse\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00Y\00\00\00\02_id_\00\1c\00\00\00index-5-3767590060964183367\00\02lsidTTLIndex\00\1c\00\00\00index-6-3767590060964183367\00\00\02ns\00\17\00\00\00config.system.sessions\00\02ident\00!\00\00\00collection-4-3767590060964183367\00\00
\84
\a6\02\00\00\03md\00\e6\01\00\00\02ns\00\0c\00\00\00test.franck\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04>\04\ec\e2SUK\ca\98\e8\bf\fe\0eu\81L\00\04indexes\00\9b\01\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\031\00\01\01\00\00\03spec\00q\00\00\00\10v\00\02\00\00\00\03key\005\00\00\00\10_id\00\01\00\00\00\10val1\00\01\00\00\00\10val2\00\01\00\00\00\10val3\00\01\00\00\00\10msg\00\01\00\00\00\00\02name\00!\00\00\00_id_1_val1_1_val2_1_val3_1_msg_1\00\00\08ready\00\01\08multikey\00\01\03multikeyPaths\00?\00\00\00\05_id\00\01\00\00\00\00\00\05val1\00\01\00\00\00\00\00\05val2\00\01\00\00\00\00\00\05val3\00\01\00\00\00\00\00\05msg\00\01\00\00\00\00\01\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00m\00\00\00\02_id_\00\1c\00\00\00index-1-6917019827977430149\00\02_id_1_val1_1_val2_1_val3_1_msg_1\00\1c\00\00\00index-2-6917019827977430149\00\00\02ns\00\0c\00\00\00test.franck\00\02ident\00!\00\00\00collection-0-6917019827977430149\00\00
I can decode the BSON value with wt_to_mdb_bson.py to display it as JSON, and use jq to filter the file information about the collection I've created:
wt -h /data/db dump -x table:_mdb_catalog |
wt_to_mdb_bson.py -m dump -j |
jq 'select(.value.ns == "test.franck") |
{ns: .value.ns, ident: .value.ident, idxIdent: .value.idxIdent}
'
{
"ns": "test.franck",
"ident": "collection-0-6917019827977430149",
"idxIdent": {
"_id_": "index-1-6917019827977430149",
"_id_1_val1_1_val2_1_val3_1_msg_1": "index-2-6917019827977430149"
}
}
ident is the WiredTiger table name (collection-...) for the collection documents. Every collection has a primary key index on "_id" and may have additional secondary indexes, each stored in its own WiredTiger table (index-...). Each of these tables is stored as a .wt file in the data directory.
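These idents can also be cross-checked from the server side while mongod is still running. Here is a minimal sketch using pymongo (installed in the container via python3-pymongo); the collStats command output should contain the same table names under wiredTiger.uri and indexDetails:
from pymongo import MongoClient

# Sketch: ask a running mongod for the WiredTiger idents behind a collection.
# Assumes the default local mongod started in the container setup above.
client = MongoClient("mongodb://localhost:27017")
stats = client.test.command("collStats", "franck", indexDetails=True)

# Expected to print something like "statistics:table:collection-0-6917019827977430149"
print(stats["wiredTiger"]["uri"])
# And the per-index tables, e.g. "statistics:table:index-1-6917019827977430149"
for name, detail in stats["indexDetails"].items():
    print(name, detail["uri"])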
Collection
Using the WiredTiger table name for the collection, I dump its content (keys and values) and decode it as JSON:
wt -h /data/db dump -x table:collection-0-6917019827977430149 |
wt_to_mdb_bson.py -m dump -j
{"key": "81", "value": {"_id": "aaa", "val1": "xxx", "val2": "yyy", "val3": "zzz", "msg": "hello world"}}
{"key": "82", "value": {"_id": "bbb", "val1": "xxx", "val2": "yyy", "val3": "zzz", "msg": ["hello", "world"]}}
{"key": "83", "value": {"_id": "ccc", "val1": "xxx", "val2": "yyy", "val3": "zzz", "msg": ["hello", "world", "hello", "again"]}}
The "key" here is the recordId — an internal, unsigned 64-bit integer MongoDB uses (when not using clustered collections) to order documents in the collection table. The 0x80 offset is because the storage key is stored as a signed 8‑bit integer, but encoded in an order-preserving way.
I can also use wt_binary_decode.py to look at the pages in the file. Here is the leaf page (page type: 7 (WT_PAGE_ROW_LEAF)) that contains my three documents as six key and value cells (ncells (oflow len): 6):
wt_binary_decode.py --offset 4096 --page 1 --verbose --split --bson /data/db/collection-0-6917019827977430149.wt
/data/db/collection-0-6917019827977430149.wt, position 0x1000/0x5000, pagelimit 1
Decode at 4096 (0x1000)
0: 00 00 00 00 00 00 00 00 1f 0f 00 00 00 00 00 00 5f 01 00 00
06 00 00 00 07 04 00 01 00 10 00 00 64 0a ec 4b 01 00 00 00
Page Header:
recno: 0
writegen: 3871
memsize: 351
ncells (oflow len): 6
page type: 7 (WT_PAGE_ROW_LEAF)
page flags: 0x4
version: 1
Block Header:
disk_size: 4096
checksum: 0x4bec0a64
block flags: 0x1
0: 28: 05 81
desc: 0x5 short key 1 bytes:
<packed 1 (0x1)>
1: 2a: 80 91 51 00 00 00 02 5f 69 64 00 04 00 00 00 61 61 61 00 02
76 61 6c 31 00 04 00 00 00 78 78 78 00 02 76 61 6c 32 00 04
00 00 00 79 79 79 00 02 76 61 6c 33 00 04 00 00 00 7a 7a 7a
00 02 6d 73 67 00 0c 00 00 00 68 65 6c 6c 6f 20 77 6f 72 6c
64 00 00
cell is valid BSON
{ '_id': 'aaa',
'msg': 'hello world',
'val1': 'xxx',
'val2': 'yyy',
'val3': 'zzz'}
2: 7d: 05 82
desc: 0x5 short key 1 bytes:
<packed 2 (0x2)>
3: 7f: 80 a0 60 00 00 00 02 5f 69 64 00 04 00 00 00 62 62 62 00 02
76 61 6c 31 00 04 00 00 00 78 78 78 00 02 76 61 6c 32 00 04
00 00 00 79 79 79 00 02 76 61 6c 33 00 04 00 00 00 7a 7a 7a
00 04 6d 73 67 00 1f 00 00 00 02 30 00 06 00 00 00 68 65 6c
6c 6f 00 02 31 00 06 00 00 00 77 6f 72 6c 64 00 00 00
cell is valid BSON
{ '_id': 'bbb',
'msg': ['hello', 'world'],
'val1': 'xxx',
'val2': 'yyy',
'val3': 'zzz'}
4: e1: 05 83
desc: 0x5 short key 1 bytes:
<packed 3 (0x3)>
5: e3: 80 ba 7a 00 00 00 02 5f 69 64 00 04 00 00 00 63 63 63 00 02
76 61 6c 31 00 04 00 00 00 78 78 78 00 02 76 61 6c 32 00 04
00 00 00 79 79 79 00 02 76 61 6c 33 00 04 00 00 00 7a 7a 7a
00 04 6d 73 67 00 39 00 00 00 02 30 00 06 00 00 00 68 65 6c
6c 6f 00 02 31 00 06 00 00 00 77 6f 72 6c 64 00 02 32 00 06
00 00 00 68 65 6c 6c 6f 00 02 33 00 06 00 00 00 61 67 61 69
6e 00 00 00
cell is valid BSON
{ '_id': 'ccc',
'msg': ['hello', 'world', 'hello', 'again'],
'val1': 'xxx',
'val2': 'yyy',
'val3': 'zzz'}
The script shows the raw hexadecimal bytes for the key, a description of the cell type, and the decoded logical value using WiredTiger’s order‑preserving integer encoding (packed int encoding). In this example, the raw byte 0x81 decodes to record ID 1:
0: 28: 05 81
desc: 0x5 short key 1 bytes:
<packed 1 (0x1)>
Here is the branch page (page type: 6 (WT_PAGE_ROW_INT)) that references it:
wt_binary_decode.py --offset 8192 --page 1 --verbose --split --bson /data/db/collection-0-6917019827977430149.wt
/data/db/collection-0-6917019827977430149.wt, position 0x2000/0x5000, pagelimit 1
Decode at 8192 (0x2000)
0: 00 00 00 00 00 00 00 00 20 0f 00 00 00 00 00 00 34 00 00 00
02 00 00 00 06 00 00 01 00 10 00 00 21 df 20 d6 01 00 00 00
Page Header:
recno: 0
writegen: 3872
memsize: 52
ncells (oflow len): 2
page type: 6 (WT_PAGE_ROW_INT)
page flags: 0x0
version: 1
Block Header:
disk_size: 4096
checksum: 0xd620df21
block flags: 0x1
0: 28: 05 00
desc: 0x5 short key 1 bytes:
""
1: 2a: 38 00 87 80 81 e4 4b eb ea 24
desc: 0x38 addr (leaf no-overflow) 7 bytes:
<packed 0 (0x0)> <packed 1 (0x1)> <packed 1273760356 (0x4bec0a64)>
As we have seen in the previous blog post, the pointer includes the checksum of the page it references (0x4bec0a64) to detect disk corruption.
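To see that value in the raw file, you can read it straight out of the page image. A minimal sketch, assuming the layout shown in the dumps above (a 28-byte page header followed by a block header of disk_size, checksum, and flags, all little-endian):
import struct

# Sketch: read the checksum stored in a page's block header.
# Offsets assume the 28-byte page header seen above, followed by
# disk_size (u32) and checksum (u32), little-endian.
def stored_checksum(path: str, offset: int) -> int:
    with open(path, "rb") as f:
        f.seek(offset)
        page = f.read(40)
    disk_size, checksum = struct.unpack_from("<II", page, 28)  # block header fields
    return checksum

# The leaf page at 0x1000 should report 0x4bec0a64, the same value packed
# into the parent's address cell on the branch page at 0x2000.
print(hex(stored_checksum("/data/db/collection-0-6917019827977430149.wt", 0x1000)))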
Another utility, bsondump, can be used to display the output of wt dump -x as JSON, like wt_to_mdb_bson.py, but requires some filtering to get the BSON content:
wt -h /data/db dump -x table:collection-0-6917019827977430149 | # dump in hex
egrep '025f696400' | # all documents have an "_id" field
xxd -r -p | # get the plain binary data
bsondump --type=json # display the BSON as JSON
{"_id":"aaa","val1":"xxx","val2":"yyy","val3":"zzz","msg":"hello world"}
{"_id":"bbb","val1":"xxx","val2":"yyy","val3":"zzz","msg":["hello","world"]}
{"_id":"ccc","val1":"xxx","val2":"yyy","val3":"zzz","msg":["hello","world","hello","again"]}
2025-09-14T08:57:36.182+0000 3 objects found
It also provides a debug output type that gives more insight into how documents are stored internally, especially documents with arrays:
wt -h /data/db dump -x table:collection-0-6917019827977430149 | # dump in hex
egrep '025f696400' | # all documents have an "_id" field
xxd -r -p | # get the plain binary data
bsondump --type=debug # display BSON as it is stored
--- new object ---
size : 81
_id
type: 2 size: 13
val1
type: 2 size: 14
val2
type: 2 size: 14
val3
type: 2 size: 14
msg
type: 2 size: 21
--- new object ---
size : 96
_id
type: 2 size: 13
val1
type: 2 size: 14
val2
type: 2 size: 14
val3
type: 2 size: 14
msg
type: 4 size: 36
--- new object ---
size : 31
0
type: 2 size: 13
1
type: 2 size: 13
--- new object ---
size : 122
_id
type: 2 size: 13
val1
type: 2 size: 14
val2
type: 2 size: 14
val3
type: 2 size: 14
msg
type: 4 size: 62
--- new object ---
size : 57
0
type: 2 size: 13
1
type: 2 size: 13
2
type: 2 size: 13
3
type: 2 size: 13
2025-09-14T08:59:15.268+0000 3 objects found
Arrays in BSON are just sub-documents whose field names are the array positions ("0", "1", ...).
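You can check this with the bson package that ships with pymongo (python3-bson in the container setup). A small sketch:
import bson

# Sketch: encode an array field and look at the raw bytes. The array is
# written as type 0x04 wrapping an embedded document whose field names are
# the positions "0", "1", ...
raw = bson.encode({"msg": ["hello", "world"]})
print(raw.hex())
# Expect to see ...04 6d 73 67 00 ... 02 30 00 06 00 00 00 68 65 6c 6c 6f 00
# 02 31 00 06 00 00 00 77 6f 72 6c 64 00 ... which matches the cell dumps above.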
Primary index
RecordId is an internal, logical key used in the B-tree that stores the collection. It allows documents to be physically moved without fragmentation when they're updated. All indexes reference documents by recordId, not by their physical location. Access by "_id" requires a unique index, created automatically with the collection and stored as another WiredTiger table. Here is the content:
wt -h /data/db dump -p table:index-1-6917019827977430149
WiredTiger Dump (WiredTiger Version 12.0.0)
Format=print
Header
table:index-1-6917019827977430149
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=8),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=,block_manager=default,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,disaggregated=(page_log=),encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(compare_timestamp=oldest_timestamp,enabled=false,file_metadata=,metadata_file=,panic_corrupt=true,repair=false),in_memory=false,ingest=,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=16k,key_format=u,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=16k,leaf_value_max=0,log=(enabled=true),lsm=(auto_throttle=,bloom=,bloom_bit_count=,bloom_config=,bloom_hash_count=,bloom_oldest=,chunk_count_limit=,chunk_max=,chunk_size=,merge_max=,merge_min=),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=true,prefix_compression_min=4,source="file:index-1-6917019827977430149.wt",split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,stable=,tiered_storage=(auth_token=,bucket=,bucket_prefix=,cache_directory=,local_retention=300,name=,object_target_size=0),type=file,value_format=u,verbose=[],write_timestamp_usage=none
Data
<aaa\00\04
\00\08
<bbb\00\04
\00\10
<ccc\00\04
\00\18
There are three entries, one per document, with the "_id" value (aaa, bbb, ccc) as the key and the recordId as the value. The keys are packed in MongoDB's KeyString format (see documentation); for example, the leading < (byte 0x3C) is the type marker that prefixes each string value.
In MongoDB's KeyString format, the recordId is stored in a special packed encoding where three bits are appended to the right of the big-endian value, so that the length can be stored at the end of the key. The same encoding is used when the recordId is in the value part of the index entry, as in a unique index. To decode it, shift the last byte right by three bits: here, 0x08 >> 3 = 1, 0x10 >> 3 = 2, and 0x18 >> 3 = 3, which are the recordIds of my documents.
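A tiny sketch of that decoding step, for the common case where the recordId suffix fits in a single byte:
# Sketch: recover the recordId from the last byte of a KeyString-encoded
# index entry, assuming the single-byte case where the low three bits only
# carry length information.
def record_id_from_last_byte(last_byte: int) -> int:
    return last_byte >> 3

for raw in (0x08, 0x10, 0x18):
    print(hex(raw), "->", record_id_from_last_byte(raw))  # prints 1, 2, 3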
I decode the page that contains those index entries:
wt_binary_decode.py --offset 4096 --page 1 --verbose --split /data/db/index-1-6917019827977430149.wt
/data/db/index-1-6917019827977430149.wt, position 0x1000/0x5000, pagelimit 1
Decode at 4096 (0x1000)
0: 00 00 00 00 00 00 00 00 1f 0f 00 00 00 00 00 00 46 00 00 00
06 00 00 00 07 04 00 01 00 10 00 00 7c d3 87 60 01 00 00 00
Page Header:
recno: 0
writegen: 3871
memsize: 70
ncells (oflow len): 6
page type: 7 (WT_PAGE_ROW_LEAF)
page flags: 0x4
version: 1
Block Header:
disk_size: 4096
checksum: 0x6087d37c
block flags: 0x1
0: 28: 19 3c 61 61 61 00 04
desc: 0x19 short key 6 bytes:
"<aaa"
1: 2f: 0b 00 08
desc: 0xb short val 2 bytes:
"
2: 32: 19 3c 62 62 62 00 04
desc: 0x19 short key 6 bytes:
"<bbb"
3: 39: 0b 00 10
desc: 0xb short val 2 bytes:
""
4: 3c: 19 3c 63 63 63 00 04
desc: 0x19 short key 6 bytes:
"<ccc"
5: 43: 0b 00 18
desc: 0xb short val 2 bytes:
""
This utility doesn't decode the recordId; we need to shift it ourselves. There's no BSON to decode in the indexes.
Secondary index
Secondary indexes are similar, except that they can be composed of multiple fields, and any indexed field can contain an array, which may result in multiple index entries for a single document, like an inverted index.
MongoDB tracks which indexed fields contain arrays to improve query planning. A multikey index creates an entry for each array element, and if multiple fields are multikey, it stores entries for all combinations of their values. By knowing exactly which fields are multikey, the query planner can apply tighter index bounds when only one field is involved. This information is stored in the catalog as a "multikey" flag along with the specific "multikeyPaths":
wt -h /data/db dump -x table:_mdb_catalog |
wt_to_mdb_bson.py -m dump -j |
jq 'select(.value.ns == "test.franck") |
.value.md.indexes[] |
{name: .spec.name, key: .spec.key, multikey: .multikey, multikeyPaths: .multikeyPaths | keys}
'
{
"name": "_id_",
"key": {
"_id": { "$numberInt": "1" },
},
"multikey": false,
"multikeyPaths": [
"_id"
]
}
{
"name": "_id_1_val1_1_val2_1_val3_1_msg_1",
"key": {
"_id": { "$numberInt": "1" },
"val1": { "$numberInt": "1" },
"val2": { "$numberInt": "1" },
"val3": { "$numberInt": "1" },
"msg": { "$numberInt": "1" },
},
"multikey": true,
"multikeyPaths": [
"_id",
"msg",
"val1",
"val2",
"val3"
]
}
Here is the dump of my index on {_id:1,val1:1,val2:1,val3:1,msg:1}:
wt -h /data/db dump -p table:index-2-6917019827977430149
WiredTiger Dump (WiredTiger Version 12.0.0)
Format=print
Header
table:index-2-6917019827977430149
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=8),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=,block_manager=default,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,disaggregated=(page_log=),encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(compare_timestamp=oldest_timestamp,enabled=false,file_metadata=,metadata_file=,panic_corrupt=true,repair=false),in_memory=false,ingest=,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=16k,key_format=u,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=16k,leaf_value_max=0,log=(enabled=true),lsm=(auto_throttle=,bloom=,bloom_bit_count=,bloom_config=,bloom_hash_count=,bloom_oldest=,chunk_count_limit=,chunk_max=,chunk_size=,merge_max=,merge_min=),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=true,prefix_compression_min=4,source="file:index-2-6917019827977430149.wt",split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,stable=,tiered_storage=(auth_token=,bucket=,bucket_prefix=,cache_directory=,local_retention=300,name=,object_target_size=0),type=file,value_format=u,verbose=[],write_timestamp_usage=none
Data
<aaa\00<xxx\00<yyy\00<zzz\00<hello world\00\04\00\08
(null)
<bbb\00<xxx\00<yyy\00<zzz\00<hello\00\04\00\10
(null)
<bbb\00<xxx\00<yyy\00<zzz
September 13, 2025
Setsum - order agnostic, additive, subtractive checksum
A brief introduction to Setsum - order agnostic, additive, subtractive checksum
September 12, 2025
Postgres High Availability with CDC
Why a lagging client can stall or break failover, and how MySQL’s GTID model avoids it.
September 11, 2025
Postgres 18rc1 vs sysbench
This post has results for Postgres 18rc1 vs sysbench on small and large servers. Results for Postgres 18beta3 are here for a small and large server.
tl;dr
- Postgres 18 looks great
- I continue to see small CPU regressions in Postgres 18 for range queries that don't do aggregation on low-concurrency workloads. I have yet to explain that.
- The throughput for the scan microbenchmark has more variance with Postgres 18. I assume this is related to more or less work getting done by vacuum but I have yet to debug the root cause.
Builds, configuration and hardware
I compiled Postgres from source for versions 17.6, 18 beta3 and 18 rc1.
The servers are:
- small
  - an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
- large32
  - a Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32 Cores and AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard.
- large48
  - an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  - 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  - 128G RAM
  - Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic)
All configurations use synchronous IO, which is the only option prior to Postgres 18; for Postgres 18 the config file sets io_method=sync.
Configuration files:
- small server
  - for versions 17.6 (x10a_c8r32) and 18 (x10b_c8r32)
- large servers
  - for versions 17.6 (x10a_c32r128) and 18 (x10b_c32r128)
Benchmark
I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.
For all servers the read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
The number of tables and rows per table was:
- small server - 1 table, 50M rows
- large servers - 8 tables, 10M rows per table
The number of clients (amount of concurrency) was:
- small server - 1
- large32 server - 24
- large48 server - 40
Results
The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
When the relative QPS is > 1 then some version is faster than PG 17.6. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.
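For clarity, a toy example of how those relative QPS values are computed (the numbers here are made up for illustration):
# Sketch: relative QPS as used in the tables below.
def relative_qps(qps_version: float, qps_pg176: float) -> float:
    return qps_version / qps_pg176

print(round(relative_qps(9700.0, 10000.0), 2))   # 0.97 -> possible regression
print(round(relative_qps(13500.0, 10000.0), 2))  # 1.35 -> faster than 17.6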
The numbers highlighted in yellow below might be from a small regression for range queries that don't do aggregation. But note that this does not reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.
Results: small server
I continue to see small (~3%) regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. But I have yet to debug this and am not certain it is a regression. I am also skeptical about the great results for scan. I suspect that I have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).
Relative to: Postgres 17.6
col-1 : 18beta3
col-2 : 18rc1
col-1   col-2   point queries
1.01    0.98    hot-points_range=100
1.01    1.00    point-query_range=100
1.02    1.02    points-covered-pk_range=100
0.99    1.01    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
1.00    0.99    points-notcovered-si_range=100
1.01    1.00    random-points_range=1000
1.01    0.99    random-points_range=100
1.01    1.00    random-points_range=10
col-1   col-2   range queries without aggregation
0.97    0.96    range-covered-pk_range=100
0.97    0.97    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
0.99    0.99    range-notcovered-si_range=100
1.35    1.36    scan_range=100
col-1   col-2   range queries with aggregation
1.02    1.03    read-only-count_range=1000
1.00    1.00    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
1.00    0.99    read-only_range=100
0.99    0.98    read-only_range=10
1.01    1.01    read-only-simple_range=1000
1.02    1.00    read-only-sum_range=1000
col-1   col-2   writes
0.99    0.99    delete_range=100
0.99    1.01    insert_range=100
0.99    0.99    read-write_range=100
0.99    0.99    read-write_range=10
0.98    0.98    update-index_range=100
1.00    0.99    update-inlist_range=100
0.98    0.98    update-nonindex_range=100
0.98    0.97    update-one_range=100
0.98    0.97    update-zipf_range=100
0.99    0.98    write-only_range=10000
Results: large32 server
I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.
The improvements on the scan microbenchmark come from using less CPU. But I am skeptical about the improvements. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).
Relative to: Postgres 17.6
col-1 : Postgres 18rc1
col-1   point queries
1.01    hot-points_range=100
1.01    point-query_range=100
1.01    points-covered-pk_range=100
1.01    points-covered-si_range=100
1.00    points-notcovered-pk_range=100
1.00    points-notcovered-si_range=100
1.01    random-points_range=1000
1.00    random-points_range=100
1.01    random-points_range=10
col-1   range queries without aggregation
0.99    range-covered-pk_range=100
0.99    range-covered-si_range=100
0.99    range-notcovered-pk_range=100
0.99    range-notcovered-si_range=100
1.12    scan_range=100
col-1   range queries with aggregation
1.00    read-only-count_range=1000
1.02    read-only-distinct_range=1000
1.01    read-only-order_range=1000
1.03    read-only_range=10000
1.00    read-only_range=100
1.00    read-only_range=10
1.00    read-only-simple_range=1000
1.00    read-only-sum_range=1000
col-1   writes
1.01    delete_range=100
1.00    insert_range=100
1.00    read-write_range=100
1.00    read-write_range=10
1.00    update-index_range=100
1.00    update-inlist_range=100
1.00    update-nonindex_range=100
0.99    update-one_range=100
1.00    update-zipf_range=100
1.00    write-only_range=10000
Results: large48 server
I don't see small regressions in throughput for range queries without aggregation across Postgres 18 beta1, beta2, beta3 and rc1. I have only seen that on the low concurrency (small server) results.
I am skeptical about the regression I see here for scan. That comes from using ~10% more CPU per query. I might have more work to do to make the benchmark less subject to variance from MVCC GC (vacuum here). I also struggle with that on RocksDB (compaction), but not on InnoDB (purge).
I have not seen the large improvements for the insert and delete microbenchmarks in previous tests on this large server. I assume this is another case where I need to figure out how to reduce variance when I run the benchmark.
Relative to: Postgres 17.6
col-1 : Postgres 18beta3
col-2 : Postgres 18rc1
col-1   col-2   point queries
0.99    0.99    hot-points_range=100
0.99    0.99    point-query_range=100
1.00    0.99    points-covered-pk_range=100
0.99    1.02    points-covered-si_range=100
1.00    0.99    points-notcovered-pk_range=100
0.99    1.01    points-notcovered-si_range=100
1.00    0.99    random-points_range=1000
1.00    0.99    random-points_range=100
1.00    1.00    random-points_range=10

col-1   col-2   range queries without aggregation
0.99    0.99    range-covered-pk_range=100
0.98    0.99    range-covered-si_range=100
0.99    0.99    range-notcovered-pk_range=100
1.01    1.01    range-notcovered-si_range=100
0.91    0.91    scan_range=100

col-1   col-2   range queries with aggregation
1.04    1.03    read-only-count_range=1000
1.02    1.01    read-only-distinct_range=1000
1.01    1.00    read-only-order_range=1000
1.06    1.06    read-only_range=10000
0.98    0.97    read-only_range=100
0.99    0.99    read-only_range=10
1.02    1.02    read-only-simple_range=1000
1.03    1.03    read-only-sum_range=1000

col-1   col-2   writes
1.46    1.49    delete_range=100
1.32    1.32    insert_range=100
0.99    1.00    read-write_range=100
0.98    1.00    read-write_range=10
0.99    1.00    update-index_range=100
0.95    1.03    update-inlist_range=100
1.00    1.02    update-nonindex_range=100
0.96    1.04    update-one_range=100
1.00    1.01    update-zipf_range=100
1.00    1.00    write-only_range=10000
Why we maintain a ClickHouse® fork at Tinybird (and how it's different)
We've long contributed to the open source ClickHouse community, and we didn't make this decision lightly. Here's why we maintain our own private ClickHouse fork.
September 09, 2025
Beyond EOL: The Real Benefits of Upgrading to MySQL 8.4
Right now, you’re probably hoping someone else will deal with this MySQL 8.0 end-of-life situation. Maybe your team can squeeze another few months out of it. Maybe Oracle will extend support. Maybe it won’t be as bad as everyone says. We get it. You’ve got enough things going on without adding “major database upgrade” to […]
September 08, 2025
Resilience of MongoDB's WiredTiger Storage Engine to Disk Failure Compared to PostgreSQL and Oracle
There have been jokes that have contributed to persistent myths about MongoDB's durability. The authors of those myths ignore that MongoDB's storage engine is among the most robust in the industry, and it's easy to demonstrate. MongoDB uses WiredTiger (created by the same author as Berkeley DB), which provides block corruption protection stronger than that of many other databases. In this article I'll show how to reproduce a simple write loss, in a lab, and see how the database detects it to avoid returning corrupt data.
I like comparing the trade-offs and implementations of various databases, but I stick to discussing those I'm familiar with. Let's deliberately corrupt PostgreSQL, Oracle, and MongoDB at the storage level to see what happens. This experiment can be reproduced in a lab and your comments are welcome.
PostgreSQL
To expose the issue when a database doesn't detect lost writes, I chose PostgreSQL for this demonstration. As of version 18, PostgreSQL enables checksums by default. I'm testing it with the Release Candidate in a Docker lab:
docker run --rm -it --cap-add=SYS_PTRACE postgres:18rc1 bash
# Install some utilities
apt update -y && apt install -y strace
# Start PostgreSQL
POSTGRES_PASSWORD=x \
strace -fy -e trace=pread64,pwrite64 \
docker-entrypoint.sh postgres &
# Connect to PostgreSQL
psql -U postgres
I've started PostgreSQL, tracing the read and write calls with strace.
I check that block checksum is enabled:
postgres=# show data_checksums;
data_checksums
----------------
on
(1 row)
I create a demo table and insert random data:
create table demo (k int primary key, v text);
copy demo from program $$
cat /dev/urandom |
base64 |
head -10000 |
awk '{print NR"\t"$0}'
$$ ;
vacuum demo;
checkpoint;
create extension pg_buffercache;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
I triggered a checkpoint to write to disk and flushed the shared buffers, so that I can see the read IO for the next query:
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+------------------------------------------------------------------------------
999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)
This has read two pages from the index, and then the page that contains the row I'm querying. This page is in base/5/16388 at offset 114688.
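As a quick sanity check (my own arithmetic, not part of the original trace), the pread offset maps to a PostgreSQL block number by dividing by the 8 KiB page size, which matches the block reported in the error messages further below:

# PostgreSQL heap and index files are split into 8 KiB pages, so a file
# offset maps directly to a block number.
PAGE_SIZE = 8192
offset = 114688
print(offset // PAGE_SIZE)  # 14 -> "block 14 of relation base/5/16388"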
I use dd to save the content of this page, then update the row 999, trigger a checkpoint, and flush the buffer cache:
\! dd if=/var/lib/postgresql/18/docker/base/5/16388 of=block1.tmp bs=1 skip=114688 count=8192
update demo set v='xxxxxx' where k=999;
checkpoint;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
I query the row again, and it shows the updated value:
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10m\261\1K~\0\08\1\200\1\0 \4 \2\3\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+--------
999 | xxxxxx
(1 row)
It reads the same index pages, and the leaf points to the same table page at offset 114688, which now holds the new value.
To simulate a lost write (a form of disk corruption), I copy the previous version of the block back to this location, and query again:
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=114688 conv=notrunc
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+------------------------------------------------------------------------------
999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)
There's no error because this block passes checksum verification: it is a valid block, just a stale version of the page at this location. It also has the right structure, since it belongs to the same table. However, it shows a row version that should no longer be there. This is the kind of error that can happen when the storage loses a write and the previous version of the block remains on disk.
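To make the failure mode concrete, here is a toy sketch (my own illustration, not PostgreSQL's or WiredTiger's actual code) contrasting a per-page checksum, which a stale but self-consistent page still satisfies, with an expected checksum stored in the parent pointer, which catches the stale page:

import zlib  # zlib.crc32 stands in for the real checksum algorithms

def write_page(payload: bytes) -> dict:
    # A self-describing page stores its checksum next to its own content.
    return {"payload": payload, "checksum": zlib.crc32(payload)}

def verify_page(page: dict) -> bool:
    # Per-page verification, the kind of check demonstrated above.
    return zlib.crc32(page["payload"]) == page["checksum"]

old_page = write_page(b"k=999 -> MQIEZ...")  # version written before the update
new_page = write_page(b"k=999 -> xxxxxx")    # version written after the update

on_disk = old_page                  # a lost write leaves the old page on disk
print(verify_page(on_disk))         # True: stale but self-consistent

# A parent pointer that carries the expected checksum of its child page
# (like WiredTiger's address cookie) catches the stale page:
print(on_disk["checksum"] == new_page["checksum"])  # False: mismatch detected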
Checksums are still useful when the corruption is not aligned with well-formed blocks. For example, I overwrite the second half of the page (file offset 118784) with the first half of the previously saved page:
checkpoint;
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=118784 count=4096 conv=notrunc
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
2025-09-08 17:58:41.876 UTC [161] LOG: page verification failed, calculated checksum 20176 but expected 45250
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] LOG: invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] ERROR: invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
ERROR: invalid page in block 14 of relation "base/5/16388"
Here, the calculated checksum does not match the expected one, and an error is raised. PostgreSQL checksums can detect some block corruption, but it is still possible for a bug, or a malicious user with access to the filesystem, to change the data without being detected.
Oracle Database
To detect lost writes like the one I simulated above, Oracle Database compares the block checksum with the standby databases, as there is a low chance that the corruption happened in both environments. I've demonstrated this in the past with a similar demo: 18c new Lost Write Protection
WiredTiger (MongoDB storage engine)
MongoDB employs the WiredTiger storage engine, which is designed to detect lost writes and disk failures that might return the wrong page. To achieve this, WiredTiger stores a checksum alongside each page address within the pointers between the B-tree pages, in an address cookie:
An address cookie is an opaque set of bytes returned by the block manager to reference a block in a B-tree file. It includes an offset, size, checksum, and object id.
In my lab, I first start a MongoDB container and compile wt, a command-line utility that allows direct interaction with WiredTiger files. This tool enables me to examine the storage engine without relying on the MongoDB query layer, and I'll use it in this series of blog posts:
docker run --rm -it --cap-add=SYS_PTRACE mongo bash
# install required packages
apt-get update && apt-get install -y git xxd strace curl jq python3 python3-dev python3-pip python3-venv python3-bson build-essential cmake gcc g++ libstdc++-12-dev libtool autoconf automake swig liblz4-dev zlib1g-dev libmemkind-dev libsnappy-dev libsodium-dev libzstd-dev
# get latest WiredTiger
curl -L $(curl -s https://api.github.com/repos/wiredtiger/wiredtiger/releases/latest | jq -r '.tarball_url') -o wiredtiger.tar.gz
# Compile
mkdir /wiredtiger && tar -xzf wiredtiger.tar.gz --strip-components=1 -C /wiredtiger ; cd /wiredtiger
mkdir build && cmake -S /wiredtiger -B /wiredtiger/build \
-DHAVE_BUILTIN_EXTENSION_SNAPPY=1 \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_WERROR=0 \
-DENABLE_QPL=0 \
-DCMAKE_C_FLAGS="-O0 -Wno-error -Wno-format-overflow -Wno-error=array-bounds -Wno-error=format-overflow -Wno-error=nonnull" \
-DPYTHON_EXECUTABLE=$(which python3)
cmake --build /wiredtiger/build
export PATH=$PATH:/wiredtiger/build:/wiredtiger/tools
It takes some time, but the wt command line utility will make the investigation easier. That's an advantage of MongoDB pluggable storage—you can examine it in layers.
I create a demo table and insert 10,000 records:
root@7d6d105a1663:/tmp# wt create table:demo
root@7d6d105a1663:/tmp# ls -alrt
total 68
drwxr-xr-x. 1 root root 4096 Sep 8 19:21 ..
-rw-r--r--. 1 root root 21 Sep 8 19:35 WiredTiger.lock
-rw-r--r--. 1 root root 50 Sep 8 19:35 WiredTiger
-rw-r--r--. 1 root root 299 Sep 8 19:35 WiredTiger.basecfg
-rw-r--r--. 1 root root 4096 Sep 8 19:35 demo.wt
-rw-r--r--. 1 root root 4096 Sep 8 19:35 WiredTigerHS.wt
-rw-r--r--. 1 root root 1475 Sep 8 19:35 WiredTiger.turtle
-rw-r--r--. 1 root root 32768 Sep 8 19:35 WiredTiger.wt
drwxrwxrwt. 1 root root 4096 Sep 8 19:35 .
root@7d6d105a1663:/tmp# wt list
colgroup:demo
file:demo.wt
table:demo
cat /dev/urandom |
base64 |
head -10000 |
awk '{print "i",NR,$0}' |
wt dump -e table:demo
...
Inserted key '9997' and value 'ILZeUq/u/ErLB/i7LOUb4nwYP6D535trb8Mt3vcJXXRAqLeAiYIHn5bEWs1buflmiZMYd3rMMvhh'.
Inserted key '9998' and value 'y+b0eTV/4Ao12qRqtHhgP2xGUr+C9ZOfvOG3ZwbdDNXvpnbM1/laoJ9Yzyt6cbLJOR6jdQktpgFM'.
Inserted key '9999' and value 'cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c'.
Inserted key '10000' and value 'QYwyjaRxa9Q+5dvzwQtvv2QE/uS/vhRPCVsQ6p7re/L2yDrVRxyqkvSyMHeRCzMIsIovrCUJpPXI'.
I read record 9999 with wt and use strace to see the read calls:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 806912) = 28672
9999
cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c
...
This record is at offset 806912 in a 28672-byte block. I “save” this block with dd:
dd if=demo.wt of=block1.tmp bs=1 skip=806912 count=28672
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0680832 s, 421 kB/s
I update this record to "xxxxxx" and trace the write calls:
strace -yy -e trace=pwrite64 -xs 36 wt dump -e table:demo <<<"u 9999 xxxxxx"
...
Updated key '9999' to value 'xxxxxx'.
pwrite64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
pwrite64(6</tmp/demo.wt>, "\x00...\x4d\x19\x14\x4e"..., 4096, 876544) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x4f\x74\xdb\x1e"..., 4096, 880640) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x21\xcc\x25\x06"..., 4096, 884736) = 4096
...
This writes a new 28672-byte block (WiredTiger does not write in place, which helps avoid corruption) and updates the B-tree branches.
I can read this new value:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
9999
xxxxxx
...
To simulate disk corruption, I do the same as I did on PostgreSQL: replace the current block with the old one. I save the current block before overwriting it:
dd if=demo.wt of=block2.tmp bs=1 skip=847872 count=28672
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0688249 s, 417 kB/s
dd of=demo.wt if=block1.tmp bs=1 seek=847872 conv=notrunc
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0666375 s, 430 kB/s
If I try to read the record in this block, the corruption is detected:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 847872) = 28672
[1757361305:392519][8246:0x7fe9a087e740], wt, file:demo.wt, WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __wti_block_read_off, 279: demo.wt: potential hardware corruption, read checksum error for 28672B block at offset 847872: block header checksum of 0x4177631f doesn't match expected checksum of 0xf51fb102
[1757361305:392904][8246:0x7fe9a087e740], wt, file:demo.wt, WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __bm_corrupt_dump, 86: {0: 847872, 28672, 0xf51fb102}: (chunk 1 of 28): 00 00 00 00 00 00 00 00 3a 00 00 00 00 00 00 00 40 6e 00 00 a8 02 00 00 07 04 00 01 00 70 00 00 1f 63 77 41 01 00 00 00 11 39 36 39 33 80 8c 43 38 30 58 51 66 64
...
Even though the block's own checksum is internally consistent, WiredTiger detects that the block header checksum, 0x4177631f (visible as 1f 63 77 41 in the hexadecimal dump and as \x1f\x63\x77\x41 in the read trace), differs from the expected checksum 0xf51fb102 carried by the address cookie.
0xf51fb102 was visible as \x02\xb1\x1f\xf5 in the write call of the update, and as 02 b1 1f f5 in the block that I saved before overwriting it:
root@7d6d105a1663:/tmp# xxd -l 36 block2.tmp
00000000: 0000 0000 0000 0000 5c00 0000 0000 0000 ........\.......
00000010: f96d 0000 a802 0000 0704 0001 0070 0000 .m...........p..
00000020: 02b1 1ff5
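To double-check the byte order (a quick sanity check of my own, not part of the original post): the checksums are stored little-endian on disk, so the four bytes above decode back to the values in the error message:

# Little-endian decoding of the checksum bytes seen in the dump and the traces.
assert int.from_bytes(bytes.fromhex("02b11ff5"), "little") == 0xf51fb102  # expected checksum
assert int.from_bytes(bytes.fromhex("1f637741"), "little") == 0x4177631f  # stale block's checksum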
Even with access to the files, it would be extremely difficult to corrupt the data in an undetected way, because any change must update the block's checksum, and that checksum is also referenced by the address cookies in the blocks that point to it. Undetected block corruption is highly unlikely: blocks are not updated in place, and a failed write would break the pointers.
WiredTiger is open source and you can check the WT_BLOCK_HEADER definition. In this structure, the block size (disk_size) field appears just before the checksum field: in the dump above, 00 70 00 00 = 0x00007000 = 28,672 bytes, followed by the checksum 02 b1 1f f5 = 0xf51fb102. One advantage of WiredTiger is that B-tree leaf blocks can have flexible sizes, which MongoDB uses to keep each document as one chunk on disk and improve data locality.
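Based on the byte offsets visible in the xxd output above (this sketch takes the offsets from the dump rather than from the WiredTiger headers, so treat the 28-byte prefix as an assumption), the two fields can be pulled out of the saved block like this:

import struct

# In block2.tmp above, the 4-byte disk_size field (00 70 00 00) starts at
# byte offset 28 and is immediately followed by the 4-byte checksum
# (02 b1 1f f5). Both are little-endian.
with open("block2.tmp", "rb") as f:
    header = f.read(36)

disk_size, checksum = struct.unpack_from("<II", header, 28)
print(hex(disk_size), hex(checksum))  # 0x7000 (28672 bytes) 0xf51fb102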
Checksum verification is implemented in block_read.c and performs two validations (sketched below):
- It checks that the checksum stored in the block header matches the expected checksum from the address cookie (the B-tree pointer created when the block was written).
- It zeroes out the checksum field in the header and recomputes the checksum over the block content, verifying it also matches the expected checksum. This ensures both the block’s integrity and its identity.
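Here is a rough sketch of those two checks in Python. It is not WiredTiger's code: it assumes the checksum field sits at the offset observed above and uses the third-party crc32c package as a stand-in for WiredTiger's CRC32C implementation:

import struct
import crc32c  # pip install crc32c; a stand-in for WiredTiger's CRC32C

def verify_block(block: bytes, cookie_checksum: int, checksum_offset: int = 32) -> bool:
    # Check 1: the checksum stored in the block header must match the
    # expected checksum carried by the address cookie (the B-tree pointer).
    (stored,) = struct.unpack_from("<I", block, checksum_offset)
    if stored != cookie_checksum:
        return False  # a stale or misplaced block, as in the demo above
    # Check 2: zero out the checksum field and recompute the checksum over
    # the block content; it must also match the expected checksum.
    zeroed = bytearray(block)
    zeroed[checksum_offset:checksum_offset + 4] = b"\x00\x00\x00\x00"
    return crc32c.crc32c(bytes(zeroed)) == cookie_checksum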
Conclusion
PostgreSQL relies on page checksums to detect data corruption (you must enable them before version 18; they are on by default starting with version 18). A checksum detects when a page's content does not match the checksum stored in it. However, if the storage loses a write and an older, valid version of the same block remains on disk, the stale page still verifies and PostgreSQL cannot identify the issue. As a result, some disk failures may escape detection and return wrong results.
Oracle Database stores blocks with checksums and can verify them on read. With a Data Guard standby, at the cost of some network overhead, it can also compare checksums with the standby to verify data integrity when reading.
MongoDB WiredTiger enables checksums by default and can detect the wrong blocks without the need to contact replicas. It embeds the expected checksum inside the B-tree address cookie so that every internal B-tree pointer to a leaf page includes the checksum for the referenced page. If an obsolete or different page is swapped in, any mismatch will be detected because the pointer's checksum won’t match. WiredTiger uses copy-on-write, not in-place overwrites, further reducing the risk of corruption.
Here is a description of WiredTiger by Keith Bostic:
And an article by Bob Liles about how MongoDB Atlas deals with that: Managing Data Corruption in the Cloud
Disaggregation: A New Architecture for Cloud Databases
This short VLDB'25 paper surveys disaggregation for cloud databases. It has several insightful points, and I found it worth summarizing.
The key advantage of the cloud over on-prem is elastic scalability: users can scale resources up and down and pay only for what they use. Traditional database architectures, like shared-nothing, do not fully exploit this. Thus, cloud-native databases increasingly adopt disaggregated designs.
Disaggregation is primarily motivated by the asymmetry between compute and storage:
- Compute is far more expensive than storage in the cloud.
- Compute demand fluctuates quickly; storage grows slowly.
- Compute can be stateless and easier to scale, while storage is inherently stateful.
Decoupling them lets compute scale elastically while storage remains relatively stable and cheap.
Review of Disaggregation in the Clouds
Early cloud-native systems like Snowflake and Amazon Aurora separate compute and storage into independent clusters. Modern systems push disaggregation further. Socrates splits storage into three services: Logging service (small footprint, strict latency), Page cache, and Durable page store. This way each service can be tuned for its performance/cost tradeoffs. The logging service, for example, can use faster storage hardware.
Other disaggregation examples include computation pushdown (Redshift Spectrum, S3 Select), intermediate caching (Snowflake), a metadata service (Lakehouse), and memory disaggregation (PolarDB). Many other functions (indexing, concurrency control, query optimization) remain underexplored. There is room for a unified middleware layer between compute and storage that can consolidate these.
The paper doesn't mention it, but this discussion mirrors the microservices trend in systems design. Breaking monoliths into smaller, independently scalable services improves modularity and resource efficiency, and also enables better sharing and pooling of resources across workloads. We may see disaggregated databases evolve the way microservices did: first a simple split, then a bunch of fine-grained services, and eventually a need for orchestration layers, observability, and service meshes. Today it is compute and storage; tomorrow it could be dozens of database microservices (maybe even including concurrency control), stitched together by a middleware layer that looks suspiciously like Kubernetes.
Tradeoffs in disaggregated design
The main tradeoff is performance. Since disaggregated components are physically separate, the communication overhead can be high. A 2019 study shows a 10x throughput hit compared to a tuned shared-nothing system. Optimizations can help narrow the gap, but disaggregation should be applied only when its benefits outweigh the network cost. This tradeoff also motivates research into techniques that reduce communication overhead. Hello, distributed systems research!
Rethinking Core Protocols
Many distributed database protocols assume shared-nothing architecture, so with disaggregation, some of these assumptions no longer hold. This creates new opportunities, and not just more problems.
For example, 2PC normally faces a blocking problem because a failed node's log is inaccessible. With disaggregated storage, logs live in a shared, reliable service. Cornus 2PC protocol (2022) leverages this: active nodes can vote NO on behalf of failed nodes by writing directly to their logs. A compare-and-swap API ensures only one decision is recorded.
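As a toy illustration of the Cornus idea (my own sketch, not the paper's pseudocode), the disaggregated log exposes a compare-and-swap so that either a participant's own vote or a NO written on its behalf gets recorded, but never both:

# Toy model of a disaggregated log with a compare-and-swap API,
# illustrating the Cornus 2PC idea described above.
class LogService:
    def __init__(self):
        self.votes = {}  # (txn_id, participant) -> "YES" or "NO"

    def compare_and_swap(self, txn_id, participant, expected, new):
        key = (txn_id, participant)
        if self.votes.get(key) == expected:
            self.votes[key] = new
            return True
        return False  # another node already recorded a decision for this participant

log = LogService()

# The participant records its own YES vote...
log.compare_and_swap("t1", "p2", None, "YES")

# ...and if it had crashed before voting, an active node could have written
# NO on its behalf; only one of the two CAS calls can win.
won = log.compare_and_swap("t1", "p2", None, "NO")
print(won, log.votes[("t1", "p2")])  # False YES -> the participant's vote stands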
Disaggregating the Query Engine
Pushdown reduces data movement by executing operators closer to storage. This idea has been studied in database machines, Smart SSDs, and PIM, but fits especially well in the cloud. Leveraging this, PushdownDB uses S3 Select to push down both basic and advanced operators. It cuts query time by 6.7x and cost by 30%. FlexPushdownDB combines pushdown with caching so that operators like filters or hash probes can run locally on cached data and remotely via pushdown, with results merged. This hybrid mode outperforms either technique alone by 2.2x.
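For a concrete sense of what pushdown looks like, here is a minimal boto3 sketch using S3 Select, the mechanism PushdownDB builds on; the bucket, object, and schema are hypothetical:

import boto3

s3 = boto3.client("s3")

# Push the filter and projection down to the storage service instead of
# downloading the whole object and filtering in the compute layer.
resp = s3.select_object_content(
    Bucket="example-bucket",   # hypothetical bucket
    Key="lineitem.csv",        # hypothetical object
    ExpressionType="SQL",
    Expression="SELECT s.l_orderkey FROM S3Object s WHERE CAST(s.l_quantity AS FLOAT) > 45",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())  # only matching rows come back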
Enabling New Capabilities and Embracing New Hardware
Modern applications want queries to reflect the latest transactions, not data from hours ago. HTAP systems support this but require migration to new engines. Disaggregated architectures offer an opportunity here, and Hermes (VLDB'25) exploits this by placing itself between the compute and storage. Hermes intercepts transactional logs and analytical reads, merging recent updates into queries on the fly. Updates are later batched into stable storage.
Disaggregation also motivates and eases adoption of new hardware. Different components can use specialized GPUs, RDMA, CXL in order to achieve the best cost-performance tradeoff. The paper cites a GPU-based DuckDB engine (VLDB'25) that achieves large speedups by exploiting parallelism.
Discussion
So this paper mentions several new research directions. What should systems researchers in academia work on?
Here is what I think would be an impactful work. Take a monolithic database (like Postgres, RocksDB, or MySQL) and transform it to a disaggregated database. But not just for shits and giggles. Study the efficiency tradeoffs in alternative designs. Also study the software engineering tradeoffs, cost-to-production tradeoffs, resilience tradeoffs and metastability risks. Compare and contrast different transformation paths. This would provide a good roadmap (and paths to avoid minefields) for many databases that consider a similar redesign/implementation. Last year I had reviewed this paper which made a small crack at this, but I believe there is a lot of research and work here to be tackled.
For distributed protocol designers, the rethinking core protocols section provides a good blueprint. Pick other protocols, such as consensus, leader election, replication, caching, and revisit them in the disaggregated setting, and consider the new opportunities that open up alongside the challenges introduced.
If you are still looking for more to read, I had written about disaggregated architecture several times on this blog.
Swimming with Sharks: Analyzing Encrypted Database Traffic Using Wireshark
Percona has a great set of tools known as the Percona Toolkit, one of which is pt-upgrade. The idea behind this tool is to replay a captured sequence of queries that were executed on a different database server. This is very useful to validate if a new version of the database server works as expected […]
Directly query the underlying ClickHouse database in Tinybird via the native HTTP interface
Tinybird now supports the native ClickHouse® HTTP interface, making it easier to connect BI tools and SQL clients directly to your Tinybird workspaces using their built-in ClickHouse drivers.
Our Myrtle Beach vacation
This year was a bastard. Not from work. God, no, I find work relaxing. Reading papers, inventing algorithms, ripping apart distributed systems with TLA+ models -- that's how I have fun. I can do that all day with a grin on my face. But the minute I need to do adulting (like simply calling the cable company to ask why they keep increasing our bill when I'm not looking), I will stress and procrastinate for weeks. And this year, I had a lot of adulting to do to put our house on the market and plan a move to California, all the while juggling to help three kids with school and EC activities. I was pretty stressed most of the time, and I've been grinding my teeth at night like a mule chewing rocks.
Anywho, we botched our great escape to California. House didn't sell quickly, as we hoped it would, and we are stuck in Buffalo for another year. Summer disappeared in a cloud of errands and disappointment, and suddenly it was late August with our kids twitching with pre-school nerves. There was still some time left to salvage the wreck. We needed a beach, any beach. Myrtle Beach! We heard good things about it. One of my friends called it the Redneck Riviera. Good enough for me, and far enough from Buffalo, so we decided to give it a try.
Planning is not my strong suit. My wife took the reins. She scoured hotel reviews like a CIA interrogator, picking through tales of bedbugs, mold, broken elevators. She has a radar for doom. Without her, I'd have booked us straight into some cockroach casino on the boardwalk. But she nailed it. Ten days before departure, she locked down an Airbnb room inside a proper resort hotel, facing the ocean. We chose Airbnb for better customer service and because the photos showed the exact floor and view we would get. Had we booked with the resort directly, there would have been no telling which floor or room we would get.
The best thing about the vacation is anticipation and the wait. We counted down the days, giddy with excitement.
And then, the drive. Always the drive. That is how the Demirbas family rolls: No planes if at all possible. Planes are a scam. For five people, it's bankruptcy on wings. You waste a whole day shuffling through TSA lines, just to pray nervously that your plane doesn't get canceled on you, and, if you are lucky, to sit in a recycled air canister for hours. We once drove from Buffalo to Seattle, and back. And another time to Florida and back. For us, seven hours on asphalt is a warm-up lap from Buffalo to Boston. Myrtle Beach was thirteen. Still doable. Just load the audiobooks, clamp the Bose headphones on my head, and hit the highway. Driving is my meditation: the road pours itself under my car, like some childhood arcade game where the scenery and the other cars scroll past for hours as I nudge the steering wheel left and right to accommodate.
We left Buffalo at 8:30 in our tightly packed Highlander. By noon the youngest announced that she hadn't hit the bathroom that morning, and we stopped at a McDonald's a little south of Pittsburgh. We mostly pass as a normal American family, but at this stop we stood out like a sore thumb. We received a lot of cold blond stares. I later understood why when we drove another 30 minutes: the barns started shouting TRUMP in dripping paint, and we entered West Virginia. God's Country, they call it. Heaven on earth. But it was just some green hills, and the scenery didn't impress me much.
Our next stop was at a rest area in North Carolina, which turned out to be the cleanest, most immaculate rest area I'd ever seen. Somebody in Raleigh must be laundering money through landscaping contracts, but damn if it didn't impress us. Even the butterflies were impressive!
Then Myrtle Beach: 85-degree weather, ocean air, a great view from our flat, and a nice waterpark at the resort. Southern hospitality is real; everyone was smiling. Compared to the winter-scarred faces in Buffalo, it felt like stepping onto another planet. The Carolinas had already shoved their kids back into classrooms, so we owned the pools and ruled the lazy river. The kids tore through the slides. I soaked in the jacuzzi like a tired warrior. At night we binge-watched Quantum Leap. We would have watched during the day too, but my wife dragged us out for beach walks and waterpark raids. Sometimes we need the push.
By the third day, the spell had taken hold. I started to relax. Sleep came easy, deeper than at home. The wave sounds and the sea view worked their magic. Staying right on the beach was great. No hauling gear, no logistics. Step out the door, fall into the ocean, and crawl back to the flat when you're cooked. The flat was clean, spacious, and blessed with a kitchen, so we could gorge on comfort food without shame.
We wondered if we had made a mistake by choosing a resort 4-5 miles north of the Boardwalk. When we visited the Boardwalk on the third day, we realized that it was overrated anyway. It was full of tourist-trap shops and neon lights, and featured a SkyWheel, which we didn't bother to try. We didn't need the Boardwalk. Myrtle Beach itself is the show: the waves, the horizon, and the beach.
Of course, I had to ruin myself. The kids used sunscreen like sensible citizens, and I, an idiot heir to Turkish tanning lore, slathered on olive oil (which I swiped from our kitchen). If it fries an egg, it'll bronze a body, right? Well... I roasted into a lobster, alright... But I ended up slowly shedding skin like a reptile for days afterwards.
The drive back was clean. Salem's Grill in Pittsburgh was our mandatory detour. You go to great food, if great food doesn't get to you. We hit it before 7pm and dined like kings until closing at 8pm. We were back home before midnight. Uneventful driving, the way I like it.
But vacations are lies, sweet lies. Within days the teeth grinding returned. The adult machinery reloaded with forms to sign, kids to shuttle, bills to pay. Adulting feels like having to deal with a constant deluge of junk mail and random chores from the universe.
And, then the saddest part... we will be shipping Ahmet to college. He leaves for Caltech soon (must be his mother's genes). I am proud, of course, but I will miss him a lot. I bought the plane ticket yesterday after weeks of pretending I didn't have to. Kids grow fast. Too fast... It isn't fair.
Building a DOOM-like multiplayer shooter in pure SQL
DOOMQL: A DOOM-like multiplayer shooter in pure SQL
I recently stumbled across Patrick’s excellent DOOM clone running in a browser powered by DuckDB-WASM.
Ever since I read that, I've wanted to push his awesome idea to the logical extreme: build a multiplayer DOOM-like shooter entirely in SQL, with CedarDB doing all the heavy lifting.
During a month of parental leave (i.e., a lot of sleepless nights), I tried exactly that.
Here’s a sneak peek at DOOMQL: