WiredTiger is MongoDB’s default storage engine, but what really occurs behind the scenes when collections and indexes are saved to disk? In this short deep dive, we’ll explore the internals of WiredTiger data files, covering everything from _mdb_catalog metadata and B-Tree page layouts to BSON storage, primary and secondary indexes, and multi-key array handling. The goal is to introduce useful low-level tools like wt and other utilities.
I ran this experiment in a Docker container, set up as described in a previous blog post:
docker run --rm -it --cap-add=SYS_PTRACE mongo bash
# install required packages
apt-get update && apt-get install -y git xxd strace curl jq python3 python3-dev python3-pip python3-venv python3-pymongo python3-bson build-essential cmake gcc g++ libstdc++-12-dev libtool autoconf automake swig liblz4-dev zlib1g-dev libmemkind-dev libsnappy-dev libsodium-dev libzstd-dev
# get WiredTiger main branch
curl -L $(curl -s https://api.github.com/repos/wiredtiger/wiredtiger/releases/latest | jq -r '.tarball_url') -o wiredtiger.tar.gz
git clone https://github.com/wiredtiger/wiredtiger.git
cd wiredtiger
# Compile
mkdir build && cmake -S /wiredtiger -B /wiredtiger/build \
-DCMAKE_C_FLAGS="-O0 -Wno-error -Wno-format-overflow -Wno-error=array-bounds -Wno-error=format-overflow -Wno-error=nonnull" \
-DHAVE_BUILTIN_EXTENSION_SNAPPY=1 \
-DCMAKE_BUILD_TYPE=Release
cmake --build /wiredtiger/build
# add `wt` binaries and other tools to the PATH
export PATH=$PATH:/wiredtiger/build:/wiredtiger/tools
# Start mongodb
mongod &
I use the mongo image, add the WiredTiger sources from the main branch, compile them to get wt, and start mongod.
I create a small collection with three documents and an index, then stop mongod:
mongosh <<'JS'
db.franck.insertMany([
{_id:"aaa",val1:"xxx",val2:"yyy",val3:"zzz",msg:"hello world"},
{_id:"bbb",val1:"xxx",val2:"yyy",val3:"zzz",msg:["hello","world"]},
{_id:"ccc",val1:"xxx",val2:"yyy",val3:"zzz",msg:["hello","world","hello","again"]}
]);
db.franck.createIndex({_id:1,val1:1,val2:1,val3:1,msg:1});
db.franck.find().showRecordId();
use admin;
db.shutdownServer();
JS
I stop MongoDB so that I can access the WiredTiger files with wt without them being opened and locked by another program. Before stopping, I displayed the documents:
[
{
_id: 'aaa',
val1: 'xxx',
val2: 'yyy',
val3: 'zzz',
msg: 'hello world',
'$recordId': Long('1')
},
{
_id: 'bbb',
val1: 'xxx',
val2: 'yyy',
val3: 'zzz',
msg: [ 'hello', 'world' ],
'$recordId': Long('2')
},
{
_id: 'ccc',
val1: 'xxx',
val2: 'yyy',
val3: 'zzz',
msg: [ 'hello', 'world', 'hello', 'again' ],
'$recordId': Long('3')
}
]
The files are stored in the default WiredTiger directory, /data/db. The MongoDB catalog, which maps MongoDB collections to their storage attributes, is stored in the WiredTiger table _mdb_catalog:
root@72cf410c04cb:/wiredtiger# ls -altU /data/db
drwxr-xr-x. 4 root root 32 Sep 1 23:10 ..
-rw-------. 1 root root 0 Sep 13 20:33 mongod.lock
drwx------. 2 root root 74 Sep 13 20:29 journal
-rw-------. 1 root root 21 Sep 12 22:47 WiredTiger.lock
-rw-------. 1 root root 50 Sep 12 22:47 WiredTiger
-rw-------. 1 root root 73728 Sep 13 20:33 WiredTiger.wt
-rw-r--r--. 1 root root 1504 Sep 13 20:33 WiredTiger.turtle
-rw-------. 1 root root 4096 Sep 13 20:33 WiredTigerHS.wt
-rw-------. 1 root root 36864 Sep 13 20:33 sizeStorer.wt
-rw-------. 1 root root 36864 Sep 13 20:33 _mdb_catalog.wt
-rw-------. 1 root root 114 Sep 12 22:47 storage.bson
-rw-------. 1 root root 20480 Sep 13 20:33 collection-0-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:33 index-1-3767590060964183367.wt
-rw-------. 1 root root 36864 Sep 13 20:33 collection-2-3767590060964183367.wt
-rw-------. 1 root root 36864 Sep 13 20:33 index-3-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:20 collection-4-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:20 index-5-3767590060964183367.wt
-rw-------. 1 root root 20480 Sep 13 20:33 index-6-3767590060964183367.wt
drwx------. 2 root root 4096 Sep 13 20:33 diagnostic.data
drwx------. 3 root root 21 Sep 13 20:17 .mongodb
-rw-------. 1 root root 20480 Sep 13 20:33 collection-0-6917019827977430149.wt
-rw-------. 1 root root 20480 Sep 13 20:23 index-1-6917019827977430149.wt
-rw-------. 1 root root 20480 Sep 13 20:25 index-2-6917019827977430149.wt
Catalog
_mdb_catalog maps MongoDB names to WiredTiger table names. wt lists the key (recordId) and value (BSON):
root@72cf410c04cb:~# wt -h /data/db dump table:_mdb_catalog
WiredTiger Dump (WiredTiger Version 12.0.0)
Format=print
Header
table:_mdb_catalog
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=snappy,block_manager=default,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,disaggregated=(page_log=),encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(compare_timestamp=oldest_timestamp,enabled=false,file_metadata=,metadata_file=,panic_corrupt=true,repair=false),in_memory=false,ingest=,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=,bloom=,bloom_bit_count=,bloom_config=,bloom_hash_count=,bloom_oldest=,chunk_count_limit=,chunk_max=,chunk_size=,merge_max=,merge_min=),memory_page_image_max=0,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,source="file:_mdb_catalog.wt",split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,stable=,tiered_storage=(auth_token=,bucket=,bucket_prefix=,cache_directory=,local_retention=300,name=,object_target_size=0),type=file,value_format=u,verbose=[],write_timestamp_usage=none
Data
\81
r\01\00\00\03md\00\eb\00\00\00\02ns\00\15\00\00\00admin.system.version\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04\ba\fc\c2\a9;EC\94\9d\a1\df(\c9\87\eaW\00\04indexes\00\97\00\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00+\00\00\00\02_id_\00\1c\00\00\00index-1-3767590060964183367\00\00\02ns\00\15\00\00\00admin.system.version\00\02ident\00!\00\00\00collection-0-3767590060964183367\00\00
\82
\7f\01\00\00\03md\00\fb\00\00\00\02ns\00\12\00\00\00local.startup_log\00\03options\003\00\00\00\05uuid\00\10\00\00\00\042}_\a9\16,L\13\aa*\09\b5<\ea\aa\d6\08capped\00\01\10size\00\00\00\a0\00\00\04indexes\00\97\00\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00+\00\00\00\02_id_\00\1c\00\00\00index-3-3767590060964183367\00\00\02ns\00\12\00\00\00local.startup_log\00\02ident\00!\00\00\00collection-2-3767590060964183367\00\00
\83
^\02\00\00\03md\00\a7\01\00\00\02ns\00\17\00\00\00config.system.sessions\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04D\09],\c6\15FG\b6\e2m!\ba\c4j<\00\04indexes\00Q\01\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\031\00\b7\00\00\00\03spec\00R\00\00\00\10v\00\02\00\00\00\03key\00\12\00\00\00\10lastUse\00\01\00\00\00\00\02name\00\0d\00\00\00lsidTTLIndex\00\10expireAfterSeconds\00\08\07\00\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\14\00\00\00\05lastUse\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00Y\00\00\00\02_id_\00\1c\00\00\00index-5-3767590060964183367\00\02lsidTTLIndex\00\1c\00\00\00index-6-3767590060964183367\00\00\02ns\00\17\00\00\00config.system.sessions\00\02ident\00!\00\00\00collection-4-3767590060964183367\00\00
\84
\a6\02\00\00\03md\00\e6\01\00\00\02ns\00\0c\00\00\00test.franck\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04>\04\ec\e2SUK\ca\98\e8\bf\fe\0eu\81L\00\04indexes\00\9b\01\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\031\00\01\01\00\00\03spec\00q\00\00\00\10v\00\02\00\00\00\03key\005\00\00\00\10_id\00\01\00\00\00\10val1\00\01\00\00\00\10val2\00\01\00\00\00\10val3\00\01\00\00\00\10msg\00\01\00\00\00\00\02name\00!\00\00\00_id_1_val1_1_val2_1_val3_1_msg_1\00\00\08ready\00\01\08multikey\00\01\03multikeyPaths\00?\00\00\00\05_id\00\01\00\00\00\00\00\05val1\00\01\00\00\00\00\00\05val2\00\01\00\00\00\00\00\05val3\00\01\00\00\00\00\00\05msg\00\01\00\00\00\00\01\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00m\00\00\00\02_id_\00\1c\00\00\00index-1-6917019827977430149\00\02_id_1_val1_1_val2_1_val3_1_msg_1\00\1c\00\00\00index-2-6917019827977430149\00\00\02ns\00\0c\00\00\00test.franck\00\02ident\00!\00\00\00collection-0-6917019827977430149\00\00
I can decode the BSON value with wt_to_mdb_bson.py to display it as JSON, and use jq to filter the file information about the collection I've created:
wt -h /data/db dump -x table:_mdb_catalog |
wt_to_mdb_bson.py -m dump -j |
jq 'select(.value.ns == "test.franck") |
{ns: .value.ns, ident: .value.ident, idxIdent: .value.idxIdent}
'
{
"ns": "test.franck",
"ident": "collection-0-6917019827977430149",
"idxIdent": {
"_id_": "index-1-6917019827977430149",
"_id_1_val1_1_val2_1_val3_1_msg_1": "index-2-6917019827977430149"
}
}
ident is the WiredTiger table name (collection-...) that holds the collection's documents. Every collection has a primary key index on "_id", and may have additional secondary indexes, each stored in its own WiredTiger table (index-...). These tables are the .wt files in the data directory.
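The same mapping can also be read with plain python3-bson (installed above). Here is a minimal sketch that assumes the wt dump -x layout, a Data header followed by alternating key/value lines in hexadecimal; the script name decode_catalog.py is just an example:
# decode_catalog.py: read "wt dump -x" output from stdin and print the
# namespace -> ident mapping found in the BSON values (sketch, assumes the
# Data section contains alternating key/value hex lines).
import sys
import bson  # provided by python3-bson / pymongo

lines = [line.strip() for line in sys.stdin]
hex_lines = [l for l in lines[lines.index("Data") + 1:] if l]
for key_hex, value_hex in zip(hex_lines[0::2], hex_lines[1::2]):
    doc = bson.decode(bytes.fromhex(value_hex))
    if "ns" in doc and "ident" in doc:
        print(doc["ns"], "->", doc["ident"], doc.get("idxIdent"))
It would be used as: wt -h /data/db dump -x table:_mdb_catalog | python3 decode_catalog.py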
Collection
Using the WiredTiger table name for the collection, I dump its content (keys and values) and decode it as JSON:
wt -h /data/db dump -x table:collection-0-6917019827977430149 |
wt_to_mdb_bson.py -m dump -j
{"key": "81", "value": {"_id": "aaa", "val1": "xxx", "val2": "yyy", "val3": "zzz", "msg": "hello world"}}
{"key": "82", "value": {"_id": "bbb", "val1": "xxx", "val2": "yyy", "val3": "zzz", "msg": ["hello", "world"]}}
{"key": "83", "value": {"_id": "ccc", "val1": "xxx", "val2": "yyy", "val3": "zzz", "msg": ["hello", "world", "hello", "again"]}}
The "key" here is the recordId — an internal, unsigned 64-bit integer MongoDB uses (when not using clustered collections) to order documents in the collection table. The 0x80 offset is because the storage key is stored as a signed 8‑bit integer, but encoded in an order-preserving way.
I can also use wt_binary_decode.py to look at the on-disk pages. Here is the leaf page (page type: 7 (WT_PAGE_ROW_LEAF)) that contains my three documents as six key and value cells (ncells (oflow len): 6):
wt_binary_decode.py --offset 4096 --page 1 --verbose --split --bson /data/db/collection-0-6917019827977430149.wt
/data/db/collection-0-6917019827977430149.wt, position 0x1000/0x5000, pagelimit 1
Decode at 4096 (0x1000)
0: 00 00 00 00 00 00 00 00 1f 0f 00 00 00 00 00 00 5f 01 00 00
06 00 00 00 07 04 00 01 00 10 00 00 64 0a ec 4b 01 00 00 00
Page Header:
recno: 0
writegen: 3871
memsize: 351
ncells (oflow len): 6
page type: 7 (WT_PAGE_ROW_LEAF)
page flags: 0x4
version: 1
Block Header:
disk_size: 4096
checksum: 0x4bec0a64
block flags: 0x1
0: 28: 05 81
desc: 0x5 short key 1 bytes:
<packed 1 (0x1)>
1: 2a: 80 91 51 00 00 00 02 5f 69 64 00 04 00 00 00 61 61 61 00 02
76 61 6c 31 00 04 00 00 00 78 78 78 00 02 76 61 6c 32 00 04
00 00 00 79 79 79 00 02 76 61 6c 33 00 04 00 00 00 7a 7a 7a
00 02 6d 73 67 00 0c 00 00 00 68 65 6c 6c 6f 20 77 6f 72 6c
64 00 00
cell is valid BSON
{ '_id': 'aaa',
'msg': 'hello world',
'val1': 'xxx',
'val2': 'yyy',
'val3': 'zzz'}
2: 7d: 05 82
desc: 0x5 short key 1 bytes:
<packed 2 (0x2)>
3: 7f: 80 a0 60 00 00 00 02 5f 69 64 00 04 00 00 00 62 62 62 00 02
76 61 6c 31 00 04 00 00 00 78 78 78 00 02 76 61 6c 32 00 04
00 00 00 79 79 79 00 02 76 61 6c 33 00 04 00 00 00 7a 7a 7a
00 04 6d 73 67 00 1f 00 00 00 02 30 00 06 00 00 00 68 65 6c
6c 6f 00 02 31 00 06 00 00 00 77 6f 72 6c 64 00 00 00
cell is valid BSON
{ '_id': 'bbb',
'msg': ['hello', 'world'],
'val1': 'xxx',
'val2': 'yyy',
'val3': 'zzz'}
4: e1: 05 83
desc: 0x5 short key 1 bytes:
<packed 3 (0x3)>
5: e3: 80 ba 7a 00 00 00 02 5f 69 64 00 04 00 00 00 63 63 63 00 02
76 61 6c 31 00 04 00 00 00 78 78 78 00 02 76 61 6c 32 00 04
00 00 00 79 79 79 00 02 76 61 6c 33 00 04 00 00 00 7a 7a 7a
00 04 6d 73 67 00 39 00 00 00 02 30 00 06 00 00 00 68 65 6c
6c 6f 00 02 31 00 06 00 00 00 77 6f 72 6c 64 00 02 32 00 06
00 00 00 68 65 6c 6c 6f 00 02 33 00 06 00 00 00 61 67 61 69
6e 00 00 00
cell is valid BSON
{ '_id': 'ccc',
'msg': ['hello', 'world', 'hello', 'again'],
'val1': 'xxx',
'val2': 'yyy',
'val3': 'zzz'}
The script shows the raw hexadecimal bytes for the key, a description of the cell type, and the decoded logical value using WiredTiger’s order‑preserving integer encoding (packed int encoding). In this example, the raw byte 0x81 decodes to record ID 1:
0: 28: 05 81
desc: 0x5 short key 1 bytes:
<packed 1 (0x1)>
Here is the branch page (page type: 6 (WT_PAGE_ROW_INT)) that references it:
wt_binary_decode.py --offset 8192 --page 1 --verbose --split --bson /data/db/collection-0-6917019827977430149.wt
/data/db/collection-0-6917019827977430149.wt, position 0x2000/0x5000, pagelimit 1
Decode at 8192 (0x2000)
0: 00 00 00 00 00 00 00 00 20 0f 00 00 00 00 00 00 34 00 00 00
02 00 00 00 06 00 00 01 00 10 00 00 21 df 20 d6 01 00 00 00
Page Header:
recno: 0
writegen: 3872
memsize: 52
ncells (oflow len): 2
page type: 6 (WT_PAGE_ROW_INT)
page flags: 0x0
version: 1
Block Header:
disk_size: 4096
checksum: 0xd620df21
block flags: 0x1
0: 28: 05 00
desc: 0x5 short key 1 bytes:
""
1: 2a: 38 00 87 80 81 e4 4b eb ea 24
desc: 0x38 addr (leaf no-overflow) 7 bytes:
<packed 0 (0x0)> <packed 1 (0x1)> <packed 1273760356 (0x4bec0a64)>
As we have seen in the previous blog post, the pointer includes the checksum of the page it references (0x4bec0a64) to detect disk corruption.
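The three packed integers are the block address cookie: an offset, a size, and the checksum. As a rough sketch, assuming the cookie stores offset/allocation_size - 1 and size/allocation_size with the 4KB allocation size from the table configuration (the expand_addr helper is only illustrative), it can be expanded like this:
# Expand a block address cookie (offset, size, checksum) into byte values,
# assuming offset is stored as offset/allocsize - 1 and size as size/allocsize.
ALLOC_SIZE = 4096

def expand_addr(packed_offset, packed_size, packed_checksum):
    return {
        "offset": (packed_offset + 1) * ALLOC_SIZE,
        "size": packed_size * ALLOC_SIZE,
        "checksum": hex(packed_checksum),
    }

print(expand_addr(0, 1, 0x4BEC0A64))
# {'offset': 4096, 'size': 4096, 'checksum': '0x4bec0a64'}: the leaf page at 0x1000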
Another utility, bsondump, can display the output of wt dump -x as JSON, like wt_to_mdb_bson.py does, but it requires some filtering to isolate the BSON content:
wt -h /data/db dump -x table:collection-0-6917019827977430149 | # dump in hex
egrep '025f696400' | # keep the value lines: every document contains \x02"_id"\x00
xxd -r -p | # convert the hex back to plain binary
bsondump --type=json # display the BSON as JSON
{"_id":"aaa","val1":"xxx","val2":"yyy","val3":"zzz","msg":"hello world"}
{"_id":"bbb","val1":"xxx","val2":"yyy","val3":"zzz","msg":["hello","world"]}
{"_id":"ccc","val1":"xxx","val2":"yyy","val3":"zzz","msg":["hello","world","hello","again"]}
2025-09-14T08:57:36.182+0000 3 objects found
It also provides a debug output type that gives more insight into how documents are stored internally, especially those containing arrays:
wt -h /data/db dump -x table:collection-0-6917019827977430149 | # dump in hex
egrep '025f696400' | # keep the value lines: every document contains \x02"_id"\x00
xxd -r -p | # convert the hex back to plain binary
bsondump --type=debug # display the BSON as it is stored
--- new object ---
size : 81
_id
type: 2 size: 13
val1
type: 2 size: 14
val2
type: 2 size: 14
val3
type: 2 size: 14
msg
type: 2 size: 21
--- new object ---
size : 96
_id
type: 2 size: 13
val1
type: 2 size: 14
val2
type: 2 size: 14
val3
type: 2 size: 14
msg
type: 4 size: 36
--- new object ---
size : 31
0
type: 2 size: 13
1
type: 2 size: 13
--- new object ---
size : 122
_id
type: 2 size: 13
val1
type: 2 size: 14
val2
type: 2 size: 14
val3
type: 2 size: 14
msg
type: 4 size: 62
--- new object ---
size : 57
0
type: 2 size: 13
1
type: 2 size: 13
2
type: 2 size: 13
3
type: 2 size: 13
2025-09-14T08:59:15.268+0000 3 objects found
Arrays in BSON are just sub-documents whose field names are the array positions ("0", "1", ...).
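This is easy to verify with python3-bson (installed above): encoding the msg field as an array and as a sub-document with keys "0" and "1" produces payloads that differ only in the element type byte (0x04 for array, 0x03 for embedded document).
# Compare the BSON encoding of an array with that of a sub-document whose
# field names are the array positions.
import bson  # provided by python3-bson / pymongo

as_array = bson.encode({"msg": ["hello", "world"]})
as_document = bson.encode({"msg": {"0": "hello", "1": "world"}})

print(as_array.hex())
print(as_document.hex())
# Only one byte differs: the element type, 0x04 (array) vs 0x03 (document).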
Primary index
RecordId is the internal, logical key used in the B-tree that stores the collection. It allows documents to be moved physically, without fragmentation, when they are updated, because all indexes reference documents by recordId rather than by their physical location. Access by "_id" goes through a unique index, created automatically with the collection and stored as another WiredTiger table. Here is its content:
wt -h /data/db dump -p table:index-1-6917019827977430149
WiredTiger Dump (WiredTiger Version 12.0.0)
Format=print
Header
table:index-1-6917019827977430149
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=8),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=,block_manager=default,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,disaggregated=(page_log=),encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(compare_timestamp=oldest_timestamp,enabled=false,file_metadata=,metadata_file=,panic_corrupt=true,repair=false),in_memory=false,ingest=,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=16k,key_format=u,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=16k,leaf_value_max=0,log=(enabled=true),lsm=(auto_throttle=,bloom=,bloom_bit_count=,bloom_config=,bloom_hash_count=,bloom_oldest=,chunk_count_limit=,chunk_max=,chunk_size=,merge_max=,merge_min=),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=true,prefix_compression_min=4,source="file:index-1-6917019827977430149.wt",split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,stable=,tiered_storage=(auth_token=,bucket=,bucket_prefix=,cache_directory=,local_retention=300,name=,object_target_size=0),type=file,value_format=u,verbose=[],write_timestamp_usage=none
Data
<aaa\00\04
\00\08
<bbb\00\04
\00\10
<ccc\00\04
\00\18
There are three entries, one per document, with the "_id" value (aaa, bbb, ccc) as the key and the recordId as the value. Both are packed in MongoDB's KeyString format (see the documentation); for example, the < byte is the type marker that prefixes a string value.
In MongoDB’s KeyString format, the recordId is stored in a special packed encoding where three bits are appended to the right of the big-endian value, so that the length can be stored at the end of the key. The same encoding is used when the recordId is in the value part of the index entry, as in a unique index. To decode it, you need to shift the last byte right by three bits. Here, 0x08 >> 3 = 1, 0x10 >> 3 = 2, and 0x18 >> 3 = 3, which are the recordIds of my documents.
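A trivial helper makes the shift explicit; this sketch only covers the short, single-byte form of the RecordId suffix shown above:
# The low 3 bits of the last byte hold a length, so small RecordId values
# are recovered by shifting the byte right by three bits.
def decode_small_record_id(last_byte):
    return last_byte >> 3

for b in (0x08, 0x10, 0x18):
    print(hex(b), "->", decode_small_record_id(b))  # -> 1, 2, 3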
I decode the page that contains those index entries:
wt_binary_decode.py --offset 4096 --page 1 --verbose --split /data/db/index-1-6917019827977430149.wt
/data/db/index-1-6917019827977430149.wt, position 0x1000/0x5000, pagelimit 1
Decode at 4096 (0x1000)
0: 00 00 00 00 00 00 00 00 1f 0f 00 00 00 00 00 00 46 00 00 00
06 00 00 00 07 04 00 01 00 10 00 00 7c d3 87 60 01 00 00 00
Page Header:
recno: 0
writegen: 3871
memsize: 70
ncells (oflow len): 6
page type: 7 (WT_PAGE_ROW_LEAF)
page flags: 0x4
version: 1
Block Header:
disk_size: 4096
checksum: 0x6087d37c
block flags: 0x1
0: 28: 19 3c 61 61 61 00 04
desc: 0x19 short key 6 bytes:
"<aaa"
1: 2f: 0b 00 08
desc: 0xb short val 2 bytes:
"
2: 32: 19 3c 62 62 62 00 04
desc: 0x19 short key 6 bytes:
"<bbb"
3: 39: 0b 00 10
desc: 0xb short val 2 bytes:
""
4: 3c: 19 3c 63 63 63 00 04
desc: 0x19 short key 6 bytes:
"<ccc"
5: 43: 0b 00 18
desc: 0xb short val 2 bytes:
""
This utility doesn't decode the recordId; we need to shift it ourselves. There's no BSON to decode in the indexes.
Secondary index
Secondary indexes are similar, except that they can be composed of multiple fields, and any indexed field can contain an array, which may result in multiple index entries for a single document, like an inverted index.
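Conceptually, the entries generated for one document are the cartesian product of its key parts, where an array field contributes one part per element (duplicate elements yield the same key, so they collapse to a single entry). The sketch below only illustrates the idea; it is not MongoDB's actual key-generation code:
# Illustration only: generate the logical index entries for a document on
# {_id:1, val1:1, val2:1, val3:1, msg:1}, treating array fields as multikey.
from itertools import product

def index_entries(doc, fields):
    parts = []
    for field in fields:
        value = doc.get(field)
        parts.append(value if isinstance(value, list) else [value])
    return sorted(set(product(*parts)))

doc = {"_id": "ccc", "val1": "xxx", "val2": "yyy", "val3": "zzz",
       "msg": ["hello", "world", "hello", "again"]}
for entry in index_entries(doc, ["_id", "val1", "val2", "val3", "msg"]):
    print(entry)
# ('ccc', 'xxx', 'yyy', 'zzz', 'again')
# ('ccc', 'xxx', 'yyy', 'zzz', 'hello')
# ('ccc', 'xxx', 'yyy', 'zzz', 'world')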
MongoDB tracks which indexed fields contain arrays to improve query planning. A multikey index creates an entry for each array element, and if multiple fields are multikey, it stores entries for all combinations of their values. By knowing exactly which fields are multikey, the query planner can apply tighter index bounds when only one field is involved. This information is stored in the catalog as a "multikey" flag along with the specific "multikeyPaths":
wt -h /data/db dump -x table:_mdb_catalog |
wt_to_mdb_bson.py -m dump -j |
jq 'select(.value.ns == "test.franck") |
.value.md.indexes[] |
{name: .spec.name, key: .spec.key, multikey: .multikey, multikeyPaths: .multikeyPaths | keys}
'
{
"name": "_id_",
"key": {
"_id": { "$numberInt": "1" },
},
"multikey": false,
"multikeyPaths": [
"_id"
]
}
{
"name": "_id_1_val1_1_val2_1_val3_1_msg_1",
"key": {
"_id": { "$numberInt": "1" },
"val1": { "$numberInt": "1" },
"val2": { "$numberInt": "1" },
"val3": { "$numberInt": "1" },
"msg": { "$numberInt": "1" },
},
"multikey": true,
"multikeyPaths": [
"_id",
"msg",
"val1",
"val2",
"val3"
]
}
Here is the dump of my index on {_id:1,val1:1,val2:1,val3:1,msg:1}:
wt -h /data/db dump -p table:index-2-6917019827977430149
WiredTiger Dump (WiredTiger Version 12.0.0)
Format=print
Header
table:index-2-6917019827977430149
access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=8),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=,block_manager=default,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,disaggregated=(page_log=),encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(compare_timestamp=oldest_timestamp,enabled=false,file_metadata=,metadata_file=,panic_corrupt=true,repair=false),in_memory=false,ingest=,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=16k,key_format=u,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=16k,leaf_value_max=0,log=(enabled=true),lsm=(auto_throttle=,bloom=,bloom_bit_count=,bloom_config=,bloom_hash_count=,bloom_oldest=,chunk_count_limit=,chunk_max=,chunk_size=,merge_max=,merge_min=),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=true,prefix_compression_min=4,source="file:index-2-6917019827977430149.wt",split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,stable=,tiered_storage=(auth_token=,bucket=,bucket_prefix=,cache_directory=,local_retention=300,name=,object_target_size=0),type=file,value_format=u,verbose=[],write_timestamp_usage=none
Data
<aaa\00<xxx\00<yyy\00<zzz\00<hello world\00\04\00\08
(null)
<bbb\00<xxx\00<yyy\00<zzz\00<hello\00\04\00\10
(null)
<bbb\00<xxx\00<yyy\00<zzz
by Franck Pachot