a curated list of database news from authoritative sources

December 01, 2025

Impact of Starting PostgreSQL Service Manually in an Active Patroni Cluster

A nightmare no DBA wants. One day, I had a nightmare, the kind every DBA fears. My manager called me and said, “Hey… Mr. Clever. We don’t need you on the team anymore.” I panicked. “What did I do wrong?!” He replied, “You tell me. Think about your ‘great’ work yesterday.” Trying to defend myself, […]

Introducing Vector Buckets

Introducing vector storage in Supabase: a durable storage layer with similarity search built-in.

November 29, 2025

Using sysbench to measure how Postgres performance changes over time, November 2025 edition

This has results for the sysbench benchmark on a small and big server for Postgres versions 12 through 18. Once again, Postgres is boring because I search for perf regressions and can't find any here. Results from MySQL are here and MySQL is not boring.

While I don't show the results here, I don't see regressions when comparing the latest point releases with their predecessors -- 13.22 vs 13.23, 14.19 vs 14.20, 15.14 vs 15.15, 16.10 vs 16.11, 17.6 vs 17.7 and 18.0 vs 18.1.

tl;dr

  • a few small regressions
  • many more small improvements
  • for write-heavy tests at high-concurrency there are many large improvements starting in PG 17

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.22, 13.23, 14.19, 14.20, 15.14, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used two servers:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • big
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Configuration files for the small server
  • Configuration files are here for Postgres versions 12, 13, 14, 15, 16 and 17.
  • For Postgres 18 I used io_method=sync and the configuration file is here.
Configuration files for the big server
  • Configuration files are here for Postgres versions 12, 13, 14, 15, 16 and 17.
  • For Postgres 18 I used io_method=sync and the configuration file is here.
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. On the small server the benchmark is run with 1 client and 1 table with 50M rows. On the big server the benchmark is run with 12 clients and 8 tables with 10M rows per table. 

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 12.22)
When the relative QPS is > 1 then some version is faster than Postgres 12.22. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than Postgres 12.22.

Values from iostat and vmstat divided by QPS are here for the small server and the big server. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

This is from the small server.
  • a large improvement arrived in Postgres 17 for the hot-points test
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • a large improvement arrived in Postgres 17 for the hot-points test
  • otherwise results have been stable from 12.22 through 18.1
Results: range queries without aggregation

This is from the small server.
  • there are small improvements for the scan test
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there are small improvements for the scan test
  • otherwise results have been stable from 12.22 through 18.1
Results: range queries with aggregation

This is from the small server.
  • there are small improvements for a few tests
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there might be small regressions for a few tests
  • otherwise results have been stable from 12.22 through 18.1
Results: writes

This is from the small server.
  • there are small improvements for most tests
  • otherwise results have been stable from 12.22 through 18.1
This is from the big server.
  • there are large improvements for half of the tests
  • otherwise results have been stable from 12.22 through 18.1
From vmstat results for update-index the per-operation CPU overhead and context switch rate are much smaller starting in Postgres 17.7. The CPU overhead is about 70% of what it was in 16.11 and the context switch rate is about 50% of the rate for 16.11. Note that context switch rates are often a proxy for mutex contention.

Script to list MongoDB collection URI (to map to WiredTiger files)

When tracing I/O to WiredTiger files, like with strace -y, the filenames do not give the collection name:

strace -yy -e trace=pwrite64 -s100 -fqT -p $(pgrep -d, mongod)
[pid   237] pwrite64(15</data/db/journal/WiredTigerLog.0000000001>, "\0\2\0\0\6\314kK\0\0\0\0\0\0\0\0\201\300\211\204\300%\202\265table:collection-2fb69242-1b95-4a08-9939-23172e5ea178+\0\0\0\22numRecords\0\3\0\0\0\0\0\0"..., 512, 200576) = 512 <0.000179>
[pid   238] pwrite64(69</data/db/collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1.wt>, "\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0`\0\0\0\2\0\0\0\7\4\0\1\0\20\0\0\225\21\375t\1\0\0\0\5\201\214(\350i*%\214\377\377\337\303\300\207\200\247'\0\0\0\2_id\0\6\0\0\0test1\0\2marker\0\7\0\0\0Franck\0\0\0\0\0\0"..., 4096, 4096) = 4096 <0.000053>
[pid   238] pwrite64(69</data/db/collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1.wt>, "\0\0\0\0\0\0\0\0\t\0\0\0\0\0\0\0@\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\22B\312\326\1\0\0\0\5\08*\350i*%\214\377\377\337\303\300\207\200\207\200\201\344t\374\361U\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8192) = 4096 <0.000034>
[pid   238] pwrite64(69</data/db/collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0002\0\0\0\n\0\0\0\1\0\0\1\0\20\0\0V\335F\25\1\0\0\0\342\365\32\200\317\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 12288) = 4096 <0.000028>
[pid   238] pwrite64(69</data/db/collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0\21\0\0\0\1\0\0\1\0\20\0\0m/\302\306\1\0\0\0\342\365\32\200\342\37\300\317\300\342/\300\342\37\300\200\201\212\342o\300\0\0\0\0\0\0\304Oencryption=(keyid=,name=),block"..., 4096, 16384) = 4096 <0.000022>
[pid   238] pwrite64(37</data/db/index-58332e85-20c2-4b5d-a14c-4d2a62f2f7f4.wt>, "\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\322\0\0\0\4\0\0\0\7\4\0\1\0\20\0\0)\1=\20\1\0\0\0P\202FZid\0Z\20\4<\353\316\272\256\317E\17\240nK\233X~\233\6Zuid\0Z \0\343\260\304B\230\374\34\24\232\373\364\310\231o\271$'\256A\344d\233\223L\244\225"..., 4096, 20480) = 4096 <0.000069>
[pid   238] pwrite64(37</data/db/index-58332e85-20c2-4b5d-a14c-4d2a62f2f7f4.wt>, "\0\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0E\0\0\0\2\0\0\0\6 \0\1\0\20\0\0p\3142\240\1\0\0\0\5\08*\350i*%F\377\377\337\302\300\205\345E\377\377\337\300\207\204\201\344\20<\340\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 24576) = 4096 <0.000057>
[pid   238] pwrite64(37</data/db/index-58332e85-20c2-4b5d-a14c-4d2a62f2f7f4.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0\0\v\0\0\0\1\0\0\1\0\20\0\0\364'AB\1\0\0\0\342\365\32\200\342/\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 28672) = 4096 <0.000083>
[pid   238] pwrite64(37</data/db/index-58332e85-20c2-4b5d-a14c-4d2a62f2f7f4.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0\0\v\0\0\0\1\0\0\1\0\20\0\0\210\201\f\231\1\0\0\0\342\365\32\200\317\300\342\37\300\200\201\207\342o\300\0\0\0\0\0\0\304Gencryption=(keyid=,name=),block_metad"..., 4096, 32768) = 4096 <0.000012>
[pid   238] pwrite64(36</data/db/collection-2f934890-bfba-4bf1-8e30-a4983d1e443e.wt>, "\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\320\1\0\0\4\0\0\0\7\4\0\1\0\20\0\0\3y\341\332\1\0\0\0\5\201\214(\350i*%F\377\377\337\302\300s\200\300\202\302\0\0\0\3_id\0H\0\0\0\5id\0\20\0\0\0\4^\v\252=\334\206IR\272\320\264\246v\353\27.\5uid"..., 4096, 20480) = 4096 <0.000051>
[pid   238] pwrite64(36</data/db/collection-2f934890-bfba-4bf1-8e30-a4983d1e443e.wt>, "\0\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0E\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\241yx\34\1\0\0\0\5\08*\350i*%F\377\377\337\302\300\205\345E\377\377\337\300\207\204\201\344\332\341X\303\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 24576) = 4096 <0.000085>
[pid   238] pwrite64(36</data/db/collection-2f934890-bfba-4bf1-8e30-a4983d1e443e.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0\0\v\0\0\0\1\0\0\1\0\20\0\0\364'AB\1\0\0\0\342\365\32\200\342/\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 28672) = 4096 <0.000027>
[pid   238] pwrite64(36</data/db/collection-2f934890-bfba-4bf1-8e30-a4983d1e443e.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0\0\v\0\0\0\1\0\0\1\0\20\0\0d\3230Y\1\0\0\0\342\365\32\200\317\300\342\37\300\200\201\207\342o\300\0\0\0\0\0\0\304Oencryption=(keyid=,name=),block_metad"..., 4096, 32768) = 4096 <0.000041>
[pid   238] pwrite64(5</data/db/index-0bdb5c19-b8d1-4329-a85e-dce59f589b66.wt>, "\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0'\1\0\0\6\0\0\0\7\4\0\1\0\20\0\0\33\312\247\363\1\0\0\0P\202FZid\0Z\20\4<\353\316\272\256\317E\17\240nK\233X~\233\6Zuid\0Z \0\343\260\304B\230\374\34\24\232\373\364\310\231o\271$'\256A\344d\233\223L\244\225"..., 4096, 4096) = 4096 <0.000026>
[pid   238] pwrite64(5</data/db/index-0bdb5c19-b8d1-4329-a85e-dce59f589b66.wt>, "\0\0\0\0\0\0\0\0\t\0\0\0\0\0\0\0E\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\343\21\367&\1\0\0\0\5\08*\350i*$\374\377\377\337\303\300\205\345\217\377\377\337\277\207\200\201\344\363\247\251\333\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8192) = 4096 <0.000271>
[pid   238] pwrite64(5</data/db/index-0bdb5c19-b8d1-4329-a85e-dce59f589b66.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0002\0\0\0\n\0\0\0\1\0\0\1\0\20\0\0V\335F\25\1\0\0\0\342\365\32\200\317\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 12288) = 4096 <0.000055>
[pid   238] pwrite64(5</data/db/index-0bdb5c19-b8d1-4329-a85e-dce59f589b66.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0\21\0\0\0\1\0\0\1\0\20\0\0H\224$\305\1\0\0\0\342\365\32\200\342\37\300\317\300\342/\300\342\37\300\200\201\212\342o\300\0\0\0\0\0\0\304Gencryption=(keyid=,name=),block"..., 4096, 16384) = 4096 <0.000127>
[pid   238] pwrite64(49</data/db/collection-2fb69242-1b95-4a08-9939-23172e5ea178.wt>, "\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0P\2\0\0\6\0\0\0\7\4\0\1\0\20\0\0\243\222\264\346\1\0\0\0\5\201\214(\350i*$\374\377\377\337\303\300c\200\300f\246\0\0\0\3_id\0H\0\0\0\5id\0\20\0\0\0\4\243,\276\205\7\376F\244\205z\r\343\216\216\f\376\5uid"..., 4096, 4096) = 4096 <0.000124>
[pid   238] pwrite64(49</data/db/collection-2fb69242-1b95-4a08-9939-23172e5ea178.wt>, "\0\0\0\0\0\0\0\0\t\0\0\0\0\0\0\0E\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\25472\200\1\0\0\0\5\08*\350i*$\374\377\377\337\303\300\207\345\217\377\377\337\300\207\200\201\344\346\264rc\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8192) = 4096 <0.000058>
[pid   238] pwrite64(49</data/db/collection-2fb69242-1b95-4a08-9939-23172e5ea178.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0002\0\0\0\n\0\0\0\1\0\0\1\0\20\0\0V\335F\25\1\0\0\0\342\365\32\200\317\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 12288) = 4096 <0.000085>
[pid   238] pwrite64(49</data/db/collection-2fb69242-1b95-4a08-9939-23172e5ea178.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0\21\0\0\0\1\0\0\1\0\20\0\0\305\232\373\24\1\0\0\0\342\365\32\200\342\37\300\317\300\342/\300\342\37\300\200\201\212\342o\300\0\0\0\0\0\0\304Oencryption=(keyid=,name=),block"..., 4096, 16384) = 4096 <0.000010>
[pid   238] pwrite64(41</data/db/collection-7dc3bad2-7f29-4853-b83a-a0d75ce4091c.wt>, "\0\0\0\0\0\0\0\0\v\0\0\0\0\0\0\0\345\33\0\0H\0\0\0\7\4\0\1\0 \0\0\317@\335\234\1\0\0\0%\350i*$\335\377\377\337\301\200\235]\0\0\0\2op\0\2\0\0\0n\0\2ns\0\1\0\0\0\0\3o\0\35\0\0\0\2msg\0\17\0\0\0initiatin"..., 8192, 20480) = 8192 <0.000038>
[pid   238] pwrite64(41</data/db/collection-7dc3bad2-7f29-4853-b83a-a0d75ce4091c.wt>, "\0\0\0\0\0\0\0\0\f\0\0\0\0\0\0\0004\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\2772\240\200\1\0\0\0\5\08\0\207\204\202\344\234\335 \217\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 28672) = 4096 <0.000027>
[pid   238] pwrite64(41</data/db/collection-7dc3bad2-7f29-4853-b83a-a0d75ce4091c.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0004\0\0\0\f\0\0\0\1\0\0\1\0\20\0\0^\245\335\343\1\0\0\0\342\365\32\200\342/\300\342\17\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 32768) = 4096 <0.000017>
[pid   238] pwrite64(41</data/db/collection-7dc3bad2-7f29-4853-b83a-a0d75ce4091c.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0=\0\0\0\25\0\0\0\1\0\0\1\0\20\0\0\372\257\271\374\1\0\0\0\342\365\32\200\342o\300\317\300\317\300\342\37\300\342\177\300\317\300\200\201\215\342\217\300\0\0\0\0\0\0\304jencryption=(keyid=,name=),b"..., 4096, 36864) = 4096 <0.000009>
[pid   238] pwrite64(24</data/db/collection-d3c1ce81-a90b-434a-867e-d985cc474e98.wt>, "\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0s\0\0\0\2\0\0\0\7\4\0\1\0\20\0\0?\330w8\1\0\0\0\5\201\200\207G\0\0\0\2_id\0\30\0\0\0oplogTruncateAfterPoint\0\21oplogTruncateAfter"..., 4096, 4096) = 4096 <0.000137>
[pid   238] pwrite64(24</data/db/collection-d3c1ce81-a90b-434a-867e-d985cc474e98.wt>, "\0\0\0\0\0\0\0\0\t\0\0\0\0\0\0\0004\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\271p\261a\1\0\0\0\5\08\0\207\200\201\3448w\267\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8192) = 4096 <0.000087>
[pid   238] pwrite64(24</data/db/collection-d3c1ce81-a90b-434a-867e-d985cc474e98.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0002\0\0\0\n\0\0\0\1\0\0\1\0\20\0\0V\335F\25\1\0\0\0\342\365\32\200\317\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 12288) = 4096 <0.000102>
[pid   238] pwrite64(24</data/db/collection-d3c1ce81-a90b-434a-867e-d985cc474e98.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0\21\0\0\0\1\0\0\1\0\20\0\0\203\206\33-\1\0\0\0\342\365\32\200\342\37\300\317\300\342/\300\342\37\300\200\201\212\342o\300\0\0\0\0\0\0\304Mencryption=(keyid=,name=),block"..., 4096, 16384) = 4096 <0.000074>
[pid   238] pwrite64(18</data/db/sizeStorer.wt>, "\0\0\0\0\0\0\0\0\v\0\0\0\0\0\0\0a\5\0\0\34\0\0\0\7\4\0\1\0\20\0\0\256dT\370\1\0\0\0Itable:_mdb_catalog\257+\0\0\0\22numRecords\0\22\0\0\0\0\0\0\0\22dataSize\0\363\34\0\0\0\0"..., 4096, 20480) = 4096 <0.000064>
[pid   238] pwrite64(18</data/db/sizeStorer.wt>, "\0\0\0\0\0\0\0\0\f\0\0\0\0\0\0\0004\0\0\0\2\0\0\0\6 \0\1\0\20\0\0W\243\10:\1\0\0\0\5\08\0\207\204\201\344\370TDn\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 24576) = 4096 <0.000014>
[pid   238] pwrite64(18</data/db/sizeStorer.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0\0\v\0\0\0\1\0\0\1\0\20\0\0\364'AB\1\0\0\0\342\365\32\200\342/\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 28672) = 4096 <0.000047>
[pid   238] pwrite64(18</data/db/sizeStorer.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0003\0\0\0\v\0\0\0\1\0\0\1\0\20\0\0\34\265ch\1\0\0\0\342\365\32\200\317\300\342\37\300\200\201\215\342o\300\0\0\0\0\0\0\3045encryption=(keyid=,name=),block_metad"..., 4096, 32768) = 4096 <0.000027>
[pid   238] pwrite64(16</data/db/WiredTigerHS.wt>, "\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0h\n\0\0\24\0\0\0\7\4\0\1\0\20\0\0:9\224\230\1\0\0\0005\203\201\212\350i*$\335\377\377\337\303\200\214x\350i*$\335\377\377\337\303\300\21\201\207\200\301=\350i*$\335\377\377\337\304\350i*$\335\377\377\337\303\203j\1\0\0\3md\0\306"..., 4096, 4096) = 4096 <0.000129>
[pid   238] pwrite64(16</data/db/WiredTigerHS.wt>, "\0\0\0\0\0\0\0\0\t\0\0\0\0\0\0\0M\0\0\0\2\0\0\0\6 \0\1\0\20\0\0\31\333Y\221\1\0\0\0\5\08~\350i*$\335\377\377\337\303\300\207\345\256\377\377\337\277\345\256\377\377\337\300\200\200\207\200\201\344\230\224\30\372\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8192) = 4096 <0.000046>
[pid   238] pwrite64(16</data/db/WiredTigerHS.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0002\0\0\0\n\0\0\0\1\0\0\1\0\20\0\0V\335F\25\1\0\0\0\342\365\32\200\317\300\337\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 12288) = 4096 <0.000023>
[pid   238] pwrite64(16</data/db/WiredTigerHS.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\09\0\0\0\21\0\0\0\1\0\0\1\0\20\0\0$jAb\1\0\0\0\342\365\32\200\342\37\300\317\300\342/\300\342\37\300\200\201\212\342o\300\0\0\0\0\0\0\305\230encryption=(keyid=,name=),block"..., 4096, 16384) = 4096 <0.000062>
[pid   238] pwrite64(15</data/db/journal/WiredTigerLog.0000000001>, "\200\0\0\0)\300\374y\0\0\0\0\0\0\0\0\204\207\202\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 13568, 201088) = 13568 <0.000091>
[pid   238] pwrite64(13</data/db/WiredTiger.wt>, "\0\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\24n\0\0f\0\0\0\7\4\0\1\0p\0\0Y\232Q~\1\0\0\0Ycolgroup:_mdb_catalog\0\200\300^app_metadata=(formatVersion=1),ass"..., 28672, 4096) = 28672 <0.000034>
[pid   238] pwrite64(13</data/db/WiredTiger.wt>, "\0\0\0\0\0\0\0\0\34\0\0\0\0\0\0\0\5m\0\0(\0\0\0\7\4\0\1\0p\0\0\253|\333*\1\0\0\0\341file:collection-af1de828-88d4-462d-8163-138c3e6a94de.wt\0\200\304\213"..., 28672, 102400) = 28672 <0.000184>
[pid   238] pwrite64(13</data/db/WiredTiger.wt>, "\0\0\0\0\0\0\0\0\35\0\0\0\0\0\0\0*b\0\0l\0\0\0\7\4\0\1\0p\0\0002pLs\1\0\0\0\315file:index-679f8d22-1c9e-4852-8909-903c3809d2b4.wt\0\200\304\305acces"..., 28672, 131072) = 28672 <0.000018>
[pid   238] pwrite64(13</data/db/WiredTiger.wt>, "\0\0\0\0\0\0\0\0\36\0\0\0\0\0\0\0o\0\0\0\6\0\0\0\6 \0\1\0\20\0\0\352=F\361\1\0\0\0\5\08 \300\212\207\200\207\344~Qz\31Ifile:collection-af8 \300\212\207\230\207\344*\333\\k5file:index-678"..., 4096, 32768) = 4096 <0.000046>
[pid   238] pwrite64(13</data/db/WiredTiger.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0:\0\0\0\22\0\0\0\1\0\0\1\0\20\0\0\2\262\230\230\1\0\0\0\342\365\32\200\317\300\342_\300\343\1o\300\342\277\300\200\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 36864) = 4096 <0.000027>
[pid   238] pwrite64(13</data/db/WiredTiger.wt>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0@\0\0\0\30\0\0\0\1\0\0\1\0\20\0\0}c\245(\1\0\0\0\342\365\32\200\342\177\300\317\300\342\217\300\342\277\300\343\2O\300\342\177\300\200\201\237\343\2\357\300\0\0\0\0\0\305Hencryption=(keyid=,name="..., 4096, 40960) = 4096 <0.000124>
[pid   238] pwrite64(64</data/db/WiredTiger.turtle.set>, "WiredTiger version string\nWiredTiger 12.0.0: (November 15, 2024)\nWiredTiger version\nmajor=12,minor=0"..., 1471, 0) = 1471 <0.000048>
[pid   238] pwrite64(15</data/db/journal/WiredTigerLog.0000000001>, "\200\0\0\0D\376\33\222\0\0\0\0\0\0\0\0\200\201\343\2\361@\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128, 214656) = 128 <0.000029>

Here is a quick script to run with mongosh that gets the URI for all collections:

let results = [];  
const dbNames = db.adminCommand({ listDatabases: 1 }).databases.map(d => d.name);  
// For each database
for (const dbName of dbNames) {  
  const currentDB = db.getSiblingDB(dbName);  
  const collNames = currentDB.getCollectionNames();  
  // Get collections
  for (const collName of collNames) {  
    const stats = currentDB[collName].stats({ indexDetails: true });  
    // Collection URI  
    if (stats.wiredTiger?.uri) {  
      results.push({  
        id: (stats.wiredTiger.uri.replace(/^statistics:table:/, "")),  
        name: `${dbName}.${collName}`  
      });  
    }  
    // Index URIs  
    if (stats.indexDetails) {  
      for (const [indexName, details] of Object.entries(stats.indexDetails)) {  
        if (details.uri) {  
          results.push({  
            id: details.uri.replace(/^statistics:table:/, "     "),  // the 5-space prefix indents index rows in the output
            name: `${dbName}.${collName} "${indexName}"`  
          });  
        }  
      }  
    }  
  }  
}  
// Sort by the identifier  
results.sort((a, b) => a.id.localeCompare(b.id));  
// Print  
results.forEach(r => {  
  print(`${r.id} ${r.name}`);  
});  

Here is an example of output:

     index-0076544d-0e60-4613-ac74-798b08940e83 config.sampledQueriesDiff "_id_"
     index-09da90b9-dcaf-42b6-ad0d-17fe23efa6f9 test.demo "_id_"
     index-0bdb5c19-b8d1-4329-a85e-dce59f589b66 config.transactions "_id_"
     index-11ae14c2-0298-413c-8ed4-148f806cfb2a admin.system.version "_id_"
     index-11e8073c-b849-4c57-8abd-382d4df08cf0 local.replset.oplogTruncateAfterPoint "_id_"
     index-1394e651-d8fa-43c3-8215-cc7d2c860e8f local.replset.initialSyncId "_id_"
     index-4ac5f987-a260-4952-96fa-4dfd53ae2442 local.replset.election "_id_"
     index-578076ff-91f4-4150-abaf-ab0d8ee99ca1 config.sampledQueries "SampledQueriesTTLIndex"
     index-58332e85-20c2-4b5d-a14c-4d2a62f2f7f4 config.image_collection "_id_"
     index-594731ae-f513-4443-afd8-7be83ee371dd config.transactions "parent_lsid"
     index-63c4c350-4cbf-43e5-b963-2470388c13ea local.startup_log "_id_"
     index-679f8d22-1c9e-4852-8909-903c3809d2b4 local.system.replset "_id_"
     index-6e10b714-4320-46ad-8dd8-826e181a80c8 local.replset.minvalid "_id_"
     index-7f7e27ba-c599-473c-ae45-098c95bf0fa5 admin.system.keys "_id_"
     index-acf8df7a-79fd-4ca3-90af-2065c56731fb config.analyzeShardKeySplitPoints "_id_"
     index-c3f19b26-8cfe-4162-b72e-0457e1290c91 config.analyzeShardKeySplitPoints "AnalyzeShardKeySplitPointsTTLIndex"
     index-e3de6faa-ce85-49bd-811e-2718dbfe68c6 config.sampledQueries "_id_"
     index-e435b3cd-214c-4e54-a3a0-2b502ad4e2f3 local.system.rollback.id "_id_"
     index-f2ab8612-b224-4284-8d98-178241c2c40f config.system.indexBuilds "_id_"
     index-f55353fc-6037-4919-9240-0832291acd86 config.sampledQueriesDiff "SampledQueriesDiffTTLIndex"
collection-0a266347-b02a-4b6d-aad9-d5fea0978ecc admin.system.keys
collection-2f934890-bfba-4bf1-8e30-a4983d1e443e config.image_collection
collection-2fb69242-1b95-4a08-9939-23172e5ea178 config.transactions
collection-44998cb5-8f3c-46fd-b793-bb7af13fc6fc local.replset.initialSyncId
collection-7dc3bad2-7f29-4853-b83a-a0d75ce4091c local.oplog.rs
collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1 test.demo
collection-7f227aa8-c421-4095-84ed-e188eca693b5 local.system.replset
collection-94cfdac0-3772-45cc-8bf4-e93691a663b1 config.analyzeShardKeySplitPoints
collection-aec65821-ce4d-4dcc-a753-b445dac4b7b8 admin.system.version
collection-af1de828-88d4-462d-8163-138c3e6a94de config.sampledQueries
collection-b7b41814-325d-4331-96b2-57896547a8d8 config.sampledQueriesDiff
collection-cb1d9d74-51cd-404e-8baf-559a1f583864 local.startup_log
collection-d3c1ce81-a90b-434a-867e-d985cc474e98 local.replset.oplogTruncateAfterPoint
collection-e9e489e7-441d-449a-8972-71c6610d217a config.system.preimages
collection-ed1a19e5-322f-4dbb-adaa-bbfc474a0d55 local.system.rollback.id
collection-ee3b161c-625b-48c4-9d23-620aa7b776e1 config.system.indexBuilds
collection-f5621f15-8658-4df7-b2c6-f6e0e51eec73 local.replset.minvalid
collection-fccee054-c41f-4585-8354-163477ef91c2 local.replset.election

This helps me to understand my trace:

  • collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1 is test.demo which I updated
  • collection-2f934890-bfba-4bf1-8e30-a4983d1e443e is config.image_collection that stores pre/post-image documents for retryable findAndModify operations, enabling idempotent retries across failovers.
  • collection-2fb69242-1b95-4a08-9939-23172e5ea178 is config.transactions that tracks session transaction state including startOpTime for active transactions.
  • collection-7dc3bad2-7f29-4853-b83a-a0d75ce4091c is local.oplog.rs, which is the capped collection containing all replication operations in timestamp order
  • collection-d3c1ce81-a90b-434a-867e-d985cc474e98 is local.replset.oplogTruncateAfterPoint which marks the safe truncation point to remove oplog holes after unclean shutdown (crash recovery)
  • collection-af1de828-88d4-462d-8163-138c3e6a94de is config.sampledQueries that stores sampled query shapes for query analyzer and sharding key analysis
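
If you only need to resolve one file name from a trace, a quick lookup over the same results array works too. This is a small sketch that reuses the results variable from the script above (the trim() handles the 5-space prefix added to index entries):

// Map identifier -> namespace from the results built above,
// then resolve the ident part of a WiredTiger file name seen in strace.
const byIdent = Object.fromEntries(results.map(r => [r.id.trim(), r.name]));
const traced = "/data/db/collection-7e76acd8-718e-4fd6-93f2-fd08eaac3ed1.wt";
const ident = traced.split("/").pop().replace(/\.wt$/, "");
print(`${ident} -> ${byIdent[ident] ?? "unknown"}`);   // test.demo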

Using sysbench to measure how MySQL performance changes over time, November 2025 edition

This has results for the sysbench benchmark on a small and big server for MySQL versions 5.6 through 9.5. The good news is that the arrival rate of performance regressions has mostly stopped as of 8.0.43. The bad news is that there were large regressions from 5.6 through 8.0.

tl;dr for low-concurrency tests

  • for point queries
    • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries without aggregation
    • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries with aggregation
    • MySQL 5.7.44 is faster than 5.6.51 for two tests, as fast for one and gets about 15% less QPS for the other five
    • MySQL 8.0 to 9.5 are faster than 5.6.51 for one test, as fast for one and get about 30% less QPS for the other six
  • for writes
    • MySQL 5.7.44 gets between 10% and 20% less QPS than 5.6.51 for most tests
    • MySQL 8.0 to 9.5 get between 40% to 50% less QPS than 5.6.51 for most tests
tl;dr for high-concurrency tests
  • for point queries
    • for most tests MySQL 5.7 to 9.5 get at least 1.5X more QPS than 5.6.51
    • for tests that use secondary indexes MySQL 5.7 to 9.5 get about 25% less QPS than 5.6.51
  • for range queries without aggregation
    • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
    • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • for range queries with aggregation
    • MySQL 5.7.44 is faster than 5.6.51 for six tests, as fast for one test and gets about 20% less QPS for one test
    • MySQL 8.0 to 9.5 are a lot faster than 5.6.51 for two tests, about as fast for three tests and get between 10% and 30% less QPS for the other three tests
  • for writes
    • MySQL 5.7.44 gets more QPS than 5.6.51 for all tests
    • MySQL 8.0 to 9.5 get more QPS than 5.6.51 for all tests

Builds, configuration and hardware

I built MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

I used two servers:
  • small
    • an ASUS ExpertCenter PN53 with AMD Ryzen 7735HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with ext4 and discard enabled.
  • big
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
The config files are here.
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. On the small server the benchmark is run with 1 client and 1 table with 50M rows. On the big server the benchmark is run with 12 clients and 8 tables with 10M rows per table. 

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 5.6.51)
When the relative QPS is > 1 then some version is faster than MySQL 5.6.51.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 5.6.51.

Values from iostat and vmstat divided by QPS are here for the small server and the big server. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

This is from the small server.
  • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the hot-points test.
This is from the large server.
  • For most point query tests MySQL 5.7 to 9.5 get at least 1.5X more QPS than 5.6.51
    • MySQL 5.7 to 9.5 use less CPU, see vmstat results for the hot-points test.
  • For tests that use secondary indexes (*-si) MySQL 5.7 to 9.5 get about 25% less QPS than 5.6.51.
    • This result is similar to what happens on the small server above.
    • The regressions are from extra CPU overhead, see vmstat results
  • MySQL 5.7 does better than 8.0 to 9.5. There are few regressions after MySQL 8.0.
Results: range queries without aggregation

This is from the small server.
  • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the scan test.
This is from the large server.
  • MySQL 5.7.44 gets about 10% less QPS than 5.6.51
  • MySQL 8.0 through 9.5 get about 30% less QPS than 5.6.51
  • There are few regressions after MySQL 8.0
  • New CPU overheads explain the regressions. See the vmstat results for the scan test.
Results: range queries with aggregation

This is from the small server.
  • for the read-only-distinct test, MySQL 5.7 to 9.5 are faster than 5.6.51
  • for the read-only_range=X tests
    • with the longest range scan (*_range=10000), MySQL 5.7.44 is faster than 5.6.51 and 8.0 to 9.5 have the same QPS as 5.6.51
    • with shorter range scans (*_range=100 & *_range=10) MySQL 5.6.51 is faster than 5.7 to 9.5. This implies that the regressions are from code above the storage engine layer.
    • From vmstat results the perf differences are explained by CPU overheads
  • for the other tests
    • MySQL 5.7.44 gets about 15% less QPS than 5.6.51
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51
    • From vmstat results for read-only-count the reason is new CPU overhead
This is from the large server.
  • for the read-only-distinct test, MySQL 5.7 to 9.5 are faster than 5.6.51
  • for the read-only_range=X tests
    • MySQL 5.7.44 is as fast as 5.6.51 for the longest range scan and faster than 5.6.51 for the shorter range scans
    • MySQL 8.0 to 9.5 are much faster than 5.6.51 for the longest range scan and somewhat faster for the shorter range scans
    • From vmstat results the perf differences are explained by CPU overheads and possibly by changes in mutex contention
  • for the other tests
    • MySQL 5.7.44 gets about 20% less QPS than 5.6.51 for read-only-count and about 10% more QPS than 5.6.51 for read-only-simple and read-only-sum
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51 for read-only-count and up to 20% less QPS than 5.6.51 for read-only-simple and read-only-sum
    • From vmstat results for read-only-count the reason is new CPU overhead
Results: writes

This is from the small server.
  • For most tests
    • MySQL 5.7.44 gets between 10% and 20% less QPS than 5.6.51
    • MySQL 8.0 to 9.5 get between 40% to 50% less QPS than 5.6.51
    • From vmstat results for the insert test, MySQL 5.7 to 9.5 use a lot more CPU
  • For the update-index test
    • MySQL 5.7.44 is faster than 5.6.51
    • MySQL 8.0 to 9.5 get about 10% less QPS than 5.6.51
    • From vmstat metrics MySQL 5.6.51 has more mutex contention
  • For the update-inlist test
    • MySQL 5.7.44 is as fast as 5.6.51
    • MySQL 8.0 to 9.5 get about 30% less QPS than 5.6.51
    • From vmstat metrics MySQL 5.6.51 has more mutex contention
This is from the large server and the y-axis truncates the result for the update-index test to improve readability for the other results.
  • For all tests MySQL 5.7 to 9.5 get more QPS than 5.6.51
    • From vmstat results for the write-only test MySQL 5.6.51 uses more CPU and has more mutex contention.
  • For some tests (read-write_range=X) MySQL 8.0 to 9.5 get less QPS than 5.7.44
    • These are the classic sysbench transactions with different range scan lengths, and performance is dominated by the range query response time, so 5.7 is fastest.
  • For most tests MySQL 5.7 to 9.5 have similar perf with two exceptions
    • For the delete test, MySQL 8.0 to 9.5 are faster than 5.7. From vmstat metrics 5.7 uses more CPU and has more mutex contention than 8.0 to 9.5.
    • For the update-inlist test, MySQL 8.0 to 9.5 are faster than 5.7. From vmstat metrics 5.7 uses more CPU than 8.0 to 9.5.
This is also from the large server and does not truncate the update-index test result.

November 27, 2025

MongoDB Index Intersection (and PostgreSQL Bitmap-and)

You won’t always have a perfect index for every query, but may have single-field indexes for each filter. In such cases, PostgreSQL can use Bitmap Scan to combine these indexes. MongoDB is also capable of merging multiple index bounds in a single scan or using index intersection to combine separate scans. Yet, the MongoDB query planner rarely selects index intersection. Let’s look at the reasons behind this.

TL;DR: if you think you need index intersection, you probably need better compound indexes.

Test Setup

I create a collection with one hundred thousand documents and two fields, with an index on each:


let bulk = [];    
for (let i = 0; i < 100000; i++) {    
  bulk.push({      
    a: Math.floor(Math.random()*100),     
    b: Math.floor(Math.random()*100)      
  });    
}    
db.demo.insertMany(bulk);

// separate indexes
db.demo.createIndex({ a: 1 });    
db.demo.createIndex({ b: 1 });  

In PostgreSQL, we’ll mirror the dataset as follows:

CREATE TABLE demo AS
SELECT id,
  (random()*100)::int AS a,
  (random()*100)::int AS b
FROM generate_series(1,100000) id;

CREATE INDEX demo_a_idx ON demo(a);
CREATE INDEX demo_b_idx ON demo(b);

In my MongoDB collection of 100,000 documents, only nine documents have both the "a" and "b" fields set to 42:

mongo> db.demo.countDocuments()

100000

mongo> db.demo.find({ a: 42, b: 42 }).showRecordId()

[
  { _id: ObjectId('6928697ae5fd2cdba9d53f54'), a: 42, b: 42, '$recordId': Long('36499') },
  { _id: ObjectId('6928697ae5fd2cdba9d54081'), a: 42, b: 42, '$recordId': Long('36800') },
  { _id: ObjectId('6928697ae5fd2cdba9d54a7c'), a: 42, b: 42, '$recordId': Long('39355') },
  { _id: ObjectId('6928697ae5fd2cdba9d55a3e'), a: 42, b: 42, '$recordId': Long('43389') },
  { _id: ObjectId('6928697ae5fd2cdba9d5a214'), a: 42, b: 42, '$recordId': Long('61779') },
  { _id: ObjectId('6928697ae5fd2cdba9d5e52a'), a: 42, b: 42, '$recordId': Long('78953') },
  { _id: ObjectId('6928697ae5fd2cdba9d5eeea'), a: 42, b: 42, '$recordId': Long('81449') },
  { _id: ObjectId('6928697ae5fd2cdba9d61f48'), a: 42, b: 42, '$recordId': Long('93831') },
  { _id: ObjectId('6928697ae5fd2cdba9d61f97'), a: 42, b: 42, '$recordId': Long('93910') }
]

In my PostgreSQL database, there are 100,000 rows and, among them, nine rows have the value 42 in both the "a" and "b" columns:

postgres=# select count(*) from demo;
 count
--------
 100000
(1 row)

postgres=# SELECT *, ctid FROM demo WHERE a = 42 AND b = 42;

 a  | b  |  id   |   ctid
----+----+-------+-----------
 42 | 42 |  4734 | (25,109)
 42 | 42 | 15678 | (84,138)
 42 | 42 | 29464 | (159,49)
 42 | 42 | 29748 | (160,148)
 42 | 42 | 31139 | (168,59)
 42 | 42 | 37785 | (204,45)
 42 | 42 | 55112 | (297,167)
 42 | 42 | 85823 | (463,168)
 42 | 42 | 88707 | (479,92)

I displayed the CTID for PostgreSQL and the RecordID for MongoDB, to see the distribution over the heap table (for PostgreSQL) or the WiredTiger B-Tree (for MongoDB).
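
On the PostgreSQL side (the "Bitmap-and" of the title), a quick EXPLAIN shows how the planner can combine the two single-column indexes. This is a sketch against the demo table created above; with this data distribution the plan typically contains a BitmapAnd of demo_a_idx and demo_b_idx feeding a Bitmap Heap Scan, though exact costs and row counts will vary:

-- With only single-column indexes available, PostgreSQL can build a bitmap
-- from each index and AND them before visiting the heap pages.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM demo WHERE a = 42 AND b = 42;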

MongoDB possible execution plans

I have executed db.demo.find({ a: 42, b: 42 }), and multiple plans have been evaluated:

mongo> db.demo.getPlanCache().list();
[
  {
    version: '1',
    queryHash: 'BBC007A6',
    planCacheShapeHash: 'BBC007A6',
    planCacheKey: '51C56FDD',
    isActive: true,
    works: Long('968'),
    worksType: 'works',
    timeOfCreation: ISODate('2025-11-27T15:09:11.069Z'),
    createdFromQuery: { query: { a: 42, b: 42 }, sort: {}, projection: {} },
    cachedPlan: {
      stage: 'FETCH',
      filter: { b: { '$eq': 42 } },
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: { a: 1 },
        indexName: 'a_1',
        isMultiKey: false,
        multiKeyPaths: { a: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { a: [ '[42, 42]' ] }
      }
    },
    creationExecStats: [
      {
        nReturned: 12,
        executionTimeMillisEstimate: 0,
        totalKeysExamined: 967,
        totalDocsExamined: 967,
        executionStages: {
          stage: 'FETCH',
          filter: { b: { '$eq': 42 } },
          nReturned: 12,
          executionTimeMillisEstimate: 0,
          works: 968,
          advanced: 12,
          needTime: 955,
          needYield: 0,
          saveState: 0,
          restoreState: 0,
          isEOF: 1,
          docsExamined: 967,
          alreadyHasObj: 0,
          inputStage: {
            stage: 'IXSCAN',
            nReturned: 967,
            executionTimeMillisEstimate: 0,
            works: 968,
            advanced: 967,
            needTime: 0,
            needYield: 0,
            saveState: 0,
            restoreState: 0,
            isEOF: 1,
            keyPattern: { a: 1 },
            indexName: 'a_1',
            isMultiKey: false,
            multiKeyPaths: { a: [] },
            isUnique: false,
            isSparse: false,
            isPartial: false,
            indexVersion: 2,
            direction: 'forward',
            indexBounds: { a: [Array] },
            keysExamined: 967,
            seeks: 1,
            dupsTested: 0,
            dupsDropped: 0
          }
        }
      },
      {
        nReturned: 10,
        executionTimeMillisEstimate: 0,
        totalKeysExamined: 968,
        totalDocsExamined: 968,
        executionStages: {
          stage: 'FETCH',
          filter: { a: { '$eq': 42 } },
          nReturned: 10,
          executionTimeMillisEstimate: 0,
          works: 968,
          advanced: 10,
          needTime: 958,
          needYield: 0,
          saveState: 0,
          restoreState: 0,
          isEOF: 0,
          docsExamined: 968,
          alreadyHasObj: 0,
          inputStage: {
            stage: 'IXSCAN',
            nReturned: 968,
            executionTimeMillisEstimate: 0,
            works: 968,
            advanced: 968,
            needTime: 0,
            needYield: 0,
            saveState: 0,
            restoreState: 0,
            isEOF: 0,
            keyPattern: { b: 1 },
            indexName: 'b_1',
            isMultiKey: false,
            multiKeyPaths: { b: [] },
            isUnique: false,
            isSparse: false,
            isPartial: false,
            indexVersion: 2,
            direction: 'forward',
            indexBounds: { b: [Array] },
            keysExamined: 968,
            seeks: 1,
            dupsTested: 0,
            dupsDropped: 0
          }
        }
      },
      {
        nReturned: 7,
        executionTimeMillisEstimate: 0,
        totalKeysExamined: 968,
        totalDocsExamined: 7,
        executionStages: {
          stage: 'FETCH',
          filter: { '$and': [ [Object], [Object] ] },
          nReturned: 7,
          executionTimeMillisEstimate: 0,
          works: 968,
          advanced: 7,
          needTime: 961,
          needYield: 0,
          saveState: 0,
          restoreState: 0,
          isEOF: 0,
          docsExamined: 7,
          alreadyHasObj: 0,
          inputStage: {
            stage: 'AND_SORTED',
            nReturned: 7,
            executionTimeMillisEstimate: 0,
            works: 968,
            advanced: 7,
            needTime: 961,
            needYield: 0,
            saveState: 0,
            restoreState: 0,
            isEOF: 0,
            failedAnd_0: 232,
            failedAnd_1: 230,
            inputStages: [ [Object], [Object] ]
          }
        }
      }
    ],
    candidatePlanScores: [ 2.012596694214876, 1.0105305785123966, 1.0073314049586777 ],
    indexFilterSet: false,
    estimatedSizeBytes: Long('4826'),
    solutionHash: Long('6151768200665613849'),
    host: '40ae92e83a12:27017'
  }
]

The cached plan uses only one index, on "a", but there are two additional possible plans in the cache: one using the index on "b" and another using a combination of both indexes with AND_SORTED. The scores (candidatePlanScores) are:

  • 2.012 for the index on "a"
  • 1.010 for the index on "b"
  • 1.007 for the AND_SORTED intersection of the indexes on "a" and "b"

This may be surprising: given how the data was generated, we should expect similar costs for the two indexes. We can see that during the trial period for the query plans, the index on "a" finished its scan (isEOF: 1), and even though the other two had similar performance and were about to finish, the trial period ended before they reached the end (isEOF: 0). MongoDB adds an EOF bonus of 1 when a trial plan finishes before the others, and that explains the higher score. So it's not really that the index on "a" is better than the other plans; all plans are good, and the one that started first also finished first and got the bonus.

In addition to that, there's another small penalty on index intersection. Finally the scores are:

  • Index on "a": 1 (base) + 0.012 (productivity) + 1.0 (EOF bonus) = 2.012
  • Index on "b": 1 (base) + 0.010 (productivity) + 0 (no EOF) = 1.010
  • AND_SORTED: 1 (base) + 0.007 (productivity) + 0 (no EOF) = 1.007

Without the penalties, AND_SORTED would still not have been chosen. The problem is that the score measures productivity in units of work (advanced / works) but does not account for how heavy each unit is: a single-index scan must fetch the document and apply the additional filter within one work unit, whereas AND_SORTED waits for the intersection and skips the extra fetch and filter.
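
To compare the candidates yourself without relying on the trial scores, you can hint each single-field index and look at executionStats. This is a sketch on the same demo collection (numbers will vary per run); the compound index the TL;DR recommends, { a: 1, b: 1 }, would examine only the nine matching keys instead of roughly a thousand per index:

// Force each single-field index in turn and print the work it does.
["a", "b"].forEach(field => {
  const s = db.demo.find({ a: 42, b: 42 })
    .hint({ [field]: 1 })
    .explain("executionStats").executionStats;
  print(`${field}_1: keys=${s.totalKeysExamined} docs=${s.totalDocsExamined} ms=${s.executionTimeMillis}`);
});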

To show the AND_SORTED plan, I'll force it on my lab database for the following examples in this article:

// get the current parameters (default):
mongo> const allParams = db.adminCommand({ getParameter: "*" });
mongo> Object.keys(allParams).filter(k => k.toLowerCase().includes("intersection")).forEach(k => print(k + " : " + allParams[k]));

internalQueryForceIntersectionPlans : false
internalQueryPlannerEnableHashIntersection : false
internalQueryPlannerEnableIndexIntersection : true

// set all at true:
db.adminCommand({
  setParameter: 1,
  internalQueryPlannerEnableIndexIntersection: true,
  internalQueryPlannerEnableHashIntersection: true,
  internalQueryForceIntersectionPlans: true
});

I have set internalQueryForceIntersectionPlans to force index intersection (it still uses the query planner, but with a 3-point boost to the score). AND_SORTED index intersection is possible by default, but I also enabled hash intersection (AND_HASH) for another test later that cannot use AND_SORTED.
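
When you are done experimenting, the defaults listed above can be restored the same way; a minimal sketch:

// Restore the defaults shown by getParameter earlier:
// force: false, hash intersection: false, index intersection: true.
db.adminCommand({
  setParameter: 1,
  internalQueryForceIntersectionPlans: false,
  internalQueryPlannerEnableHashIntersection: false,
  internalQueryPlannerEnableIndexIntersection: true
});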

Index Intersection in MongoDB

Now that I forced index intersection, I can observe it with execution statistics:

db.demo.find({ a: 42, b: 42 }).explain("executionStats").executionStats

The execution plan shows that each index was scanned over a single range (seeks: 1) and the two streams were combined with an AND_SORTED stage before fetching the documents for the result:

{
  executionSuccess: true,
  nReturned: 9,
  executionTimeMillis: 4,
  totalKeysExamined: 2009,
  totalDocsExamined: 9,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    filter: {
      '$and': [ { a: { '$eq': 42 } }, { b: { '$eq': 42 } } ]
    },
    nReturned: 9,
    executionTimeMillisEstimate: 0,
    works: 2010,
    advanced: 9,
    needTime: 2000,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 9,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'AND_SORTED',
      nReturned: 9,
      executionTimeMillisEstimate: 0,
      works: 2010,
      advanced: 9,
      needTime: 2000,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      failedAnd_0: 501,
      failedAnd_1: 495,
      inputStages: [
        {
          stage: 'IXSCAN',
          nReturned: 964,
          executionTimeMillisEstimate: 0,
          works: 965,
          advanced: 964,
          needTime: 0,
          needYield: 0,
          saveState: 0,
          restoreState: 0,
          isEOF: 1,
          keyPattern: { a: 1 },
          indexName: 'a_1',
          isMultiKey: false,
          multiKeyPaths: { a: [] },
          isUnique: false,
          isSparse: false,
          isPartial: false,
          indexVersion: 2,
          direction: 'forward',
          indexBounds: { a: [ '[42, 42]' ] },
          keysExamined: 964,
          seeks: 1,
          dupsTested: 0,
          dupsDropped: 0
        },
        {
          stage: 'IXSCAN',
          nReturned: 1045,
          executionTimeMillisEstimate: 0,
          works: 1045,
          advanced: 1045,
          needTime: 0,
          needYield: 0,
          saveState: 0,
          restoreState: 0,
          isEOF: 0,
          keyPattern: { b: 1 },
          indexName: 'b_1',
          isMultiKey: false,
          multiKeyPaths: { b: [] },
          isUnique: false,
          isSparse: false,
          isPartial: false,
          indexVersion: 2,
          direction: 'forward',
          indexBounds: { b: [ '[42, 42]' ] },
          keysExamined: 1045,
          seeks: 1,
          dupsTested: 0,
          
                                    
                                    
                                    
                                    
                                

Netflix consolidates relational database infrastructure on Amazon Aurora, achieving up to 75% improved performance

Netflix operates a global streaming service that serves hundreds of millions of users through a distributed microservices architecture. In this post, we examine the technical and operational challenges encountered by their Online Data Stores (ODS) team with their current self-managed distributed PostgreSQL-compatible database, the evaluation criteria used to select a database solution, and why they chose to migrate to Amazon Aurora PostgreSQL to meet their current and future performance needs. The migration to Aurora PostgreSQL improved their database infrastructure, achieving up to a 75% increase in performance and 28% cost savings across critical applications.

November 26, 2025

How Letta builds production-ready AI agents with Amazon Aurora PostgreSQL

With the Letta Developer Platform, you can create stateful agents with built-in context management (compaction, context rewriting, and context offloading) and persistence. Using the Letta API, you can create agents that are long-lived or achieve complex tasks without worrying about context overflow or model lock-in. In this post, we guide you through setting up Amazon Aurora Serverless as a database repository for storing Letta long-term memory. We show how to create an Aurora cluster in the cloud, configure Letta to connect to it, and deploy agents that persist their memory to Aurora. We also explore how to query the database directly to view agent state.

Let’s Rebuild the MySQL Community Together

Where We Are. We can all agree that the MySQL ecosystem isn’t in great shape right now. Take a look at Julia’s blog post [Analyzing the Heartbeat of the MySQL Server: A Look at Repository Statistics], which confirms what many of us have felt: Oracle isn’t as committed to MySQL and its ecosystem as it […]

ParadeDB 0.20.0: Simpler and Faster

Introducing search aggregation, V2 API as default, and performance improvements that eliminate the complexity between search and analytics in a single Postgres-native system.

November 25, 2025

Everything you don’t need to know about Amazon Aurora DSQL: Part 4 – DSQL components

Amazon Aurora DSQL employs an active-active distributed database design, wherein all database resources are peers and serve both write and read traffic within a Region and across Regions. This design facilitates synchronous data replication and automated zero data loss failover for single and multi-Region Aurora DSQL clusters. In this post, I discuss the individual components and the responsibilities of a multi-Region distributed database to provide an ACID-compliant, strongly consistent relational database.

Everything you don’t need to know about Amazon Aurora DSQL: Part 3 – Transaction processing

In this third post of the series, I examine the end-to-end processing of the two transaction types in Aurora DSQL: read-only and read-write. Amazon Aurora DSQL doesn’t have write-only transactions, since it’s imperative to verify the table schema or ensure the uniqueness of primary keys on each change – which results in them being read-write transactions as well.

Everything you don’t need to know about Amazon Aurora DSQL: Part 2 – Shallow view

In this second post, I examine Aurora DSQL's architecture and explain how its design decisions impact functionality—such as optimistic locking and PostgreSQL feature support—so you can assess compatibility with your applications. I provide a comprehensive overview of the underlying architecture, which is fully abstracted from the user.

Building the Future of MySQL: Announcing Plans for MySQL Vector Support and a MySQL Binlog Server

At Percona, our mission has always been to help you succeed with open source databases. We do that by listening to the community, understanding the challenges you face, and building the solutions you need. Now, after a comprehensive review of market trends and direct feedback from our customers and the MySQL community, we are excited […]

Mitigating Application Resource Overload with Targeted Task Cancellation

The Atropos paper (SOSP'25) argues that overload-control systems are built on a flawed assumption. They monitor global signals (like queue length or tail latency) to adjust admission control (throttling new arrivals or dropping random requests). This works when the bottleneck is CPU or network, but it fails when the real problem is inside the application. This considers only the symptoms but not the source. As a result, it drops the victims rather than the culprits.

Real systems often run into overload because one or two ill-timed requests monopolize an internal logical resource (like buffer pools, locks, and thread-pool queues). These few rogue whales have nonlinear effects. A single ill-timed dump query can thrash the buffer pool and cut throughput in half. A single backup thread combined with a heavy table scan can stall writes in MySQL as seen in Figure 3. The CPU metrics will not show this.

Atropos proposes a simple fix to this problem. Rather than throttling or dropping victims at the front of the system, it continuously monitors how tasks use internal logical resources and cancels the ones most responsible for the collapse. The name comes from the three Greek Fates. Clotho spins the thread of life, Lachesis measures its length, and Atropos cuts it when its time is up. The system plays the same role: it cuts the harmful task to protect the others.

The first interesting point in the paper is that their survey of 151 real applications (databases, search engines, and distributed systems) shows that almost all of them already contain safe cancellation hooks. Developers have already built the ability to stop a task cleanly. So the concerns about cancellation being too dangerous or too invasive turn out to be outdated. The support is already there; what's missing is a systematic way to decide which tasks to cancel in the first place.

To identify which tasks are the rogue whales,  Atropos introduces a lightweight way to track logical resource usage. It instruments three operations: acquiring a resource, releasing it, and waiting on it. These are cheap counters and timestamps. For memory this wraps code that gets or frees buffer pages. For locks it wraps lock acquisition and lock waits. For queues it wraps enqueue and dequeue. The runtime can now trace who is touching which resource and who is blocking whom.

The map above (from Nano Banana Pro) lays out the Atropos design. Let's walk through its main pieces.

The runtime borrows the "overload detection" from prior work (Breakwater). When latency rises above the SLO but throughput stays flat, overload is present. At that moment Atropos inspects the resource traces and identifies hot spots. It computes two measures per resource per task.

  • The first measure is the contention level. This is a measure of how much time tasks spend waiting due to that resource. High eviction ratios, long lock wait ratios, and long queue wait times all point to a rogue whale task.
  • The second measure is the resource gain. This estimates how much future load would be relieved if a task were canceled. This is the clever idea. The naive approach counts past usage, but a task that has consumed many pages but is almost finished does not pose much future harm. A task that has barely begun a huge scan is the real whale to watch, because its future thrashing potential is far larger than what its small past footprint suggests. The system uses progress estimates to scale current usage by remaining work. Databases usually track rows read vs rows expected. Other systems provide analogous application-specific counters for an estimate of future demand. This allows the controller to avoid killing nearly finished tasks.

The policy engine considers all resources together. Some tasks stress memory, others stress locks, and others queues. A single-resource policy would make narrow decisions. Instead Atropos identifies the non-dominated set of tasks across all resources and computes a weighted score, where the weights are the contention levels of the resources. The resultant score is the expected gain from canceling that task. The task with the highest score is canceled using the application's own cancellation mechanism.
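
As a rough illustration of that policy (not the paper's implementation; the resource names and numbers below are made up), the selection boils down to a weighted sum per task:

// Illustrative sketch: weight each task's estimated future resource gain by the
// measured contention level of that resource, then cancel the top scorer through
// the application's own cancellation hook.
const contention = { bufferPool: 0.7, locks: 0.2, queue: 0.1 };   // hypothetical levels
const tasks = [
  { id: "dump-query", futureGain: { bufferPool: 900, locks: 10, queue: 0 } },
  { id: "point-read", futureGain: { bufferPool: 2, locks: 1, queue: 0 } },
];
const score = t =>
  Object.entries(contention).reduce((sum, [res, w]) => sum + w * (t.futureGain[res] || 0), 0);
const victim = tasks.reduce((a, b) => (score(a) >= score(b) ? a : b));
console.log(`cancel ${victim.id}`);   // here: "dump-query", the rogue whale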



The evaluation of the paper is strong. This is a prototypical SOSP paper. The authors reproduce sixteen real overload bugs in systems like MySQL, Postgres, Elasticsearch, Solr, Apache, and etcd. These include buffer pool thrashing, lock convoys, queue stalls, and indexing contention. Across these cases Atropos restores throughput close to the baseline. The median result is around ninety six percent of normal throughput under overload. Tail latency stays near normal. The cancellation rate is tiny: less than one in ten thousand requests! Competing approaches must drop far more requests and still fail to restore throughput. The key here is the nonlinear effect of freeing internal logical resources. Canceling the right task unblocks a crowd of others.

As usual, Aleksey and I did our live blind read of it, which you can watch below. My annotated copy of the paper is also available here.

What I like about the paper is its clarity. The motivating examples are strong and concrete. The design is small, understandable, and modular. The progress-based future estimation is a good idea. The policy avoids naive heuristics. This is not a general overload controller for all situations. It does not manage CPU saturation or network overload. Instead it solves overload caused by contention on internal logical resources by killing the rogue whale, and quickly restoring normal behavior.

CedarDB Takes to the Slopes!

CedarDB Tames the Slopes (of the graph)

It would be natural to ask what skiing, or taking to the slopes, has to do with database systems. The connection comes from the graph below, which I created for a recent POC: it shows the time to run 750 queries against a table, at various row counts, in PostgreSQL, TimescaleDB, and CedarDB.

In skiing terminology, the blue line might be a black diamond run, the green one a beginner run, and the grey line along the x-axis is almost in an entirely different category and more fit for cross-country skiing. What blew my mind was that the slope of the CedarDB line was essentially zero.

The motivation for these database comparisons is a SQL query workload to support an app built by a biotech startup. The app interacts with their hardware device, which collects telemetry based on a user’s activities and, among other things, provides access to a library of content that the user might enjoy. The team at this startup provided us with some background on their database challenges, and that included lists such as top 20 slowest queries and top 20 heaviest queries. After analyzing these, we identified a query that appeared high on both lists, one that informs their recommendation pipeline, so we started there:

SELECT count(*) AS count_1
FROM (
 SELECT track_plays.time AS track_plays_time
 , track_plays.end_time AS track_plays_end_time
 , track_plays.user_id AS track_plays_user_id
 , track_plays.session_id AS track_plays_session_id
 , track_plays.track_id AS track_plays_track_id
 , track_plays.duration_sec AS track_plays_duration_sec
 , track_plays.source AS track_plays_source
 FROM track_plays
 WHERE
 track_plays.time >= $1
 AND track_plays.end_time <= $2
 AND track_plays.user_id = $3
 AND track_plays.duration_sec >= $4
) AS anon_1;

Looking at it, it’s not complex: just a table scan with a filter on a couple of TIMESTAMP and INTEGER columns.

How Timescale Makes the Slope Beginner Friendly

Credit where credit is due, let’s first take a look at how TimescaleDB managed to flatten the slope considerably for our POC customer. TimescaleDB is a PostgreSQL extension designed to optimize the processing of time-series data, which is exactly what the track_plays table tracks. To really see the effect above with TimescaleDB, you must explicitly set it up using Hypertables, which partition the data by the time column using hourly partitions (chunk_interval). This, however, requires advance knowledge of your data and query patterns.

...
WITH
(
 timescaledb.hypertable,
 timescaledb.partition_column='time',
 timescaledb.chunk_interval='1 hour'
);

TimescaleDB can then use the partitions at query time, selecting only those relevant to the query, which drastically speeds up many time-series workloads. While it’s reasonable to assume that you often know this kind of information beforehand, unfortunately, TimescaleDB’s Hypertables are not without tradeoffs:

  • Foreign key constraints are not supported between hypertables.
  • UPDATE statements that would move a row between chunks are not allowed.
  • All columns used for partitioning dimensions must be included in a unique index.
  • The time column used for partitioning cannot contain NULL values.

While these tradeoffs are acceptable for many workloads, the question is: Do you need to make them?

So, Where’s the Secret Sauce with CedarDB?

Having seen how a use-case specific database system can make workloads beginner friendly, let’s take a look at how CedarDB manages to play a different sport entirely. How do I account for the vastly different performance results I observed for this workload? After all, TimescaleDB is specifically built to tackle just this type of time series data challenge.

While the performance increase comes from a mix of many innovations, I want to highlight the one aspect I think maps best to what Hypertables do in TimescaleDB: the concept of “early pruning”. When scanning a table, CedarDB can check many predicates against block metadata alone and skip entirely any blocks that cannot qualify. Since a timestamp is internally just a 64-bit integer, timestamp comparisons are plain integer comparisons, and CedarDB can skip most data blocks when filtering on a time range, which is exactly what the heavy-hitter query does.
How is it able to achieve this? Here are some key aspects contributing to that (a simplified sketch of the idea follows the list):

  • CedarDB stores large chunks of data together into compressed blocks.
  • CedarDB stores statistics for each block (e.g., max value, min value, count).
  • If there is no value in a given data block that qualifies, per the query predicate, CedarDB avoids scanning that block.
  • Time-series data is usually inserted in roughly timestamp order. Even if the order is not perfect, timestamps that are close together generally end up in the same block.
  • Filters are especially useful for pruning in a time series data workload, where it’s common to filter by a timestamp column (e.g., all values of the last month), as illustrated here.
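
Here is a deliberately simplified sketch of that min/max pruning idea (the generic "zone map" technique, not CedarDB's actual code): per-block statistics let a scan skip blocks whose value range cannot match the predicate.

# Simplified illustration of min/max block pruning; block layout and names are my own.
from dataclasses import dataclass

@dataclass
class Block:
    min_time: int    # per-block statistics kept alongside the compressed data
    max_time: int
    rows: list       # stand-in for the block's rows (here: just their timestamps)

def scan_time_range(blocks, lo, hi):
    """Return timestamps with lo <= t <= hi, skipping blocks whose [min, max]
    range cannot overlap the predicate -- no decompression, no row visits."""
    out = []
    for b in blocks:
        if b.max_time < lo or b.min_time > hi:
            continue             # pruned on metadata alone
        out.extend(t for t in b.rows if lo <= t <= hi)
    return out

# Loosely time-ordered inserts keep nearby timestamps in the same block, so a
# narrow time filter prunes the vast majority of blocks.
blocks = [Block(0, 99, list(range(100))), Block(100, 199, list(range(100, 200)))]
print(len(scan_time_range(blocks, 150, 160)))  # 11 rows; only the second block is touched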

Sound familiar? That is exactly the effect Hypertables have in TimescaleDB, except it applies to all columns instead of one and requires no upfront configuration.

And the beauty of it is that you get all this without sacrificing flexibility in other areas.

Some Takeaways

What else did I learn from this exercise, besides that I need to get ready for skiing season? Here’s what I’ve jotted down:

  • The observations in the graphs above inspired me to continue with more data, so I moved the test jig off my M4 MacBook and onto an r7i.8xlarge instance in EC2. There, I gradually increased the row count in this table (via a data generator) from 25M rows, where the cumulative runtime for the 750 queries was 79 ms, to 1.13 billion rows, where it was 96 ms. That 1.13B rows is roughly 180x the size of the right-most data point in the graph, yet the query runtime in CedarDB remained essentially constant.

  • … So, while CedarDB is amazing on the 8 vCPU & 32 GB RAM of my MacBook Docker setup, when it has access to the 32 vCPU & 128 GB of RAM of the larger EC2 instance, it takes full advantage of all that CPU and memory.

  • The efficiency here takes two forms: (1) you pay less for IaaS or data center capacity, and (2) you gain simplicity through simpler DDL, fewer indexes, and the ability to consolidate multiple workloads onto a single, general-purpose database.

  • Finally, this radical performance boost may embolden you to consider some new use cases you may have been hesitant to explore using your existing database engine.

Note: If the folks we’re collaborating with on this give us the “OK”, we’ll share the code used here and update this with links.

Call to action: If you’re curious and want to give CedarDB a try in AWS, the CloudFormation template that I used to automate setup of my CedarDB in EC2 can help you get started pretty easily.

Thank you!

If you’d like to discuss your data challenges with us, we’d love to hear from you.

Appendix: The Nuts and Bolts

Show Benchmark Setup

To investigate how this workload would perform in CedarDB, we needed a few things:

  • Data generator: Pair programming alongside ChatGPT knocked this out quickly, in Python. It was designed based on the DDL for two tables which were key to a different query, along with the queries themselves, to ensure our joins resolved properly. Ultimately, we relied only on the data for the single table, track_plays (DDL below).
  • Load simulator: Again, ChatGPT came in handy and rendered a mostly-working version of a multi-threaded Go/pgx client.
  • A couple of shell scripts for running tests and loading data.
  • Our load parameters: We opted to simulate 25 runs across 30 threads, yielding 750 total queries.
  • Docker containers for CedarDB, TimescaleDB, and PostgreSQL

Here’s the DDL for track_plays, in TimescaleDB. A couple of things to note:

  • The table uses TimescaleDB’s hypertable feature, described here, to partition the data by the time column using hourly partitions (our generated data covered a single day). This allows queries to target only the relevant chunks, drastically improving performance over a standard, unpartitioned table, especially for large time-series datasets.
  • The DDL provided to us included these four indexes, so we included these for Postgres and TimescaleDB.
CREATE TABLE public.track_plays
(
 "time" timestamp with time zone NOT NULL,
 user_id text NOT NULL,
 session_id text NOT NULL,
 track_id text NOT NULL,
 duration_sec integer NOT NULL,
 source TEXT DEFAULT 'default_source'::TEXT NOT NULL,
 end_time timestamp with time zone
)
WITH
(
 timescaledb.hypertable,
 timescaledb.partition_column='time',
 timescaledb.chunk_interval='1 hour'
);

CREATE INDEX ON public.track_plays USING btree (end_time);
CREATE INDEX ON public.track_plays USING btree (session_id, "time" DESC);
CREATE INDEX ON public.track_plays USING btree ("time", end_time, duration_sec);
CREATE INDEX ON public.track_plays USING btree ("time", user_id, end_time, duration_sec);

In PostgreSQL, our DDL was similar, but didn’t contain the WITH ... portion defining the hypertable.

Finally, in CedarDB, the DDL from Postgres reduces to just this (note the absence of indexes):

CREATE TABLE public.track_plays
(
 "time" timestamp with time zone NOT NULL,
 user_id text NOT NULL,
 session_id text NOT NULL,
 track_id text NOT NULL,
 duration_sec integer NOT NULL,
 source TEXT DEFAULT 'default_source'::TEXT NOT NULL,
 end_time timestamp with time zone
);

This is a sample row of generated data:

postgres=# select * from track_plays order by random() limit 1;
 time | user_id | session_id | track_id | duration_sec | source | end_time
------------------------+--------------------+------------------+-------------+--------------+--------+------------------------
 2025-11-01 11:40:35+00 | fransisco.jolliffe | 6d473b55db1c71cb | track_85767 | 395 | music | 2025-11-01 11:47:10+00
(1 row)

The Bash script that drove the test runs:

#!/bin/bash

# This will print (-runs x -threads) lines of output; e.g. for 25 runs and 30 threads, there will be 750 lines of output

if [ -z ${DB_URL+x} ]
then
 echo " DB_URL must be set"
 echo ""
 echo " Example: export DB_URL=\"postgresql://postgres:postgres@localhost:5432/postgres?sslmode=require\""
 echo ""
 exit 1
fi

# Build Go app if the EXE isn't present here
[ -x ./go-pgx-tester ] || { go mod tidy && go build . ; }

# Obtain initial and final time bounds from generated data
t0=$( psql "$DB_URL" -c "\\timing off" -tAc "select min("time") from track_plays;" | tail -1 | perl -ne 's/ /T/; s/\+00$/Z/; print;' )
t1=$( psql "$DB_URL" -c "\\timing off" -tAc "select max("time") from track_plays;" | tail -1 | perl -ne 's/ /T/; s/\+00$/Z/; print;' )

# List of users
users=$( psql "$DB_URL" -c "\\timing off" -tAc "with a as (select distinct(user_id) from track_plays) select string_agg(user_id, ',') from a;" | tail -1 )

./go-pgx-tester \
 -db "$DB_URL" \
 -start $t0 \
 -end $t1 \
 -max-dur 450 \
 -min-dur 30 \
 -runs 25 \
 -sleep 100us \
 -threads 30 \
 -users $users

# Print which DB ran the test
psql "$DB_URL" -c "\\timing off" -tAc "select version();" | tail -1

# Print number of rows in the table
n_rows=$( psql "$DB_URL" -c "\\timing off" -tAc "select count(*) from track_plays;" | tail -1 )
echo "Row count: $n_rows"

Here’s the progression of data generation used to create the graphs:

$ ./data_gen_track_plays_sleep_states_duration.py --users 100 --sessions-per-user 15 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only

$ ./data_gen_track_plays_sleep_states_duration.py --users 100 --sessions-per-user 30 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only

$ ./data_gen_track_plays_sleep_states_duration.py --users 100 --sessions-per-user 60 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only

$ ./data_gen_track_plays_sleep_states_duration.py --users 140 --sessions-per-user 75 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only

$ ./data_gen_track_plays_sleep_states_duration.py --users 281 --sessions-per-user 75 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only

$ ./data_gen_track_plays_sleep_states_duration.py --users 570 --sessions-per-user 75 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only

Data loading, TimescaleDB:

$ psql "postgresql://postgres:postgres@localhost:15432/postgres?sslmode=disable" -c "COPY track_plays FROM stdin DELIMITER ',' NULL '\\N' ESCAPE '\' HEADER CSV" < track_plays.csv
Timing is on.
COPY 6220985
Time: 54162.184 ms (00:54.162)

PostgreSQL:

$ psql "postgresql://postgres:postgres@localhost:15432/postgres?sslmode=disable" -c "COPY track_plays FROM stdin DELIMITER ',' NULL '\\N' ESCAPE '\' HEADER CSV" < track_plays.csv
Timing is on.
COPY 6220985
Time: 60835.740 ms (01:00.836)

And CedarDB:

$ psql "postgresql://postgres:postgres@localhost:5432/postgres?sslmode=require" -c "COPY track_plays FROM stdin DELIMITER ',' NULL '\\N' ESCAPE '\' HEADER CSV" < track_plays.csv
Timing is on.
COPY 6220985
Time: 10038.192 ms (00:10.038)

What I noticed: CedarDB’s load time is less than a fifth of the time taken by either TimescaleDB or Postgres, which isn’t surprising given CedarDB doesn’t require any indexes for this workload, so it avoids the cost of maintaining them during the load.

For each of the three DBs, and for each data size, we measured four runs; the average of those values appears in the data table below. Here is an example test run:

What the Test Runs Looked Like:

$ ./run_test.sh
Starting workload: threads=30 runs/thread=25 time=[2025-11-01T00:00:00Z .. 2025-11-02T11:59:55Z] dur=[30..450] users=460 timeout=15s
2025/11/14 10:58:07 [w15] ok run=0 uid=milana.katsma time=[2025-11-01T06:51:43Z..2025-11-01T07:17:03Z] dur=[301] count=131 latency=4.547542ms
2025/11/14 10:58:07 [w22] ok run=0 uid=robert.gottreich time=[2025-11-02T07:25:17Z..2025-11-02T08:10:53Z] dur=[38] count=609 latency=4.853083ms
...
2025/11/14 10:58:14 [w06] ok run=24 uid=con.gustafson time=[2025-11-01T08:24:13Z..2025-11-01T08:50:51Z] dur=[274] count=358 latency=614.541µs
2025/11/14 10:58:14 [w15] ok run=24 uid=matute.gluck time=[2025-11-02T02:22:44Z..2025-11-02T02:44:23Z] dur=[59] count=507 latency=580.792µs
Done in 130.279042ms
PostgreSQL 16.3 compatible CedarDB v2025-11-06
Row count: 6220985

And the resulting data, from which the graphs were created:

N_Rows Postgres (ms) TimescaleDB (ms) CedarDB (ms)
218,380 248 128 127
436,700 523 183 130
873,437 853 301 133
1,528,447 1267 488 136
3,067,366 6932 1312 127
6,220,985 15180 2847 132

November 24, 2025

Protect sensitive data with dynamic data masking for Amazon Aurora PostgreSQL

Today, we are launching the dynamic data masking feature for Amazon Aurora PostgreSQL-Compatible Edition. In this post, we show how dynamic data masking can help you meet data privacy requirements. We discuss how this feature is implemented and demonstrate how it works with the PostgreSQL role hierarchy.

What does writeConcern: {w: 1} really mean? Isolation and Durability

In MongoDB, a write concern of w:1 indicates that a write operation is considered successful once the primary node acknowledges it, without waiting for the data to be replicated to secondary nodes. While this reduces latency, it also introduces the risk that if the primary fails before replication occurs, the written data could be lost. In replica sets with multiple voters, such writes can be rolled back if a failure happens before a majority acknowledges the change.

This is not the default setting. Most clusters (Primary-Secondary-Secondary) use an implicit w:majority write concern, which ensures durability in the event of a zone failure. The implicit default write concern is w:1 only when an arbiter is present (Primary-Secondary-Arbiter) or when the topology lowers the number of data-bearing voters.

For performance reasons, you may sometimes write with w:1. However, it's important to understand the consequences this setting might have in certain failure scenarios. To clarify, here is an example.

I started a three-node replica set using Docker:

docker network create lab

docker run -d --network lab --name m1 --hostname m1 mongo --bind_ip_all --replSet rs

docker run -d --network lab --name m2 --hostname m2 mongo --bind_ip_all --replSet rs

docker run -d --network lab --name m3 --hostname m3 mongo --bind_ip_all --replSet rs

docker exec -it m1 mongosh --host m1 --eval '
 rs.initiate( {_id: "rs", members: [
  {_id: 0, priority: 3, host: "m1:27017"},
  {_id: 1, priority: 2, host: "m2:27017"},
  {_id: 2, priority: 1, host: "m3:27017"}]
 });
'

I created a collection with one "old" document:

docker exec -it m1 mongosh --host m1 --eval '
  db.myCollection.drop();
  db.myCollection.insertOne(
   {name: "old"},
   {writeConcern: {w: "majority", wtimeout: 15000}}
);
'

I checked that the document is there:

docker exec -it m1 mongosh --host m1 --eval 'db.myCollection.find()'

[ { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' } ]

I disconnected two nodes, so I no longer had a majority. However, I quickly inserted a new document before the primary stepped down and became a secondary that would not accept new writes:

docker network disconnect lab m2

docker network disconnect lab m3

docker exec -it m1 mongosh --host m1 --eval '
  db.myCollection.insertOne(
   {name: "new"},
   {writeConcern: {w: "1", wtimeout: 15000}}
);
'

Note the use of writeConcern: {w: "1"} to explicitly reduce consistency. Without this setting, the default is "majority". In that case, the write operation would have waited until a timeout, allowing the application to recognize that durability could not be guaranteed and that the write was unsuccessful.

With writeConcern: {w: 1}, the operation was acknowledged and the data became visible:

docker exec -it m1 mongosh --host m1 --eval 'db.myCollection.find()'

[
  { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' },
  { _id: ObjectId('691dfa0ff09d463d36fa3350'), name: 'new' }
]

Keep in mind that this is visible when using the default 'local' read concern, but not when using 'majority':

docker exec -it m1 mongosh --host m1 --eval '
db.myCollection.find().readConcern("majority")
'

[
  { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' }
]

I checked the Oplog to confirm that the idempotent version of my change was present:

docker exec -it m1 mongosh --host m1 local --eval '
  db.oplog.rs
    .find({ns:"test.myCollection"},{op:1, o:1, t:1})
    .sort({ ts: -1 });
'

[
  {
    op: 'i',
    o: { _id: ObjectId('691dfa0ff09d463d36fa3350'), name: 'new' },
    t: Long('1')
  },
  {
    op: 'i',
    o: { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' },
    t: Long('1')
  }
]

The primary node accepted w:1 writes only briefly, during the interval between losing quorum and stepping down. Afterwards, it automatically switched to SECONDARY, and since no quorum was present, there was no PRIMARY. This state can persist for some time:

docker exec -it m1 mongosh --host m1 --eval '
 rs.status().members
'

[
  {
    _id: 0,
    name: 'm1:27017',
    health: 1,
    state: 2,
    stateStr: 'SECONDARY',
    uptime: 1172,
    optime: { ts: Timestamp({ t: 1763572239, i: 1 }), t: Long('1') },
    optimeDate: ISODate('2025-11-19T17:10:39.000Z'),
    optimeWritten: { ts: Timestamp({ t: 1763572239, i: 1 }), t: Long('1') },
    optimeWrittenDate: ISODate('2025-11-19T17:10:39.000Z'),
    lastAppliedWallTime: ISODate('2025-11-19T17:10:39.685Z'),
    lastDurableWallTime: ISODate('2025-11-19T17:10:39.685Z'),
    lastWrittenWallTime: ISODate('2025-11-19T17:10:39.685Z'),
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: '',
    configVersion: 1,
    configTerm: 1,
    self: true,
    lastHeartbeatMessage: ''
  },
  {
    _id: 1,
    name: 'm2:27017',
    health: 0,
    state: 8,
    stateStr: '(not reachable/healthy)',
    uptime: 0,
    optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeWritten: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeDate: ISODate('1970-01-01T00:00:00.000Z'),
    optimeDurableDate: ISODate('1970-01-01T00:00:00.000Z'),
    optimeWrittenDate: ISODate('1970-01-01T00:00:00.000Z'),
    lastAppliedWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastDurableWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastWrittenWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastHeartbeat: ISODate('2025-11-19T17:26:03.626Z'),
    lastHeartbeatRecv: ISODate('2025-11-19T17:10:37.153Z'),
    pingMs: Long('0'),
    lastHeartbeatMessage: 'Error connecting to m2:27017 :: caused by :: Could not find address for m2:27017: SocketException: onInvoke :: caused by :: Host not found (authoritative)',
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: '',
    configVersion: 1,
    configTerm: 1
  },
  {
    _id: 2,
    name: 'm3:27017',
    health: 0,
    state: 8,
    stateStr: '(not reachable/healthy)',
    uptime: 0,
    optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeWritten: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeDate: ISODate('1970-01-01T00:00:00.000Z'),
    optimeDurableDate: ISODate('1970-01-01T00:00:00.000Z'),
    optimeWrittenDate: ISODate('1970-01-01T00:00:00.000Z'),
    lastAppliedWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastDurableWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastWrittenWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastHeartbeat: ISODate('2025-11-19T17:26:03.202Z'),
    lastHeartbeatRecv: ISODate('2025-11-19T17:10:37.153Z'),
    pingMs: Long('0'),
    lastHeartbeatMessage: 'Error connecting to m3:27017 :: caused by :: Could not find address for m3:27017: SocketException: onInvoke :: caused by :: Host not found (authoritative)',
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: '',
    configVersion: 1,
    configTerm: 1
  }
]

When there is no primary, no further writes are accepted—even if you set writeConcern: {w: "1"}:

docker exec -it m1 mongosh --host m1 --eval '
  db.myCollection.insertOne(
   {name: "new"},
   {writeConcern: {w: "1", wtimeout: 15000}}
);
'

MongoServerError: not primary

The system may remain in this state for some time. When at least one sync replica comes back online, it will pull the Oplog and synchronize the write to the quorum, making the acknowledged write durable.

Using writeConcern: {w: "1"} boosts performance, as the primary doesn't wait for acknowledgments from other nodes. This write concern tolerates a single node failure since quorum remains, and can even withstand another brief failure. However, if a failure persists, no further writes are accepted, which limits the number of writes at risk of being lost. Usually, when a node recovers, it synchronizes via the Oplog, and the primary resumes accepting writes.

In the common scenario where brief, transient failures may occur, using writeConcern: {w: "1"} means the database remains available if the failure is just a momentary glitch. However, the point here is to illustrate the worst-case scenario. If one node accepts a write that is not acknowledged by any other node, and this node fails before any others recover, that write may be lost.

To illustrate this possible scenario, I first disconnected this node and then proceeded to connect the remaining ones:

docker network disconnect lab m1
docker network    connect lab m2
docker network    connect lab m3

In this worst-case scenario, a new quorum is formed with a state that predates when the write could be synchronized to the replicas. However, progress continues because a new primary is established:

docker exec -it m2 mongosh --host m2 --eval '
 rs.status().members
'

[
  {
    _id: 0,
    name: 'm1:27017',
    health: 0,
    state: 8,
    stateStr: '(not reachable/healthy)',
    uptime: 0,
    optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeWritten: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
    optimeDate: ISODate('1970-01-01T00:00:00.000Z'),
    optimeDurableDate: ISODate('1970-01-01T00:00:00.000Z'),
    optimeWrittenDate: ISODate('1970-01-01T00:00:00.000Z'),
    lastAppliedWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastDurableWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastWrittenWallTime: ISODate('2025-11-19T17:10:34.194Z'),
    lastHeartbeat: ISODate('2025-11-19T17:39:02.913Z'),
    lastHeartbeatRecv: ISODate('2025-11-19T17:10:38.153Z'),
    pingMs: Long('0'),
    lastHeartbeatMessage: 'Error connecting to m1:27017 :: caused by :: Could not find address for m1:27017: SocketException: onInvoke :: caused by :: Host not found (authoritative)',
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: '',
    configVersion: 1,
    configTerm: 1
  },
  {
    _id: 1,
    name: 'm2:27017',
    health: 1,
    state: 1,
    stateStr: 'PRIMARY',
    uptime: 1952,
    optime: { ts: Timestamp({ t: 1763573936, i: 1 }), t: Long('2') },
    optimeDate: ISODate('2025-11-19T17:38:56.000Z'),
    optimeWritten: { ts: Timestamp({ t: 1763573936, i: 1 }), t: Long('2') },
    optimeWrittenDate: ISODate('2025-11-19T17:38:56.000Z'),
    lastAppliedWallTime: ISODate('2025-11-19T17:38:56.678Z'),
    lastDurableWallTime: ISODate('2025-11-19T17:38:56.678Z'),
    lastWrittenWallTime: ISODate('2025-11-19T17:38:56.678Z'),
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: 'Could not find member to sync from',
    electionTime: Timestamp({ t: 1763573886, i: 1 }),
    electionDate: ISODate('2025-11-19T17:38:06.000Z'),
    configVersion: 1,
    configTerm: 2,
    self: true,
    lastHeartbeatMessage: ''
  },
  {
    _id: 2,
    name: 'm3:27017',
    health: 1,
    state: 2,
    stateStr: 'SECONDARY',
    uptime: 58,
    optime: { ts: Timestamp({ t: 1763573936, i: 1 }), t: Long('2') },
    optimeDurable: { ts: Timestamp({ t: 1763573936, i: 1 }), t: Long('2') },
    optimeWritten: { ts: Timestamp({ t: 1763573936, i: 1 }), t: Long('2') },
    optimeDate: ISODate('2025-11-19T17:38:56.000Z'),
    optimeDurableDate: ISODate('2025-11-19T17:38:56.000Z'),
    optimeWrittenDate: ISODate('2025-11-19T17:38:56.000Z'),
    lastAppliedWallTime: ISODate('2025-11-19T17:38:56.678Z'),
    lastDurableWallTime: ISODate('2025-11-19T17:38:56.678Z'),
    lastWrittenWallTime: ISODate('2025-11-19T17:38:56.678Z'),
    lastHeartbeat: ISODate('2025-11-19T17:39:02.679Z'),
    lastHeartbeatRecv: ISODate('2025-11-19T17:39:01.178Z'),
    pingMs: Long('0'),
    lastHeartbeatMessage: '',
    syncSourceHost: 'm2:27017',
    syncSourceId: 1,
    infoMessage: '',
    configVersion: 1,
    configTerm: 2
  }
]

This replica set has a primary and is accepting new writes with a new Raft term (configTerm: 2). However, during recovery, it ignored a pending write from the previous term (configTerm: 1) that originated from an unreachable node.

The write made with w:1 after the quorum was lost, but before the primary stepped down, is missing from the new primary:

docker exec -it m2 mongosh --host m2 --eval '
  db.myCollection.find()
'

[ { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' } ]

After reconnecting, the first node enters recovery mode and synchronizes with the other nodes, all of which are on term 2:

docker network    connect lab m1

docker exec -it m1 mongosh --host m1 --eval '
 db.myCollection.find()
'

MongoServerError: Oplog collection reads are not allowed while in the rollback or startup state.

The rollback process employs the 'Recover To A Timestamp' algorithm to restore the node to the highest majority-committed point. While rolling back, the node transitions to the ROLLBACK state, suspends user operations, finds the common point with the sync source, and recovers to the stable timestamp.

After recovery, changes made in term 1 that did not receive quorum acknowledgment are truncated from the Oplog. This behavior is an extension to the standard Raft algorithm:

docker exec -it m1 mongosh --host m1 local --eval '
  db.oplog.rs
    .find({ns:"test.myCollection"},{op:1, o:1, t:1})
    .sort({ ts: -1 });
'

[
  {
    op: 'i',
    o: { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' },
    t: Long('1')
  }
]

A w:1 write that was visible at one point, and acknowledged to the client, but never actually committed by the quorum, has now disappeared:

docker exec -it m1 mongosh --host m1 --eval '
  db.myCollection.find()
'

[ { _id: ObjectId('691df945727482ee30fa3350'), name: 'old' } ]

With writeConcern: {w: 1}, the developer must be aware that such an issue can arise if a write occurs immediately after quorum is lost and the primary fails before the other nodes recover.

While SQL databases typically abstract physical concerns such as persistence and replication, MongoDB shifts more responsibility to developers. By default, acknowledged writes are considered durable only once a majority of nodes confirm they are synced to disk.

In some cases, strict write guarantees are unnecessary and can be relaxed for better performance. Developers can adjust the write concern to suit their application's needs. When using writeConcern: {w: 1}, this affects two aspects of ACID (a short driver-level sketch follows the list):

  • Durability: If there is a failure impacting both the primary and replicas, and only replicas recover, writes not acknowledged by replicas may be rolled back—similar to PostgreSQL's synchronous_commit = local.
  • Isolation: Reads with 'local' concern might not see writes confirmed to the client until these are acknowledged by a majority. There is no PostgreSQL equivalent to MongoDB's 'majority' read concern (MVCC visibility tracking what was applied on the replicas).
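
For reference, here is a minimal PyMongo sketch of the same knobs; the connection string, database, and collection names are placeholders for the replica set used in this post.

# Minimal PyMongo sketch of per-operation write and read concerns (placeholders).
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://m1:27017/?replicaSet=rs")
coll = client.test.myCollection

# w=1: acknowledged by the primary alone -- fast, but can be rolled back as shown above.
fast = coll.with_options(write_concern=WriteConcern(w=1, wtimeout=15000))
fast.insert_one({"name": "new"})

# Majority write concern: waits until a majority of voting members have the write.
durable = coll.with_options(write_concern=WriteConcern(w="majority", wtimeout=15000))
durable.insert_one({"name": "durable"})

# Majority read concern: only returns what a majority has acknowledged, so it
# will not show at-risk w=1 writes.
majority_reads = coll.with_options(read_concern=ReadConcern("majority"))
print(list(majority_reads.find({}, {"_id": 0})))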

Although writeConcern: {w: 1} is sometimes described as permitting 'dirty reads', the term is misleading because it is also used as a synonym for 'read uncommitted' in relational databases. In SQL databases with a single read-write instance, 'read uncommitted' refers to single-server isolation (the I in ACID). However, with writeConcern: {w: 1} and a 'majority' read concern, uncommitted reads do not occur and only committed changes are visible to other sessions. The real challenge involves durability (the D in ACID) in the context of a replica set. With traditional SQL database replication, writes might be visible before all peers (replica, WAL, application) have fully acknowledged them, since there is no single atomic operation covering them all. MongoDB's w:1 is similar, and calling it a 'dirty read' is useful mainly to highlight these implications for developers.

Analyzing the Heartbeat of the MySQL Server: A Look at Repository Statistics

The MySQL database server is a foundational component of the open-source world. While its impact is undeniable, looking at the raw statistics of its core source code repository reveals a dynamic and sometimes surprising development history. By analyzing the total lines of code inserted, the number of commits over the years, and unique contributors, we […]

November 23, 2025

Data Locality vs. Independence: Which Should Your Database Prioritize?

Understand how the principle of "store together what is accessed together" is a game-changer for database performance and scalability.

When your application needs several pieces of data at once, the fastest approach is to read them from a single location in a single call. In a document database, developers can decide what is stored together, both logically and physically.

Fragmentation has never been beneficial for performance. In databases, the proximity of data — on disk, in memory or across the network — is crucial for scalability. Keeping related data together allows a single operation to fetch everything needed, reducing disk I/O, memory cache misses and network round-trips, thereby making performance more predictable.

The principle “store together what is accessed together” is central to modeling in document databases. Yet its purpose is to allow developers to control the physical storage layout, even with flexible data structures.

In contrast, SQL databases were designed for data independence — allowing users to interact with a logical model separate from the physical implementation managed by a database administrator.

Today, the trend is not to separate development and operations, allowing faster development cycles without the complexity of coordinating multiple teams or shared schemas. Avoiding the separation into logical and physical models further simplifies the process.

Understanding the core principle of data locality is essential today, especially as many databases emulate document databases or offer similar syntax on top of SQL. To qualify as a document database, it’s not enough to accept JSON documents with a developer-friendly syntax.

The database must also preserve those documents intact in storage so that accessing them has predictable performance. Whether they expose a relational or document API, it is essential to know if your objective is data independence or data locality.

Why Locality Still Matters in Modern Infrastructure

Modern hardware still suffers from penalties for scattered access. Hard disk drives (HDDs) highlighted the importance of locality because seek and rotational latency are more impactful than transfer speed, especially for online transaction processing (OLTP) workloads.

While solid state drives (SSDs) remove mechanical delays, random writes remain expensive, and cloud storage adds latency due to network access to storage. Even in-memory access isn’t immune: on multisocket servers, non-uniform memory access (NUMA) causes varying access times depending on where the data was loaded into memory by the first access, relative to the CPU core that processes it later.

Scale-out architecture further increases complexity. Vertical scaling — keeping all reads and writes on a single instance with shared disks and memory — has capacity limits. Large instances are expensive, and scaling them down or up often requires downtime, which is risky for always-on applications.

For example, you might need your maximum instance size for Black Friday but would have to scale up progressively in the lead-up, incurring downtime as usage increases. Without horizontal scalability, you end up provisioning well above your average load “just in case,” as in on-premises infrastructures sized years in advance for occasional peaks — something that can be prohibitively costly in the cloud.

Horizontal scaling allows adding or removing nodes without downtime. However, more nodes increase the likelihood of distributed queries, in which operations that once hit local memory must now traverse the network, introducing unpredictable latency. Data locality becomes critical with scale-out databases.

To create scalable database applications, developers should understand storage organization and prioritize single-document operations for performance-critical transactions. CRUD functions (insert, find, update, delete) targeting a single document in MongoDB are always handled by a single node, even in a sharded deployment. If that document isn’t in memory, it can be read from disk in a single I/O operation. Modifications are applied to the in-memory copy and written back as a single document during asynchronous checkpoints, avoiding on-disk fragmentation.

In MongoDB, the WiredTiger storage engine stores each document’s fields together in contiguous storage blocks, allowing developers to follow the principle “store together what is accessed together.” By avoiding cross-document joins, such as the $lookup operation in queries, this design helps prevent scatter-gather operations internally, which promotes consistent performance. This supports predictable performance regardless of document size, update frequency or cluster scale.
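
To make the contrast concrete, here is an illustrative PyMongo sketch with made-up collection names and data: the embedded layout answers the question with one single-document read, while a normalized layout needs a $lookup join resolved at query time.

# Illustrative only: embedded order vs. a normalized layout joined with $lookup.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.shop

# Embedded: the order and its line items live in one document, fetched in one call.
db.orders.insert_one({
    "_id": "o1",
    "customer_id": "c1",
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
})
order = db.orders.find_one({"_id": "o1"})   # single-document read

# Normalized alternative: items in a separate collection require a $lookup,
# i.e., a cross-document join at query time.
db.order_items.insert_many([
    {"order_id": "o1", "sku": "A", "qty": 2},
    {"order_id": "o1", "sku": "B", "qty": 1},
])
joined = list(db.orders.aggregate([
    {"$match": {"_id": "o1"}},
    {"$lookup": {
        "from": "order_items",
        "localField": "_id",
        "foreignField": "order_id",
        "as": "items_joined",
    }},
]))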

The Relational Promise: Physical Data Independence

For developers working with NoSQL databases, what I exposed above seems obvious: There is one single data model — the domain model — defined in the application, and the database stores exactly that model.

The MongoDB data modeling workshop defines a database schema as the physical model that describes how the data is organized in the database. In relational databases, the logical model is typically independent of the physical storage model, regardless of the data type used, because they serve different purposes.

SQL developers work with a relational model that is mapped to their object model via object relational mapping (ORM) tooling or hand-coded SQL joins. The models and schemas are normalized for generality, not necessarily optimized for specific application access patterns.

The goal of the relational model was to serve online interactive use by non-programmers and casual users by providing an abstraction that hides physical concerns. This includes avoiding data anomalies through normalization and enabling declarative query access without procedural code. Physical optimizations, like indexes, are considered implementation details. You will not find CREATE INDEX in the SQL standard.

In practice, a SQL query planner chooses access paths based on statistics. When writing JOIN clauses, the order of tables in the FROM clause should not matter. The SQL query planner reorders based on cost estimates. The database guarantees logical consistency, at least in theory, even with concurrent users and internal replication. The SQL approach is database-centric: rules, constraints and transactional guarantees are defined in the relational database, independent of specific use cases or table sizes.

Today, most relational databases sit behind applications. End users rarely interact with them directly, except in analytical or data science contexts. Applications can enforce data integrity and handle code anomalies, and developers understand data structures and algorithms. Nonetheless, relational database experts still advise keeping constraints, stored procedures, transactions, and joins within the database.

The physical storage remains abstracted — indexes, clustering, and partitions are administrator-level, not application-level, concepts, as if the application developers were like the non-programmer casual users described in the early papers about relational databases.

How Codd’s Rules Apply to SQL/JSON Documents

Because data locality matters, some relational databases have mechanisms to enforce it internally. For example, Oracle has long supported “clustered tables” for co-locating related rows from multiple tables, and more recently offers a choice for JSON storage as either binary JSON (OSON, Oracle’s native binary JSON) or decomposed relational rows (JSON-relational duality views). However, those physical attributes are declared and deployed in the database using a specific data definition language (DDL) and are not exposed to the application developers. This reflects Codd’s “independence” rules:

  • Rule 8: Physical data independence
  • Rule 9: Logical data independence
  • Rule 10: Integrity independence
  • Rule 11: Distribution independence

Rules 8 and 11 relate directly to data locality: The user is not supposed to care whether data is physically together or distributed. The database is opened to users who ignore the physical data model, access paths and algorithms. Developers do not know what is replicated, sharded or distributed across multiple data centers.

Where the SQL Abstraction Begins to Weaken

In practice, no relational database perfectly achieves these rules. Performance tuning often requires looking at execution plans and physical data layouts. Serializable isolation is rarely used due to scalability limitations of two-phase locking, leading developers to fall back to weaker isolation levels or to explicit locking (SELECT ... FOR UPDATE). Physical co-location mechanisms — hash clusters, attribute clustering — exist, but are difficult to size and maintain optimally without precise knowledge of access patterns. They often require regular data reorganization as updates can fragment it again.

The normalized model is inherently application-agnostic, so optimizing for locality often means breaking data independence (denormalizing, maintaining materialized views, accepting stale reads from replicas, disabling referential integrity). With sharding, constraints like foreign keys and unique indexes generally cannot be enforced across shards. Transactions must be carefully ordered to avoid long waits and deadlocks. Even with an abstraction layer, applications must be aware of the physical distribution for some operations.

The NoSQL Approach: Modeling for Access Patterns

As data volumes and latency expectations grow, a different paradigm has emerged: give developers complete control rather than an abstraction with some exceptions.

NoSQL databases adopt an application-first approach: The physical model matches the access patterns, and the responsibility for maintaining integrity and transactional scope is pushed to the application. Initially, many NoSQL stores delegated all responsibility, including consistency, to developers, acting as “dumb” key-value or document stores. Most lacked ACID (atomicity, consistency, isolation and durability) transactions or query planners. If secondary indexes were present, they needed to be queried explicitly.

This NoSQL approach was the opposite of the relational database world: Instead of one shared, normalized database, there were many purpose-built data stores per application. It reduces the performance and scalability surprises, but at the price of more complexity.

MongoDB’s Middle Road for Flexible Schemas

MongoDB evolved by adding essential relational database capabilities — indexes, query planning, multidocument ACID transactions — while keeping the application-first document model. When you insert a document, it is stored as a single unit.

In WiredTiger, the MongoDB storage engine, BSON documents (binary JSON with additional datatypes and indexing capabilities) are stored in B-trees with variable-sized leaf pages, allowing large documents to remain contiguous, which differs from the fixed-size page structures used by many relational databases. This avoids splitting a business object across multiple blocks and ensures consistent latency for operations that appear as a single operation to developers.

Updates in MongoDB are applied in memory. Committing them as in-place changes on disk would fragment pages. Instead, WiredTiger uses reconciliation to write a complete new version at checkpoints — similar to copy-on-write filesystems, but with a flexible block size. This may cause write amplification, but preserves document locality. With appropriately sized instances, these writes occur in the background and do not affect in-memory write latency.

Locality defined at the application’s document schema flows all the way down to the storage layer, something that relational database engines typically cannot match with their goal of physical data independence.

How Data Locality Improves Application Performance

Designing for locality simplifies development and operations in several ways:

  • Transactions: A business change affecting a single aggregate (in the domain-driven design sense) becomes a single atomic read–modify–write on one document — no multiple roundtrips like BEGIN, SELECT ... FOR UPDATE, multiple updates and COMMIT.
  • Queries and indexing: Related data in one document avoids SQL joins and ORM lazy/eager mapping. A single compound index can cover filters and projections across fields that would otherwise be in separate tables, ensuring predictable plans without join-order uncertainty.
  • Development: The same domain model in the application is used directly as the database schema. Developers can reason about access patterns without mapping to a separate model, making latency and plan stability predictable.
  • Scalability: Most operations targeting a single aggregate, with shard keys chosen accordingly, can be routed to one node, avoiding scatter–gather fan-out for critical use cases.

MongoDB’s optimistic concurrency control avoids locks, though it requires retry logic on write conflict errors. For single-document calls, retries are handled transparently by the database, which has a complete view of the transaction intent, making them simpler and faster.

Embedding vs. Referencing in Document Data Modeling

Locality doesn’t mean “embed everything.” It means: Embed what you consistently access together. Bounded one-to-many relationships (such as an order and its line items) are candidates for embedding. Rarely updated references and dimensions can also be duplicated and embedded. High-cardinality or unbounded-growth relationships, or independently updated entities, are better represented as separate documents and can be co-located via shard keys.

MongoDB’s compound and multikey indexes support embedded fields, maintaining predictable, selective access without joins. Embedding within the same document is the only way to guarantee co-location at the block level. Multiple documents in a single collection are not stored close together, except for small documents inserted at the same time, as they might share the same block. In sharding, the shard key ensures co-location on the same node but not within the same block.

In MongoDB, locality is an explicit design choice in domain-driven design (a short PyMongo sketch follows the list):

  • Identify aggregates that change and are read together.
  • Store them in one document when appropriate.
  • Use indexes aligned with access paths.
  • Choose shard keys so related operations route to one node.
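
A minimal PyMongo sketch of those four steps follows; the database, collection, fields, and shard key are illustrative, and the sharding commands assume a connection to mongos in a sharded cluster (they will error against a standalone server).

# Illustrative aggregate design: embed the order, index the access path, shard by customer.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.shop

# 1./2. The aggregate that changes and is read together, stored as one document.
db.orders.insert_one({
    "customer_id": "c1",
    "status": "open",
    "items": [{"sku": "A", "qty": 2, "price": 9.5}],
    "total": 19.0,
})

# 3. Indexes aligned with the access paths, including an embedded field.
db.orders.create_index([("customer_id", ASCENDING)])   # also supports the shard key
db.orders.create_index([("customer_id", ASCENDING), ("items.sku", ASCENDING)])

# 4. Shard key chosen so one customer's orders route to a single shard.
client.admin.command("enableSharding", "shop")
client.admin.command("shardCollection", "shop.orders", key={"customer_id": 1})

# A business change on the aggregate is one atomic read-modify-write on one document.
db.orders.update_one(
    {"customer_id": "c1", "status": "open"},
    {"$push": {"items": {"sku": "B", "qty": 1, "price": 9.5}},
     "$inc": {"total": 9.5}},
)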

What MongoDB Emulations Miss About Locality

Given the popularity of the document model, some cloud services offer MongoDB-like APIs on top of SQL databases. These systems may expose a MongoDB-like API while retaining a relational storage model, which typically does not maintain the same level of physical locality.

Relational databases store rows in fixed-size blocks (often 8 KB). Large documents must be split across multiple blocks. Here are some examples in popular SQL databases (a small inspection sketch for the PostgreSQL case follows the list):

  • PostgreSQL JSONB: Stores JSON in heap tables and large documents in many chunks, using TOAST, the oversized attribute storage technique. The document is compressed and split into chunks stored in another table, accessed via an index. Reading a large document is like a nested loop join between the row and its TOAST table.
  • Oracle JSON-Relational Duality Views: Map JSON documents to relational tables, preserving data independence rather than physical locality. Elements accessed together may be scattered across blocks, requiring internal joins, multiple I/Os and possibly network calls in distributed setups.
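
As a small check of the PostgreSQL point above, you can inspect the TOAST relation that backs a JSONB-heavy table; this sketch assumes psycopg2 and a hypothetical docs table with a jsonb column.

# Hypothetical example: show the TOAST table and its size for a 'docs' table.
import psycopg2

conn = psycopg2.connect("dbname=postgres")   # adjust the DSN for your setup
cur = conn.cursor()
cur.execute("""
    SELECT c.relname,
           c.reltoastrelid::regclass AS toast_table,
           pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
           pg_size_pretty(pg_relation_size(NULLIF(c.reltoastrelid, 0))) AS toast_size
    FROM pg_class c
    WHERE c.relname = 'docs'
""")
print(cur.fetchone())
# Large jsonb values are compressed and cut into roughly 2 KB chunks in the TOAST
# table, so reading one big document touches the heap row plus its TOAST chunks.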

In both scenarios, the documents are divided into either binary chunks or normalized tables. Although the API resembles MongoDB, it remains a SQL database that lacks data locality. Instead, it provides an abstraction that keeps the developer unaware of internal processes until they inspect the execution plan and understand the database internals.

Conclusion

“Store together what is accessed together” reflects realities across sharding, I/O patterns, transactions, and memory cache efficiency. Relational database engines abstract away physical layout, which works well for centralized, normalized databases serving multiple applications in a single monolithic server. At a larger scale, especially in elastic cloud environments, horizontal sharding is essential — and often incompatible with pure data independence. Developers must account for locality.

In SQL databases, this means denormalizing, duplicating reference data, and avoiding cross-shard constraints. The document model, when the database truly enforces locality down to the storage layer, offers an alternative to that abstraction and its exceptions.

In MongoDB, locality can be explicitly defined at the application level while still providing indexing, query planning and transactional features. When assessing “MongoDB-compatible” systems built on relational engines, it is helpful to determine whether the engine stores aggregates contiguously on disk and routes them to a single node by design. If not, the performance characteristics may differ from those of a document database that maintains physical locality.

Both approaches are valid. In database-first deployment, developers depend on in-database declarations to ensure performance, working alongside the database administrator and using tools like execution plans for troubleshooting. In contrast, application-first deployment shifts more responsibility to developers, who must validate both the application’s functionality and its performance.