June 11, 2026
The only scalable delete in Postgres is DROP TABLE
June 10, 2026
Automate Amazon Aurora PostgreSQL major or minor version upgrade using AWS Systems Manager and Amazon EC2
How to migrate from Oracle to Amazon Aurora PostgreSQL using AWS CloudFormation (Part 1)
Postgres 19 beta1 vs sysbench on a small server
This has results from sysbench on a small server with Postgres 19 beta1, 18.4 and 17.10. Sysbench is run with low concurrency (1 thread) and a cached database. The purpose is to search for changes in performance, often from new CPU overheads.
tl;dr
- 19beta1, 18.4 and 17.10 have mostly similar performance
- There might be small regressions (about 2%) from 17.10 to 19beta1 but my tests are not good at spotting that.
- 19beta1 is much faster on one test (read-only-count) thanks to a new query plan
The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
The benchmark is run with 1 client, 1 table and 50M rows. The purpose is to search for CPU regressions.
I provide charts below with relative QPS (rQPS). The relative QPS is the following:
(QPS for some version) / (QPS for base version)
I describe performance changes (changes to relative QPS) in terms of basis points. Performance changes by one basis point when the difference in rQPS is 0.01. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.
Results: point queries
- 19beta1 is better than 17.10 by ~3 basis points for most tests
- 19beta1 is slightly better than 18.4
- 19beta1 is worse than 17.10 by ~3 basis points in 4 of 5 tests
- 19beta1 is better than 17.10 by 5 basis points in the scan test
- 19beta1 and 18.4 are similar except for the scan test where 19beta1 did better
- 19beta1 is worse than than 17.10 on two tests
- 19beta1 is better than 17.10 on five tests
- 19beta1 and 17.10 are the same on one test
- 19beta1 is ~2.5X better than 17.10 on the read-only-count test
- 19beta1 and 18.4 have similar results except for the read-only-count test
Query plans for the read-only-count test ...
For 17.10
For 19beta1
- 19beta1 is worse than 17.10 by 2 to 5 basis points
- 18.4 is worse than 17.10 by 2 to 3 basis points
June 09, 2026
Building type-safe applications with Drizzle ORM in Aurora DSQL
Pagination patterns in Amazon Aurora DSQL
Your MySQL Schema is Already a REST API
Percona Operator for MySQL (PXC) 1.20.0: Automatic Storage Resizing, TLS Certificate Rotation, and ARM64 Support
Percona Operator for MySQL PXC 1.20.0 is out today, and it addresses three long-requested operational headaches: storage that grows on its own before it fills up, TLS certificates that rotate without cluster downtime, and images that run natively on ARM64. Disk-full incidents on PXC clusters often arrive at 2 AM when monitoring alerts fire, and … Continued
The post Percona Operator for MySQL (PXC) 1.20.0: Automatic Storage Resizing, TLS Certificate Rotation, and ARM64 Support appeared first on Percona.
RUM—Storing More in the Index
This series of posts traces the evolution from GIN to RUM to Extended RUM, showing how a single architectural idea—store more in the index to do less work at query time—unlocks major performance improvements at each step.
RUM was created by Alexander Korotkov, Oleg Bartunov, and Teodor Sigaev at Postgres Professional. It started as a direct response to the GIN limitations that users kept hitting in production — particularly in full-text search scenarios where ranking and ordering dominate query time.
The name is a nod: GIN (the drink) → RUM (a stronger drink). The index is stronger, too.
RUM's innovation is exactly one architectural change: each entry in a posting list can carry an additional datum alongside the TID:
GIN posting list entry: [TID]
RUM posting list entry: [TID, addInfo]
A 2016 thread on pgsql-general captures the exact pain point. Andreas Krogh was building a web-based email system — millions of messages, users expecting millisecond response — and wanted a single index that could:
- Match full-text search terms (
fts_all @@ to_tsquery(...)) - Filter by folder (
folder_id = ANY(ARRAY[2,3])) - Sort by timestamp (
ORDER BY sent DESC) - Stop early (
LIMIT 101)
With GIN, steps 1 and 2 worked (via btree_gin), but step 3 always required a separate Sort node — the index couldn't deliver results in timestamp order. The planner had to scan all matching TIDs, fetch every heap tuple, and sort them. For a mailbox with 100K matches, that's unacceptable.
Oleg Bartunov's response was direct: "We are [working] hard on our internal version of rum." RUM was designed to solve exactly this class of problem.
RUM extends the GIN framework with an order_by_attach option:
CREATE INDEX idx ON documents USING rum (tsv rum_tsvector_addon_ops, created_at)
WITH (attach = 'created_at', to = 'tsv', order_by_attach = true);
This tells RUM: "for every TID in the tsv posting list, also store the corresponding created_at value." The posting list entries become:
"postgresql" → [(0,1, 2024-01-15), (3,7, 2024-02-20), (5,2, 2024-03-01)]
Now the posting list is sorted by addInfo (the timestamp), not by the physical order of TIDs.
What RUM Enables Over GIN:
- Ordered results without sorting. The index walks the posting list in
addInfoorder and stops after N entries. No Sort node, no full scan. - Distance-based ordering with
<=>. RUM introduces distance operators (<=>,<=|,|=>). The<=>computesABS(a - b)— nearest-first retrieval. The<=|and|=>variants restrict to one direction. For full-text ranking, RUM's<=>operator has a built-in ranking function that combinests_rankandts_rank_cdsemantics and handles OR queries better than either function alone. - Depth-first traversal: first results immediately. Unlike GIN's bitmap approach (collect all TIDs, then access the heap), RUM performs a depth-first traversal. It can return first results immediately — critical for
LIMITqueries. - Phrase search without recheck. RUM's
rum_tsvector_opsstores word positions asaddInfo. The index verifies phrase adjacency during the scan — no heap rechecks needed.
Here are the operator classes:
-
rum_tsvector_opsstores lexemes with positional information, supports ordering (<=>) by relevance -
rum_tsvector_hash_opsstores hashed lexemes with positions, supports ordering (<=>) with no prefix search -
rum_tsvector_addon_opsstores lexemes + any attached column, supports ordering (<=>) on attached column -
rum_anyarray_opsstores array elements + array length, supports ordering (<=>) by similarity -
rum_anyarray_addon_opsstores array elements + attached column, supports ordering (<=>) on attached column -
rum_timestamp_opsstores scalar values, supports ordering (<=>)
Using the same articles table from the previous post, I add a RUM index that adds the published timestamp to tsv:
postgres=# CREATE EXTENSION IF NOT EXISTS rum;
postgres=# CREATE INDEX idx_rum_addon ON articles
USING rum (tsv rum_tsvector_addon_ops, published)
WITH (attach = 'published', to = 'tsv');
I searched for the five most recent articles by publication date using the same text query as in the previous post.
postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
SELECT id, title, published,
published <=> '2020-06-01'::timestamp AS distance
FROM articles
WHERE tsv @@ to_tsquery('simple', 'postgresql & article')
ORDER BY published <=> '2020-06-01'::timestamp
LIMIT 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (actual time=89.101..89.552 rows=5 loops=1)
Output: id, title, published, ((published <=> '2020-06-01 00:00:00'::timestamp without time zone))
Buffers: shared hit=1233 read=1514, temp read=1303 written=1303
-> Index Scan using idx_rum_addon on documentdb_core.articles (actual time=89.097..89.542 rows=5 loops=1)
Output: id, title, published, (published <=> '2020-06-01 00:00:00'::timestamp without time zone)
Index Cond: (articles.tsv @@ '''postgresql'' & ''article'''::tsquery)
Order By: (articles.published <=> '2020-06-01 00:00:00'::timestamp without time zone)
Buffers: shared hit=1233 read=1514, temp read=1303 written=1303
Planning:
Buffers: shared hit=3
Planning Time: 0.401 ms
Execution Time: 89.747 ms
(12 rows)
No Sort node. No Bitmap. A true Index Scan with Order By pushed into the index. Stops after 5 results.
I add a RUM index on tsv that adds the position, for phrase search without recheck:
postgres=# CREATE INDEX idx_rum_pos ON articles USING rum (tsv rum_tsvector_ops);
postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
SELECT id, title
FROM articles
WHERE tsv @@ to_tsquery('simple', 'postgresql <-> article')
LIMIT 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (actual time=116.196..116.198 rows=0 loops=1)
Output: id, title
Buffers: shared hit=661
-> Index Scan using idx_rum_pos on documentdb_core.articles (actual time=116.192..116.193 rows=0 loops=1)
Output: id, title
Index Cond: (articles.tsv @@ '''postgresql'' <-> ''article'''::tsquery)
Buffers: shared hit=661
Planning:
Buffers: shared hit=3
Planning Time: 0.223 ms
Execution Time: 117.838 ms
(11 rows)
No "Rows Removed by Index Recheck". Positions stored in the index enable direct verification of adjacency. The same idea applies beyond full‑text search, for example with JSONB, where RUM can store element positions to reduce rechecks and improve performance.
The same index is used for relevance ranking:
postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
SELECT id, title,
tsv <=> to_tsquery('english', 'postgresql | optimization') AS rank
FROM articles
WHERE tsv @@ to_tsquery('simple', 'postgresql | article')
ORDER BY tsv <=> to_tsquery('english', 'postgresql | optimization')
LIMIT 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Limit (actual time=543.958..544.006 rows=10 loops=1)
Output: id, title, ((tsv <=> '''postgresql'' | ''optim'''::tsquery))
Buffers: shared hit=783, temp read=13289 written=15244
-> Index Scan using idx_rum_pos on documentdb_core.articles (actual time=543.954..543.977 rows=10 loops=1)
Output: id, title, (tsv <=> '''postgresql'' | ''optim'''::tsquery)
Index Cond: (articles.tsv @@ '''postgresql'' | ''article'''::tsquery)
Order By: (articles.tsv <=> '''postgresql'' | ''optim'''::tsquery)
Buffers: shared hit=783, temp read=13289 written=15244
Planning:
Buffers: shared hit=3
Planning Time: 0.198 ms
Execution Time: 548.884 ms
(12 rows)
Ranking is computed directly during the index scan. No Sort node, no heap access for computing the score, enabling early termination with LIMIT. This contrasts with GIN, where ranking is computed after fetching all matching rows from the heap and sorting them.
Early versions of RUM allowed multi-column indexes, but only the main inverted column participated in the ordered scan (2017 thread). Additional columns were applied as post-filters, which limited the benefit of LIMIT queries. Currently filter columns can participate in the index scan itself, allowing pruning during traversal of the ordered posting list — but ordering is still driven by a single attached column. Here is an example:
postgres=# CREATE INDEX idx_rum_multi ON articles
USING rum (tsv rum_tsvector_addon_ops, category, published)
WITH (attach = 'published', to = 'tsv');
postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
SELECT id, title, published
FROM articles
WHERE tsv @@ to_tsquery('english', 'postgresql')
AND category = 'tech'
ORDER BY published <=> '2020-06-01'::timestamp
LIMIT 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (actual time=50.212..51.269 rows=5 loops=1)
Output: id, title, published, ((published <=> '2020-06-01 00:00:00'::timestamp without time zone))
Buffers: shared hit=117 read=112, temp read=550 written=550
-> Index Scan using idx_rum_multi on documentdb_core.articles (actual time=50.209..51.258 rows=5 loops=1)
Output: id, title, published, (published <=> '2020-06-01 00:00:00'::timestamp without time zone)
Index Cond: ((articles.tsv @@ '''postgresql'''::tsquery) AND (articles.category = 'tech'::text))
Order By: (articles.published <=> '2020-06-01 00:00:00'::timestamp without time zone)
Buffers: shared hit=117 read=112, temp read=550 written=550
Planning:
Buffers: shared hit=37 read=9 dirtied=3
Planning Time: 7.805 ms
Execution Time: 51.667 ms
(12 rows)
Here, category = 'tech' is in Index Cond, not a Filter, so pre-filtering on multiple columns is possible. RUM’s <=> operator uses a ranking model (ts_score) that combines ts_rank and ts_rank_cd, addressing their respective limitations (logical operators and OR queries).
The RUM Data Structure includes addInfo and may be sorted on it instead of TID:
RUM Posting List (leaf page):
┌─────────────────────────────────────────────┐
│ [TID₁, addInfo₁] [TID₂, addInfo₂] ... │
│ sorted by addInfo (when order_by_attach=on) │
│ or by TID (when order_by_attach=off) │
└─────────────────────────────────────────────┘
One drawback of the RUM index is that it has slower build and insert times than GIN. This is because RUM stores additional information alongside keys and utilizes generic Write-Ahead Logging (WAL) records, which can significantly increase WAL volume compared to GIN. It fits workloads where data is written once, like append-only or event-driven use cases, and queried many times later. Unlike GIN, RUM does not implement a pending list for fast updates, which contributes to its higher write and build costs. RUM performs best when dealing with highly repetitive keys, such as those found in natural language text, or denormalized documents. However, for high-cardinality unique keys, like UUIDs, the inverted index structure offers little advantage over a B-tree.
The next post will cover Extended RUM, which enables additional indexing and ordering capabilities in DocumentDB.
GIN: Understanding PostgreSQL's Inverted Index and Its Limitations
This series of posts traces the evolution from GIN to RUM to Extended RUM, showing how a single architectural idea—to store more in the index to do less work at query time—unlocks major performance improvements at each step.
Generalized Inverted Index (GIN) is PostgreSQL's answer to the question "how do I index composite values?" — including arrays, JSONB documents, tsvector, and any data type that decomposes into multiple keys per row.
The structure is conceptually simple:
Entry Tree (B-tree of keys)
├── "alice" → Posting List: [(0,1), (3,7), (5,2)]
├── "bob" → Posting List: [(0,2), (4,1)]
├── "carol" → Posting Tree (when list becomes too large for in-line storage)"
│ └── B-tree of ItemPointers
└── ...
For each indexed value, GIN calls extractValue to decompose it into keys. Each key gets an entry in the B-tree. Each entry points to a posting list — a sorted array of heap TIDs (tuple identifiers) that contain that key.
At query time, GIN calls extractQuery to decompose the search condition into keys, then intersects or unions the posting lists using a consistent function.
What GIN Gets Right:
- Space efficiency: posting lists are compressed (variable-byte encoding), and duplicate keys are stored once.
- Fast containment queries:
@>,<@,@@— any "does this value contain these keys?" pattern is natural. - Extensible operator classes: anyone can plug in new data types via
extractValue/extractQuery/consistent.
Limitations of GIN:
- Posting lists lack ordering info. GIN's lists only contain TIDs sorted by physical location, with no space for extra data.
- Cannot perform ordered scans. GIN doesn’t support
ORDER BYpushdown, and queries with sorting require an explicit sort after the index scan. - No extra payload per TID. Posting lists hold only 6-byte TIDs, without timestamps, scores, or additional info.
- Only bitmap scans are supported, not true index scans. GIN generates and processes bitmaps of matching TIDs.
Implications:
- Ranking needs heap access. For ranked searches like
ts_rank(), GIN finds TIDs but lacks term frequency or position info, so it must fetch and sort matching heap tuples. - Phrase searches need reprocessing. GIN can locate tuples containing specific words, but it cannot confirm if these words are adjacent based only on the index. It identifies possible matches and then rechecks each heap tuple to ensure the words are in sequence. As clarified by Artur Zakirov (Postgres Professional), "GIN supports it, but it requires additional bitmap heap scan and so it is slower."
- No
LIMITpushdown — must scan the entire posting list even for a small number of results. - Cursor-based iteration isn't supported.
I created the following tables that I'll use for the examples in this blog post and the next ones:
CREATE EXTENSION IF NOT EXISTS btree_gin;
DROP TABLE IF EXISTS articles;
CREATE TABLE articles (
id serial PRIMARY KEY,
title text NOT NULL,
body text NOT NULL,
category text NOT NULL,
published timestamp NOT NULL,
score int NOT NULL,
tsv tsvector GENERATED ALWAYS AS (
to_tsvector('simple', title || ' ' || body)
) STORED
);
INSERT INTO articles (title, body, category, published, score)
SELECT
'Article about ' || words[1 + (i % 10)] || ' and ' || words[1 + ((i*7) % 10)],
'This is the body discussing ' || words[1 + ((i*3) % 10)] ||
' in the context of ' || words[1 + ((i*11) % 10)] ||
'. We also mention ' || words[1 + ((i*13) % 10)] || ' here.',
categories[1 + (i % 4)],
'2020-01-01'::timestamp + (i || ' hours')::interval,
(i * 17) % 100
FROM generate_series(1, 1e6) AS s(i),
LATERAL (SELECT ARRAY['postgresql','indexing','performance','database',
'query','optimization','storage','replication',
'vacuum','analytics'] AS words) w,
LATERAL (SELECT ARRAY['tech','science','blog','news'] AS categories) c;
ANALYZE articles;
CREATE INDEX idx_gin_tsv ON articles USING gin (tsv);
Here is a Full-Text Search — Bitmap Only:
postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS)
SELECT id, title, score
FROM articles
WHERE tsv @@ to_tsquery('simple', 'postgresql & article')
LIMIT 10
;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Limit (actual time=24.049..24.104 rows=10 loops=1)
Buffers: shared hit=299 read=6
-> Bitmap Heap Scan on articles (actual time=24.045..24.087 rows=10 loops=1)
Recheck Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
Heap Blocks: exact=6
Buffers: shared hit=299 read=6
-> Bitmap Index Scan on idx_gin_tsv (actual time=16.165..16.165 rows=100000 loops=1)
Index Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
Buffers: shared hit=299
Planning:
Buffers: shared hit=33 read=2 dirtied=7
Planning Time: 0.314 ms
Execution Time: 24.138 ms
(13 rows)
GIN uses Bitmap Index Scan → Bitmap Heap Scan. It never produces a true Index Scan. Even with LIMIT 10, it builds the full bitmap first on one million rows before discarding most of them to return only ten.
An ORDER BY requires a Sort operation:
postgres=# EXPLAIN (COSTS OFF, ANALYZE)
SELECT id, title, published
FROM articles
WHERE tsv @@ to_tsquery('simple', 'postgresql & article')
ORDER BY published DESC
LIMIT 5;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (actual time=1838.940..1853.064 rows=5 loops=1)
-> Gather Merge (actual time=1838.937..1853.054 rows=5 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (actual time=1812.430..1812.435 rows=4 loops=3)
Sort Key: published DESC
Sort Method: top-N heapsort Memory: 26kB
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Bitmap Heap Scan on articles (actual time=113.247..1784.517 rows=33333 loops=3)
Recheck Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
Heap Blocks: exact=19968
-> Bitmap Index Scan on idx_gin_tsv (actual time=127.531..127.532 rows=100000 loops=1)
Index Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
Planning Time: 1.290 ms
Execution Time: 1853.445 ms
(16 rows)
The Sort node is unavoidable. GIN collected all matching TIDs, accessed the heap, sorted them, and retained the top 5.
Phrase search with distance — Recheck Required:
postgres=# EXPLAIN (COSTS OFF, ANALYZE)
SELECT id, title
FROM articles
WHERE tsv @@ to_tsquery('simple', 'postgresql <-> article')
LIMIT 5;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Limit (actual time=357.799..357.803 rows=0 loops=1)
-> Bitmap Heap Scan on articles (actual time=357.795..357.797 rows=0 loops=1)
Recheck Cond: (tsv @@ '''postgresql'' <-> ''article'''::tsquery)
Rows Removed by Index Recheck: 100000
Heap Blocks: exact=55556
-> Bitmap Index Scan on idx_gin_tsv (actual time=48.745..48.746 rows=100000 loops=1)
Index Cond: (tsv @@ '''postgresql'' <-> ''article'''::tsquery)
Planning Time: 0.561 ms
Execution Time: 357.873 ms
(9 rows)
With Rows Removed by Index Recheck, GIN found candidates containing both words but had to access the heap to verify adjacency. No positional information in the index.
What's Next
These three limitations — no ordering, no positions, bitmap-only scans — are exactly what RUM was designed to solve. In the next post of this series, we'll see how a single architectural change (adding addInfo to each posting list entry) eliminates all three.
Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools
The transformative power of LLMs in coding has been irrefutable, and it feels like we are living through a magical computing renaissance. On the socials, we hear impressive numbers of lines of code generated, features delivered, and bugs fixed. But, the macroeconomic indicators seem to be still lagging. Heck, if you talk with an engineering manager, you find that their product shipping dates haven't miraculously compressed by a factor of five, either.
This paper just landed 10 days ago. It is from MIT and Wharton by Mert Demirer, Leon Musolff, and Liyuan Yang. Their study attempts to provide a structured economic model for evaluating actual productivity obtained from AI coding tools. By pairing confidential Microsoft telemetry with the public footprints of over 100,000 GitHub developers (tracking everything from open-source utilities to web app repositories), the authors show significant systemic friction downstream of AI code generation.
Of course, I do my usual skeptical critic of the paper. In this case, this is especially heightened because these are economists peeking into the messy non-linear world of software engineering and trying to impose a "production hierarchy" abstraction onto it. But if we reconsider their analysis from a different perspective, it becomes possible to translate their complex production functions into Amdahl's Law terms, and then we can start doing our own evaluations and draw our own conclusions as I discuss below.
The Monotonic Decay
The core of the paper rests on this monotonic decay argument. The sheer task-level velocity gains we see from AI coding tools start to bleed out as they move up the production hierarchy. The authors break down AI tool adoption into three distinct generational tiers:
- Autocomplete (intelligent text prediction),
- Synchronous (Sync) Agents (interactive, real-time code modifiers like Claude Code or Cursor),
- Asynchronous (Async) Agents (autonomous async agents).
When we look at the task-level velocity of these tools, we see impressive numbers. The paper's abstract claims that autocomplete, interactive sync agents, and autonomous async agents increase overall commit activity by cumulative totals of 40%, 140%, and 180% respectively. But as that work climbs toward an official production milestone, the improvements decay significantly.
For Autocomplete, the +228.2% explosion in raw lines of code bleeds out layer-by-layer until becomes a meager +10.2% increase in actual shipped software releases. For Sync agents, a gigantic +741.3% surge in code syntax reduces down to a modest +20.3% final weekly releases.
My immediate reaction to this vertical hierarchy (Lines -> Files -> Commits -> PRs -> Repositories -> Releases) is skepticism. Treating code production like a neat production line feels superficial. Software engineering is not a linear conveyor belt, as coding is highly nonlinear, and a single commit routinely alters fifty files. However, giving the authors the benefit of doubt and reading onwards, I find that there is still value to this naive abstraction, as it points to human gatekeeping and coordination overhead at higher levels of the CI/CD pipeline. AI can write lines of code instantly, not being bogged down by the code syntax at the lower layers. But as that work climbs toward an official production milestone, the structural constraints of the system and human bottlenecks take over, and the massive improvements at the task level decay down to nearly nothing.
Let's dive deeper on the mathematical modeling behind this. By taking only the performance of Autocomplete into account (because it operates exclusively at the code-writing level), the authors chose parameters that minimized the differences between their model's predictions and actual developer behavior. Through this exercise, they extracted a local Upstream Output Elasticity ($\theta$) of approximately 0.75. In this layered production model, $\theta = 0.75$ acts as a vertical pass-through metric. It means that at any given stage (say, turning raw commits into clean pull requests), 75% of that layer's success leans entirely on the upstream technical assets flowing into it from the layer below, while a remaining fraction represents the local human effort added at that layer. Because they model software development as a vertically sequential aggregation process, that human intervention operates like a compounding efficiency tax. A massive initial code productivity surge at the bottom layer gets relentlessly multiplied by 0.75 over and over again as it attempts to climb the hierarchy, mathematically forcing the steep vertical attenuation we see in the empirical data.
Now, the sync and async agents aren't just typing lines inside an editor. They operate at a level where they directly manage files, commits, and pull requests. This expanded layer coverage allows agentic workflows to drop their productivity contributions closer to the finish line. By short-circuiting the early stages of the vertical decay chain, agents handle the work more efficiently, doubling the final impact on shipped software compared to autocomplete as seen in Figure 1.
Translating the Economics formulas to Amdahl’s Law
To model what happens inside each of these individual layers, the paper transitions to a nested Constant Elasticity of Substitution (CES) production function. Here, they extract an Elasticity of Substitution ($\sigma$) between AI-generated code and human review effort of a rigid 0.25. In economic lingo, an elasticity of substitution well below $1.0$ means the inputs are "strong complements". That means they are tied/dependent together like a car chassis and tires. It doesn't matter if an automated factory line can manufacture tires 10,000% faster; if you don't speed up the production of the chassis, you don't get more cars.
This of course looks a whole lot like Amdahl’s Law, which dictates that the overall speedup of a system is strictly limited by its sequential un-parallelizable bottlenecks:
$S_{\text{total}} = \frac{1}{(1-P) + \frac{P}{S_{\text{task}}}}$
where $S_{\text{task}}$ is the speedup achieved at the automated task level, and $P$ is the "Global Parallelizable Fraction" of the entire system workload.
When the elasticity of substitution ($\sigma$) between machine output and human validation drops to 0.25, the economic model behaves almost exactly like a Leontief production function. (A Leontief production function describes a strict, zero-flexibility production process where inputs must be combined in exact, unalterable proportions, meaning an excess of one ingredient cannot substitute for a shortage of another.) This dictates that human code review is a non-negotiable strictly sequential bottleneck ($1 - P$). If $\sigma$ were infinite, you could completely substitute human verification with raw AI text volume effectively parallelizing the entire layer. But because $\sigma = 0.25$, throwing an infinite mountain of automated code ($S_{\text{task}} \to \infty$) at the problem does nothing to diminish the fixed, sequential time investment required for a human to review it.
If we run the paper's real-world empirical findings through this equation (isolating the global parallel fraction $P$ for commits against final releases), we find the following:
- Autocomplete: $S_{\text{task}} = 1.359\times$, $S_{\text{total}} = 1.102\times \implies \mathbf{P \approx 35.0\%}$
- Sync Agents: $S_{\text{task}} = 2.091\times$, $S_{\text{total}} = 1.202\times \implies \mathbf{P \approx 32.2\%}$
- Async Agents: $S_{\text{task}} = 2.800\times$, $S_{\text{total}} = 1.300\times \implies \mathbf{P \approx 35.9\%}$
Huh! Even going from a simple inline autocomplete tool to fully autonomous agents that clone repositories and run test suites out-of-band, the global parallelizable fraction ($P$) refuses to budge and hovers around 35% across all three generations of AI!
Then how come Sync and Async agents manage to squeeze out 2-3x more final software releases than Autocomplete? While Autocomplete notches a modest 10.2% release expansion, Sync agents push it to 20.3%, and the cumulative stack of Async tools lifts the final output baseline up by 30%. If the global parallelizable envelope ($P$) is locked at 35% from the formulas above, how is the system actually accelerating? This is because according to Amdahl's Law, a system has two entirely separate levers for optimization. You can either increase $P$, or you can aggressively push harder and increase $S_{\text{task}}$. Autocomplete achieved a commit-level speedup ($S_{\text{task}}$) of just $1.359\times$ relative to releases. But Sync agents drive that localized task speedup to $2.091\times$, and Async agents push $S_{\text{task}}$ to $2.800\times$. But this hits a wall of diminishing returns quickly. When the parallelizable footprint ($P$) is pinned to 35%, scaling the localized task speedup ($S_{\text{task}}$) toward infinity yields a hard asymptotic ceiling. Mathematically, the absolute maximum total speedup this configuration can ever achieve is $1 / (1 - 0.35)$, which works out to a hard cap of a 53% overall increase ($1.53\times$) in shipped software. So, no matter how fast an autonomous bot can process a commit, the remaining 65% human sequential bottleneck ($1 - P$) acts as a hard stop.
Is P=0.35 sensible? I think that 35% parallel fraction passes the smell test. If you ask any developer what percentage of their week is spent actually writing code, they'll give you a number right in this ballpark, around 20% to 40%. The remaining 65% of the developers' time is consumed by finding/defining the task, planning, and paying the inevitable human communication tax of meetings and team alignment.
This is why that flatlining 35% profile across all three tool generations makes sense to me. Generative AI can supercharge the coding sandbox by churning out code at high speed (maxing out $S_{\text{task}}$), but it can't parallelize the systemic reasoning, organizational consensus, and deep problem/model comprehension that dominates the rest of the job. Until we find a way to automate those parts as well, shipping software will remain an inherently human-throttled process.
June 08, 2026
Automate Oracle PL/SQL to PostgreSQL migration with Amazon Bedrock and Strands Agents
Building Python applications with SQLAlchemy and Aurora DSQL
A Case for Simulation-Driven Resilience in Agentic Data Systems
As I mentioned in my previous post, I traveled to San Jose at the end of May for the ACM CAIS conference. On Day 0, I gave a very short talk at the Supporting our AI Overlords (SAO) workshop. This post is the promised summary of our paper, "A Case for Simulation-Driven Resilience in Agentic Data Systems", joint work with Aleksey Charapko (University of New Hampshire) and Akshat Vig (MongoDB).
Metastability is critical for building the next generation of distributed systems
Our story starts with metastability. Metastability is the failure mode where the mechanisms built to protect the system (retries, queues, timeouts, load shedding) turn into amplifiers. Even after the trigger that caused the overload goes away, the system stays behind, churning through busy work, perpetually trying to catch up with the remnants of failed and behind-schedule tasks. It's a bit like missing some foundational math in high school. You spend so long backfilling the old gaps that you never keep up with the new material piling on top, so you stay permanently behind. The catching up work is what keeps you behind, which requires catching up work later, which keeps you behind. (This is presumably why I'm not a rich machine learning scientist today at Deep Mind.)
Avoiding and tolerating metastable failures is critical to building the next generation of reliable distributed systems. Aleksey's Metastable Failures in the Wild study (OSDI'22) cataloged these failures from real production incidents. They matter so much because they are the hard faults that remain. We have largely learned to deal with the straightforward ones (crashes, corruptions, dropped packets), which leaves emergent performance failures as the final boss. Metastable failures are responsible for a disproportionate share of critical cloud unavailability incidents, with no single broken part to point at and no obvious way to fix/reset them when they emerge. The cloud economics make the problem worse. Providers and operators have every incentive to run with the absolute minimum of excess capacity and to trim slack and "waste". But this thin margin is exactly where metastability thrives.
Agents supercharge the feedback loops and put metastability on steroids
As AI agents are replacing humans as the primary clients of modern data systems, they are bringing a qualitative shift in workload characteristics. Agents retry aggressively while mutating the query on each attempt. They fan out into bursty parallel sub-tasks. They hold transactions open while they wait on an external LLM to provide the next step. The "Supporting our AI overlords" paper showed that agents create ~20x more branches and perform ~50x more rollbacks than humans do. These behaviors violate assumptions baked into every layer of a modern data system. Execution control assumes stationary arrivals. Caching assumes temporal locality. Concurrency control assumes bounded hold times. Agents break all three at once!
We propose simulation-driven resilience to address this problem
Simulation can enable us to systematically explore the agent-database boundary, and discover/prevent metastability failures before a production incident forces a reactive (and nonworking) patch. We propose a simulation based approach because only simulation is cheap enough to sweep an enormous trigger space, and deterministic enough to replay and dissect every failure it finds.
- Benchmarks measure steady state. But, metastability is transient and emergent: it lives in the sequence of rare events, not in the average.
- Queueing theory assumes mostly stationary independent arrivals. Agents break these assumptions.
- Testing with the production system is hopeless for design sweeps and gives you almost no observability. When your database falls over under load, you still don't know what went wrong, or how it went wrong, and you can't explore the design space because you have no feedback to explore it with.
MESSI finds failures in the seams
Metastability scurries in the seams/interaction of the composition of subsystems, so we need a tool that lets us look there. Aleksey developed MESSI (MEtaStability SImulator), a discrete-event simulation framework for exploring metastability dynamics in distributed systems. MESSI enables modeling any (sub)system as a directed graph and composition of (sub)systems. Logic Nodes implement routing and state policy. Processors model the physical resource constraints (queues, I/O delays, network latency). Individual work items, QItems, carry state as they traverse the graph. There is a clear separation of roles: policy lives in the nodes, resource contention lives in the edges, and you can vary each independently.
Because this is a simulator, it is deterministic and replayable, and it exposes the full internal state of every component at every tick. A metastable trigger you discover once can then be re-run against alternate designs to see which ones survive.
Our findings from the Execution Control System (ECS) simulations
Using MESSI, we performed an analysis of the Execution Control System (ECS), because it is the critical first domino: when the ECS fails, every subsystem downstream of it (caches, buffer pools, lock managers) fails after it. The ECS sits between admitted requests and the execution engine and decides who runs, in what order, with how much. It is the component that mediates resource contention at the backend. The usual design hands out a bounded pool of execution tickets (one ticket buys one worker thread), sorts admitted requests into a few priority queues with different ticket budgets, and dispatches them to worker threads and I/O slots. But, unlike an OS scheduler, which aims for fairness and wants every thread to eventually run, the ECS has to make decisive choices. It needs to prioritize the latency-sensitive short queries and shed the excess, as it aims the cost of waiting off the server and back onto the client. This is, of course, exactly what closes the feedback loop with a retrying agent.
We found two interesting results in our analysis.
Two reasonable policies may compose into a metastable loop
A natural ECS design uses two queues, a high-priority one for short tasks and a low-priority one for long tasks, with a probing policy on each that nudges its ticket count up or down to chase a performance metric. Each policy is sensible in isolation. But when you compose them, they interfere with each other on occasion in a metastable manner. The long queue probes for more tickets to improve its own throughput. More long-task tickets means more threads contending for the same cores, which steals CPU from the short queue, which then escalates its ticket count to keep up. Now both queues are inflating until they slam into their hard limits. The trap is that ticket acquisition rate, the metric each policy is optimizing, stops predicting actual progress under overload. A queue full of waiting tasks can churn tickets at a furious rate (grab a ticket, do 1 ms of work, yield, repeat) while getting almost nothing done. The metric looks healthy while the system performance collapses.
Admission control and the ECS destructively interfere
Putting an admission controller in front of the ECS sounds like defense in depth, but this can also backfire when done naively. Admission control drops requests indiscriminately. It sits at the network edge, and when it sees elevated latency, it starts rejecting a fraction of everything, short and long task alike. In simulations, we found that a workload spike can trigger admission control immediately, even though the ECS, left alone, would have rebalanced its tickets and absorbed the load after a brief adjustment. But the drops from the admission control prevent the ECS from rebalancing. With work being shed out from under it, the ECS never gets the signal it needs to adjust allocations and priorities, and the system stays parked in reduced-goodput mode until the workload subsides on its own. Under agents this gets worse, because admission control can't tell an agent's first attempt from its fifth retry, and a rejection would usually cause the agent to escalate things.
The full ECS design study is in our companion paper, "Towards Designing an Execution Control System with Metastability Resilience" (to appear at IEEE ICCCN 2026).
June 07, 2026
Getting Paid by Flat Rate Movers
Back in 2023, I hired Flat Rate Movers (A.K.A. Flat Rate Moving) for an interstate move. They subcontracted to a third party who showed up under-staffed, under-equipped, and very confused; the whole mess wound up causing a good deal of damage to my belongings and home. I filed a claim with Flat Rate and also requested they issue a partial refund. They refused to talk to me or their insurance company, and after six months of calling, emailing, and writing, I got overwhelmed by other life struggles and gave up. I did manage to file a claim with their backing insurance company, but got nothing for the damage to the house, or failure to deliver goods and services.
In October 2025 Flat Rate started sending me e-mails asking me to use Flat Rate for a move again. I called, explained the situation, and asked how they were planning to resolve it. They invited me to write to customer service, and the whole process resumed, with essentially the same result. I’d call every few days to ask how things were going, and various Flat Rate representatives would promise they’d get back to me tomorrow, or a check was in the mail, or that they’d talk to claims for me. Unsurprisingly, none of this worked.
Long story short, after a bunch of claim forms, emails, letters, complaints to various regulators, requests for arbitration, and one hundred and sixty-one phone calls with Flat Rate directly, I have succeeded. I started calling their main line at (212) 988-9292, picking a three-digit extension, and asking that person for help. I did this ten times each morning, and within two weeks had multiple people offering to help. I can’t say for sure which person got things over the line, because I called a lot of them, but by April 2026, I had a check in the mail.
If you wind up in the same boat, I recommend two things. First, keep a log of everything you do. Flat Rate may tell you things like “we did not receive supporting documentation within sixty days” or “you did not file the claim in time”; it is extremely helpful to be able to say things like “Please see my email from 2023-05-06 at 15:32 in which I submitted the requested documentation.” Second, set up a spreadsheet to keep track of which extensions you’ve called and when. Here's mine—each row number is the three-digit extension.
Extended RUM in DocumentDB extension for PostgreSQL: Efficient ESR (Equality, Sort, Range) Queries
Last year, I examined RUM indexes within this series on multi-key indexing, demonstrating that they cannot substitute MongoDB's compound indexes for sorted queries. A year later, the DocumentDB extension has evolved to incorporate an Extended RUM index, which allows for an ordered scan rather than a bitmap scan. Let's revisit our pagination query to see how it performs now.
I start a container with the latest DocumentDB (version v0.112-0 from May 26, 2026):
docker run -d --name documentdb-local -p 10260:10260 -p 9712:9712 ghcr.io/documentdb/documentdb/documentdb-local:latest --username franck --password franck --start-pg
I can connect to PostgreSQL on port 9712, where many extensions are installed, including the extended RUM index:
docker exec -it documentdb-local psql -p 9712 postgres
psql (17.10 (Debian 17.10-1.pgdg13+1))
Type "help" for help.
postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
-------------------------+---------+------------+------------------------------------------------------------
documentdb | 0.112-0 | public | API surface for DocumentDB for PostgreSQL
documentdb_core | 0.112-0 | public | Core API surface for DocumentDB on PostgreSQL
documentdb_extended_rum | 0.112-0 | public | DocumentDB Extended RUM index access method
pg_cron | 1.6 | pg_catalog | Job scheduler for PostgreSQL
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
postgis | 3.6.3 | public | PostGIS geometry and geography spatial types and functions
tsm_system_rows | 1.0 | public | TABLESAMPLE method which accepts number of rows as a limit
vector | 0.8.2 | public | vector data type and ivfflat and hnsw access methods
(8 rows)
postgres=#
I can also connect to the MongoDB-compatible API:
docker exec -it documentdb-local mongosh -u franck -p franck 'mongodb://localhost:10260/?tls=true&tlsAllowInvalidCertificates=true'
Current Mongosh Log ID: 6a0b3b537d2a1c3471d1a7ba
Connecting to: mongodb://<credentials>@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true&directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.8.3
Using MongoDB: 7.0.0
Using Mongosh: 2.8.3
For mongosh info see: https://www.mongodb.com/docs/mongodb-shell/
[direct: mongos] test>
Like in the previous post, I created a simple collection with 10,000 documents:
[direct: mongos] test>
for (let i = 0; i < 10000; i++) {
db.demo.insertOne({
a: 1,
b: Math.random(),
ts: new Date()
});
}
I create a compound index that follows the MongoDB Equality, Sort, Range rule—designed for queries with an equality filter on a and a sort on ts:
[direct: mongos] test>
db.demo.createIndex({ "a": 1, "ts": -1 });
I run the same query as in the previous post, which, with the standard RUM indexes, produced a Bitmap Index Scan followed by a Sort of all documents matching a: 1 before returning the top 10:
[direct: mongos] test>
db.demo.find(
{ a: 1 }
).sort(
{ ts: -1 }
).limit(10).explain("executionStats");
The good surprise is that with the current version of DocumentDB, the execution plan looks like MongoDB's native IXSCAN with no additional sort step:
[direct: mongos] test> db.demo.find(
{ a: 1 }).sort({ts:-1}).limit(10).explain("executionStats")
;
{
explainVersion: 2,
...
executionStats: {
nReturned: Long('10'),
executionTimeMillis: 0.286,
executionStartAtTimeMillis: 0.256,
totalDocsExamined: Long('10'),
totalKeysExamined: Long('10'),
executionStages: {
stage: 'LIMIT',
nReturned: Long('10'),
executionTimeMillis: 0.286,
executionStartAtTimeMillis: 0.256,
totalDocsExamined: 10,
totalKeysExamined: 10,
numBlocksFromCache: 25,
inputStage: {
stage: 'FETCH',
nReturned: Long('10'),
executionTimeMillis: 0.267,
executionStartAtTimeMillis: 0.253,
totalKeysExamined: 10,
numBlocksFromCache: 25,
inputStage: {
stage: 'IXSCAN',
nReturned: Long('10'),
executionTimeMillis: 0.267,
executionStartAtTimeMillis: 0.253,
indexName: 'a_1_ts_-1',
totalKeysExamined: 10,
numBlocksFromCache: 25
}
}
}
},
ok: 1
}
It read only the minimum necessary to get the result: ten index entries (totalKeysExamined: 10) in the expected order and fetched only ten documents (totalDocsExamined: 10). This is the most efficient execution plan.
Comparing the Two RUM Index Definitions
I connect to PostgreSQL to describe the table that stores my collection documents (you will see later how I obtained the name):
postgres=# \d documentdb_data.documents_7
Table "documentdb_data.documents_7"
Column | Type | Collation | Nullable | Default
-----------------+--------+-----------+----------+---------
shard_key_value | bigint | | not null |
object_id | bson | | not null |
document | bson | | not null |
Indexes:
"collection_pk_7" PRIMARY KEY, btree (shard_key_value, object_id)
"documents_rum_index_25" documentdb_extended_rum (document documentdb_extended_rum_catalog.bson_extended_rum_composite_path_ops (pathspec='[ "a", { "ts" : -1 } ]', tl='2691'))
Check constraints:
"shard_key_value_check" CHECK (shard_key_value = '7'::bigint)
postgres=#
What was a standard RUM index in the previous post is now an extended RUM index:
| Attribute | Previous post | Current test |
|---|---|---|
| Index Type | documentdb_rum |
documentdb_extended_rum |
| Operator Class |
bson_rum_single_path_ops (×2) |
bson_extended_rum_composite_path_ops |
| Fields |
a (asc), ts (implicit asc) |
a (asc), ts (desc) |
Sort Direction on ts |
Not specified / default ascending | Explicitly -1 (descending) |
| Path Encoding | Two separate path= entries |
Single JSON pathspec array |
The extended RUM index acts as a sort-order-aware composite index, embedding the descending direction directly into the pathspec. Unlike the previous approach, which stored each path independently, this approach encodes all indexed fields as a single composite pathspec and generates a single composite index entry per document, preserving the relative ordering between fields. An index scan (RumOrderedScan) efficiently covers both filtering and sorting, eliminating the need for a separate Sort node in the PostgreSQL execution plan. This benefit is evident when executing the same query via the DocumentDB API in PostgreSQL:
postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
'test',
'{
"find": "demo",
"filter": { "a": 1 },
"sort": { "ts": -1 },
"limit": 10
}'::documentdb_core.bson
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Limit (actual time=0.108..0.138 rows=10 loops=1)
Output: document, (bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson))
Buffers: shared hit=4
-> Index Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.104..0.117 rows=10 loops=1)
Output: document, bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson)
Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
Order By: (collection.document |-<> '{ "ts" : { "$numberInt" : "-1" } }'::bson)
Buffers: shared hit=4
Planning:
Buffers: shared hit=2
Planning Time: 0.453 ms
Execution Time: 0.192 ms
(12 rows)
postgres=#
Note: I got the name of the internal table I described above from this execution plan, which uses the collection name in the query. The MongoDB API's explain() shows a MongoDB-compatible execution plan, and EXPLAIN in PostgreSQL shows the PostgreSQL version of it.
Comparison of Execution Plans
Here is how the new behavior with an ordered index scan compares to the previous bitmap scan.
| Feature | Ordered Index Scan | Bitmap Index Scan |
|---|---|---|
| PostgreSQL Node | Index Scan |
Bitmap Index Scan |
| Ordering | Handled by sort direction in index | Lost—requires a Sort node |
| Scan Type | scanType: RumOrderedScan |
scanType: RumFastScan / RumRegularScan |
| Efficiency | Supports early termination (LIMIT) |
Must scan all matching TIDs into bitmap |
| RUM Entry Point | rumgettuple() |
rumgetbitmap() |
| Sort Step | None — useSimpleScan = true
|
rum_tuplesort_performsort() required |
| Memory Usage | Low—one tuple at a time | High—full TIDBitmap + sort state |
| Index Structure Used | B-tree walk via orderStack
|
Posting list / posting tree dump |
| Filter Evaluation | Inline via ValidateIndexEntry()
|
Post-collection in keyGetItem()
|
| Seek Optimization | Yes—advances queryKey as entries exhaust |
No |
| Multi-column Support | Multi-column via composite pathspec | Multi-column via separate entries |
LIMIT benefit |
✅ Full — stops after N rows | ❌ None — bitmap built before LIMIT applies |
| Recheck Behavior |
xs_recheckorderby per tuple |
xs_recheck on bitmap result |
| Trigger Condition |
RumEnableOrderedOperatorScans + willSort + norderbys > 0
|
Default path |
Index Only Scan
Other improvements are coming to Extended RUM, like Index Only Scan, currently supported for COUNT:
postgres=# explain (analyze, buffers, verbose, costs off)
SELECT document FROM bson_aggregation_count(
'test',
'{
"count": "demo",
"query": { "a": 1 }
}'::documentdb_core.bson
);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (actual time=22.760..22.763 rows=1 loops=1)
Output: documentdb_api_internal.bsoncommandcount(1)
Buffers: shared hit=109
-> Index Only Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.089..14.292 rows=10000 loops=1)
Output: collection.document
Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
Heap Fetches: 0
Buffers: shared hit=109
Planning:
Buffers: shared hit=4
Planning Time: 0.441 ms
Execution Time: 22.883 ms
(12 rows)
Index Only Scan will be supported in the future (see IsQueryValidForIndexOnlyScan)
Conclusion
A year ago, DocumentDB's RUM indexes had a significant limitation for pagination queries: even with the right compound index, the planner would fall back to a Bitmap Index Scan followed by a full Sort, meaning every matching document had to be collected and sorted before the first result could be returned. A LIMIT 10 query on 10,000 documents would examine all 10,000—defeating the purpose of the compound index.
With v0.112-0, this is fixed. The new documentdb_extended_rum index type, combined with the RumOrderedScan execution path, reduces the gap with native MongoDB behavior:
- The index encodes sort direction directly in the
pathspec({ "ts": -1 }) - The planner chooses an Index Scan instead of a Bitmap Index Scan
- No
Sortnode appears in the plan -
LIMIT 10examines exactly 10 index entries and 10 documents
This is more than just a cosmetic change. In time-series queries—such as filtering on a low-cardinality field, sorting by timestamp descending, and retrieving the first page—the difference between the two plans ranges from O(result) to O(total size). For OLTP systems, pagination queries are common and need to be quick and reliable, since they show results to the user before the user takes any action, selects, refines filters, or moves to the next page.
This ordered scan is also essential for TTL indexes to efficiently identify expiration candidates.
The key ingredients that make this work together are visible from the DocumentDB open-source code:
-
documentdb_extended_rum—compositepathspecwith explicit sort direction -
bson_extended_rum_composite_path_ops— single operator class covering all fields -
RumOrderedScan— B-tree walk in index order viaorderStack, bypassingrumgetbitmap() -
useSimpleScan— returns one tuple at a time, enabling trueLIMITpushdown -
RumAllowOrderByRawKeys— the GUC that enables this path, now on by default
This is all enabled by default:
postgres=# \dconfig *rum*order*
List of configuration parameters
Parameter | Value
----------------------------------------------+-------
documentdb_rum.enable_ordered_operator_scans | on
documentdb_rum.forceRumOrderedIndexScan | off
(2 rows)
In under a year, DocumentDB evolved from "RUM instead of GIN, but with the same pagination limitations" to "RUM with ordered scan, aligning more with MongoDB's IXSCAN behavior for ESR-pattern indexes". For developers implementing cursor-based pagination or queries with a selective filter and sorting on a time or sequence field, this marks the version at which it begins to function as expected.
June 05, 2026
Getting Started with pg_durable: Durable Workflows Inside PostgreSQL
Modern applications demand long‑lasting workflows that can withstand crashes, restarts, and failures, often relying on external platforms. pg_durable integrates durable workflow orchestration directly into PostgreSQL, enabling workflows to be created in SQL and reliably executed, monitored, inspected, and recovered from disruptions.
This tool is particularly useful for ETL processes, data pipelines, background tasks, scheduled jobs, long‑running business procedures, and internal workflow management.
To truly grasp its capabilities, the best approach is to try it yourself. I did so by building and running it locally, which provided practical insight.
Installing pg_durable
I installed Rust to compile the extension:
curl https://sh.rustup.rs -sSf | sh
. "$HOME/.cargo/env"
I installed the PostgreSQL build dependencies.
sudo dnf install -y openssl-devel pkg-config libicu-devel readline-devel flex
I installed pgrx for PostgreSQL 17:
cargo install cargo-pgrx --version 0.16.1 --locked
cargo pgrx init --pg17=download
export PG_VERSION=pg17
I enabled the extension preload library:
sed -e "/shared_preload_libraries/s/^.*$/shared_preload_libraries='pg_durable'/" \
-i ~/.pgrx/data-17/postgresql.conf
I declared the superuser that I'll create to run the background worker:
cat >> ~/.pgrx/data-17/postgresql.conf <<<"pg_durable.worker_role='superworker'"
I cloned pg_durable from the Microsoft repository:
git clone https://github.com/microsoft/pg_durable.git
cd pg_durable
I built the extension, started PostgreSQL, and got a psql prompt:
cargo pgrx run
I checked the pg_durable configuration, connected to the pg_durable.database, created the pg_durable.worker_role user, and installed the extension:
pg_durable=# \dconfig pg_durable.*
List of configuration parameters
Parameter | Value
---------------------------------------+-------------
pg_durable.database | postgres
pg_durable.enable_superuser_instances | off
pg_durable.execution_acquire_timeout | 30
pg_durable.max_duroxide_connections | 10
pg_durable.max_management_connections | 6
pg_durable.max_user_connections | 10
pg_durable.worker_role | superworker
(7 rows)
\c postgres
CREATE ROLE superworker SUPERUSER LOGIN;
CREATE EXTENSION pg_durable;
My PostgreSQL instance is now ready to use durable functions.
Hello World
I create a user that will execute workflows and grant access to pg_durable:
CREATE USER franck;
GRANT CREATE ON DATABASE postgres TO franck;
SELECT df.grant_usage('franck');
I reconnect as that user:
postgres=# \c postgres franck
You are now connected to the database "postgres" as user "franck".
postgres=>
I start a durable workflow:
postgres=> SELECT df.start(
$$SELECT ''Hello, durable world!'' AS message$$
);
start
----------
393f09e0
(1 row)
One second later, it is completed:
postgres=> SELECT df.status('393f09e0');
status
-----------
completed
(1 row)
The result can be fetched as a document with a list of rows:
postgres=> SELECT df.result('393f09e0');
result
------------------------------------------------------------------
{"rows": [{"message": "Hello, durable world!"}], "row_count": 1}
(1 row)
I used the same "Hello, durable world!" example as in the HorizonDB durable functions documentation, because this is where pg_durable comes from: Microsoft provides it to PostgreSQL as an open-source extension.
Here is more information about this execution:
postgres=> \x
Expanded display is on.
postgres=> SELECT * FROM df.list_instances();
-[ RECORD 1 ]---+-------------------------------------------------------------
instance_id | 393f09e0
label |
function_name | pg_durable::orchestration::execute-function-graph
status | completed
execution_count | 1
output | {"rows":[{"message":"Hello, durable world!"}],"row_count":1}
postgres=> \x
Expanded display is off.
postgres=>
The execution graph is basic, as there's a single operation:
postgres=> SELECT df.explain('393f09e0');
explain
------------------------------------------------------------------------
Instance: 393f09e0 +
Status: ✓ Completed +
Output: {"rows":[{"message":"Hello, durable world!"}],"row_count":1}+
+
SQL: SELECT 'Hello, durable world!' AS message ✓
(1 row)
There is a single node:
postgres=> \x
Expanded display is on.
postgres=> SELECT * from df.instance_nodes('393f09e0');
-[ RECORD 1 ]+-----------------------------------------------------------------
execution_id | 1
node_id | 60e433bb
node_type | SQL
query | SELECT 'Hello, durable world!' AS message
result_name |
left_node |
right_node |
status | completed
result | {"rows": [{"message": "Hello, durable world!"}], "row_count": 1}
updated_at | 2026-06-04 16:36:38.594906+00
postgres=> \x
Expanded display is off.
It was executed once:
postgres=> SELECT * from df.instance_executions('393f09e0');
execution_id | status | event_count | duration_ms | output
--------------+-----------+-------------+-------------+--------------------------------------------------------------
1 | Completed | 14 | 692 | {"rows":[{"message":"Hello, durable world!"}],"row_count":1}
(1 row)
Here are the global metrics:
postgres=> SELECT * FROM df.metrics();
total_instances | running_instances | completed_instances | failed_instances | total_executions | total_events
-----------------+-------------------+---------------------+------------------+------------------+--------------
1 | 0 | 1 | 0 | 1 | 14
(1 row)
This simply looks like ADBA—Asynchronous Database Access, but it can do more. We will see that it can define complex workflows with a declarative syntax. The examples start the workflow immediately, but it can also wait for a signal or a specific time using a cron syntax:
Beyond a Single Query
While a simple SQL statement demonstrates the basics, pg_durable becomes much more powerful when orchestrating multiple operations.
The SQL workflow DSL supports:
| Feature | Operator |
|---|---|
| Variable capture | => |
| Sequential execution | ~> |
| Parallel execution (wait) | & |
| Parallel execution (race) | ` |
| Conditional branching | {% raw %}?> and !>
|
| Infinite loops | @> |
Here are simple table-free examples demonstrating pg_durable's key features. Let's start with a sequential execution (~>):
postgres=> SELECT df.start(
'SELECT now() as step1' |=> 't1'
~> 'SELECT pg_sleep(5)'
~> 'SELECT now() as step2' |=> 't2'
~> 'SELECT now() as step3' |=> 't3',
'sequential-timing'
) as i \gset
-- Check the results after five seconds
postgres=> SELECT df.result(:'i');
result
---------------------------------------------------------------------------
{"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1}
(1 row)
postgres=> SELECT node_type, result_name, result, node_id, left_node, right_node FROM df.instance_nodes(:'i');
node_type | result_name | result | node_id | left_node | right_node
-----------+-------------+---------------------------------------------------------------------------+----------+-----------+------------
THEN | t3 | {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1} | 49c9c8b4 | 48fcec43 | 9f818758
THEN | t2 | {"rows": [{"step2": "2026-06-04T21:33:59.507283+00:00"}], "row_count": 1} | 48fcec43 | 348c1948 | c78f2200
THEN | | {"rows": [{"pg_sleep": null}], "row_count": 1} | 348c1948 | c7317ea8 | 200f8a52
SQL | t1 | {"rows": [{"step1": "2026-06-04T21:33:53.746030+00:00"}], "row_count": 1} | c7317ea8 | |
SQL | | {"rows": [{"pg_sleep": null}], "row_count": 1} | 200f8a52 | |
SQL | | {"rows": [{"step2": "2026-06-04T21:33:59.507283+00:00"}], "row_count": 1} | c78f2200 | |
SQL | | {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1} | 9f818758 | |
(7 rows)
I displayed the timestamps to show that steps 2 and 3 occur 5 seconds after step 1 because I used pg_sleep(5) in the sequential execution.
The node_type column indicates the type of operation each node represents in the function graph. Here, THEN denotes sequential execution: running the left child, then the right child, and SQL is the leaf node with no children, executing the SQL query.
The complete list of valid node types in pg_durable is:
| node_type | Description | Operator/Function |
|---|---|---|
| SQL | Execute SQL query | Plain string or df.sql()
|
| THEN | Sequential execution |
~> operator or df.seq()
|
| IF | Conditional branch |
?> !> operators or df.if()
|
| JOIN | Parallel execution (wait all) |
& operator or df.join()
|
| RACE | Parallel execution (first wins) |
\ operator or df.race()
|
| LOOP | Infinite loop |
@> operator or df.loop()
|
| BREAK | Exit loop | df.break() |
| SLEEP | Pause execution | df.sleep() |
| WAIT_SCHEDULE | Wait for cron schedule | df.wait_for_schedule() |
| HTTP | Make HTTP request | df.http() |
| SIGNAL | Wait for external signal | df.wait_for_signal() |
We can execute steps in parallel (&):
SELECT df.start(
('SELECT now() as branch1' |=> 'b1' ~> 'SELECT pg_sleep(20)')
&
('SELECT now() as branch2' |=> 'b2' ~> 'SELECT pg_sleep(10)')
~> 'SELECT now() as after_join' |=> 'final',
'parallel-sleep'
) as i \gset
postgres=> SELECT node_type, result_name, result, node_id, left_node, right_node FROM df.instance_nodes(:'i');
node_type | result_name | result | node_id | left_node | right_node
-----------+-------------+--------------------------------------------------------------------------------------------------+----------+-----------+------------
SQL | b1 | {"rows": [{"branch1": "2026-06-04T21:53:47.870539+00:00"}], "row_count": 1} | 9d479c96 | |
SQL | b2 | {"rows": [{"branch2": "2026-06-04T21:53:47.870606+00:00"}], "row_count": 1} | cb844cd8 | |
SQL | | {"rows": [{"pg_sleep": null}], "row_count": 1} | ad8ec143 | |
THEN | | {"rows": [{"pg_sleep": null}], "row_count": 1} | 1ff3adc8 | cb844cd8 | ad8ec143
SQL | | {"rows": [{"pg_sleep": null}], "row_count": 1} | 9db57c15 | |
THEN | | {"rows": [{"pg_sleep": null}], "row_count": 1} | 5e4d8069 | 9d479c96 | 9db57c15
JOIN | | [{"rows": [{"pg_sleep": null}], "row_count": 1}, {"rows": [{"pg_sleep": null}], "row_count": 1}] | 6f95c84d | 5e4d8069 | 1ff3adc8
SQL | | {"rows": [{"after_join": "2026-06-04T21:54:08.752247+00:00"}], "row_count": 1} | d4be28af | |
THEN | final | {"rows": [{"after_join": "2026-06-04T21:54:08.752247+00:00"}], "row_count": 1} | df59e250 | 6f95c84d | d4be28af
(9 rows)
Here, b1 and b2 started at the same time, in parallel. One branch completed after 20 seconds and the other after 10 seconds, so the final step started after 20 seconds, and the total duration was 20 seconds—it would have been more than 30 seconds if executed serially:
postgres=> SELECT * from df.instance_executions(:'i');
execution_id | status | event_count | duration_ms | output
--------------+-----------+-------------+-------------+----------------------------------------------------------------------------
1 | Completed | 26 | 22107 | {"rows":[{"after_join":"2026-06-04T21:54:08.752247+00:00"}],"row_count":1}
(1 row)
Parallel nodes are used when we need the result from all operations. The next step awaits all. If we want to continue as soon as one branch gets a result, we can run a race, where the first one wins (|):
SELECT df.start(
'SELECT pg_sleep(30)' | 'SELECT pg_sleep(90)' ~> 'SELECT ''fast'' as winner',
'race-test'
) as i \gset
select usename,backend_start,query_start,wait_event,state,query,application_name from pg_stat_activity where usename=user and pid!=pg_backend_pid(
)
\watch 10
Fri 05 Jun 2026 04:08:30 AM GMT (every 10s)
usename | backend_start | query_start | wait_event | state | query | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
franck | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep | active | SELECT pg_sleep(30) |
franck | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep | active | SELECT pg_sleep(90) |
(2 rows)
Fri 05 Jun 2026 04:08:40 AM GMT (every 10s)
usename | backend_start | query_start | wait_event | state | query | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
franck | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep | active | SELECT pg_sleep(30) |
franck | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep | active | SELECT pg_sleep(90) |
(2 rows)
Fri 05 Jun 2026 04:08:50 AM GMT (every 10s)
usename | backend_start | query_start | wait_event | state | query | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
franck | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep | active | SELECT pg_sleep(30) |
franck | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep | active | SELECT pg_sleep(90) |
(2 rows)
Fri 05 Jun 2026 04:09:00 AM GMT (every 10s)
usename | backend_start | query_start | wait_event | state | query | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
franck | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep | active | SELECT pg_sleep(90) |
(1 row)
Fri 05 Jun 2026 04:09:10 AM GMT (every 10s)
usename | backend_start | query_start | wait_event | state | query | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
franck | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep | active | SELECT pg_sleep(90) |
(1 row)
Fri 05 Jun 2026 04:09:20 AM GMT (every 10s)
usename | backend_start | query_start | wait_event | state | query | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
franck | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep | active | SELECT pg_sleep(90) |
(1 row)
Fri... (truncated)
Village News: MySQL/Database News + Events (5 June 2026)
June 04, 2026
Understanding how backups work in Amazon Aurora
Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication
Percona ClusterSync for MongoDB (PCSM) replicates data between MongoDB clusters to keep migrations with near-zero downtime. Prior to version 0.9.0 it required the source and target to run the same major version, which ruled out the lift-and-shift move most migrations want: going from an older major like 6.0 straight onto a newer one like 8.0. … Continued
The post Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication appeared first on Percona.