a curated list of database news from authoritative sources

June 08, 2026

A Case for Simulation-Driven Resilience in Agentic Data Systems

As I mentioned in my previous post, I traveled to San Jose at the end of May for the ACM CAIS conference. On Day 0, I gave a very short talk at the Supporting our AI Overlords (SAO) workshop. This post is the promised summary of our paper, "A Case for Simulation-Driven Resilience in Agentic Data Systems", joint work with Aleksey Charapko (University of New Hampshire) and Akshat Vig (MongoDB).


Metastability is critical for building the next generation of distributed systems

Our story starts with metastability. Metastability is the failure mode where the mechanisms built to protect the system (retries, queues, timeouts, load shedding) turn into amplifiers. Even after the trigger that caused the overload goes away, the system stays behind, churning through busy work, perpetually trying to catch up with the remnants of failed and behind-schedule tasks. It's a bit like missing some foundational math in high school. You spend so long backfilling the old gaps that you never keep up with the new material piling on top, so you stay permanently behind. The catching up work is what keeps you behind, which requires catching up work later, which keeps you behind. (This is presumably why I'm not a rich machine learning scientist today at Deep Mind.)

Avoiding and tolerating metastable failures is critical to building the next generation of reliable distributed systems. Aleksey's Metastable Failures in the Wild study (OSDI'22) cataloged these failures from real production incidents. They matter so much because they are the hard faults that remain. We have largely learned to deal with the straightforward ones (crashes, corruptions, dropped packets), which leaves emergent performance failures as the final boss. Metastable failures are responsible for a disproportionate share of critical cloud unavailability incidents, with no single broken part to point at and no obvious way to fix/reset them when they emerge. The cloud economics make the problem worse. Providers and operators have every incentive to run with the absolute minimum of excess capacity and to trim slack and "waste". But this thin margin is exactly where metastability thrives.


Agents supercharge the feedback loops and put metastability on steroids

As AI agents are replacing humans as the primary clients of modern data systems, they are bringing a qualitative shift in workload characteristics. Agents retry aggressively while mutating the query on each attempt. They fan out into bursty parallel sub-tasks. They hold transactions open while they wait on an external LLM to provide the next step. The "Supporting our AI overlords" paper showed that agents create ~20x more branches and perform ~50x more rollbacks than humans do. These behaviors violate assumptions baked into every layer of a modern data system. Execution control assumes stationary arrivals. Caching assumes temporal locality. Concurrency control assumes bounded hold times. Agents break all three at once!


We propose simulation-driven resilience to address this problem

Simulation can enable us to systematically explore the agent-database boundary, and discover/prevent metastability failures before a production incident forces a reactive (and nonworking) patch. We propose a simulation based approach because only simulation is cheap enough to sweep an enormous trigger space, and deterministic enough to replay and dissect every failure it finds.

  • Benchmarks measure steady state. But, metastability is transient and emergent: it lives in the sequence of rare events, not in the average.
  • Queueing theory assumes mostly stationary independent arrivals. Agents break these assumptions.
  • Testing with the production system is hopeless for design sweeps and gives you almost no observability. When your database falls over under load, you still don't know what went wrong, or how it went wrong, and you can't explore the design space because you have no feedback to explore it with.


MESSI finds failures in the seams

Metastability scurries in the seams/interaction of the composition of subsystems,  so we need a tool that lets us look there. Aleksey developed MESSI (MEtaStability SImulator), a discrete-event simulation framework for exploring metastability dynamics in distributed systems. MESSI enables modeling any (sub)system as a directed graph and composition of (sub)systems. Logic Nodes implement routing and state policy. Processors model the physical resource constraints (queues, I/O delays, network latency). Individual work items, QItems, carry state as they traverse the graph. There is a clear separation of roles: policy lives in the nodes, resource contention lives in the edges, and you can vary each independently.

Because this is a simulator, it is deterministic and replayable, and it exposes the full internal state of every component at every tick. A metastable trigger you discover once can then be re-run against alternate designs to see which ones survive.


Our findings from the Execution Control System (ECS) simulations

Using MESSI, we performed an analysis of the Execution Control System (ECS), because it is the critical first domino: when the ECS fails, every subsystem downstream of it (caches, buffer pools, lock managers) fails after it. The ECS sits between admitted requests and the execution engine and decides who runs, in what order, with how much. It is the component that mediates resource contention at the backend. The usual design hands out a bounded pool of execution tickets (one ticket buys one worker thread), sorts admitted requests into a few priority queues with different ticket budgets, and dispatches them to worker threads and I/O slots. But, unlike an OS scheduler, which aims for fairness and wants every thread to eventually run, the ECS has to make decisive choices. It needs to prioritize the latency-sensitive short queries and shed the excess, as it aims the cost of waiting off the server and back onto the client. This is, of course, exactly what closes the feedback loop with a retrying agent.

We found two interesting results in our analysis.


Two reasonable policies may compose into a metastable loop

A natural ECS design uses two queues, a high-priority one for short tasks and a low-priority one for long tasks, with a probing policy on each that nudges its ticket count up or down to chase a performance metric. Each policy is sensible in isolation. But when you compose them, they interfere with each other on occasion in a metastable manner. The long queue probes for more tickets to improve its own throughput. More long-task tickets means more threads contending for the same cores, which steals CPU from the short queue, which then escalates its ticket count to keep up. Now both queues are inflating until they slam into their hard limits. The trap is that ticket acquisition rate, the metric each policy is optimizing, stops predicting actual progress under overload. A queue full of waiting tasks can churn tickets at a furious rate (grab a ticket, do 1 ms of work, yield, repeat) while getting almost nothing done. The metric looks healthy while the system performance collapses.


Admission control and the ECS destructively interfere

Putting an admission controller in front of the ECS sounds like defense in depth, but this can also backfire when done naively. Admission control drops requests indiscriminately. It sits at the network edge, and when it sees elevated latency, it starts rejecting a fraction of everything, short and long task alike. In simulations, we found that a workload spike can trigger admission control immediately, even though the ECS, left alone, would have rebalanced its tickets and absorbed the load after a brief adjustment. But the drops from the admission control prevent the ECS from rebalancing. With work being shed out from under it, the ECS never gets the signal it needs to adjust allocations and priorities, and the system stays parked in reduced-goodput mode until the workload subsides on its own. Under agents this gets worse, because admission control can't tell an agent's first attempt from its fifth retry, and a rejection would usually cause the agent to escalate things.

The full ECS design study is in our companion paper, "Towards Designing an Execution Control System with Metastability Resilience" (to appear at IEEE ICCCN 2026).

June 07, 2026

Getting Paid by Flat Rate Movers

Back in 2023, I hired Flat Rate Movers (A.K.A. Flat Rate Moving) for an interstate move. They subcontracted to a third party who showed up under-staffed, under-equipped, and very confused; the whole mess wound up causing a good deal of damage to my belongings and home. I filed a claim with Flat Rate and also requested they issue a partial refund. They refused to talk to me or their insurance company, and after six months of calling, emailing, and writing, I got overwhelmed by other life struggles and gave up. I did manage to file a claim with their backing insurance company, but got nothing for the damage to the house, or failure to deliver goods and services.

In October 2025 Flat Rate started sending me e-mails asking me to use Flat Rate for a move again. I called, explained the situation, and asked how they were planning to resolve it. They invited me to write to customer service, and the whole process resumed, with essentially the same result. I’d call every few days to ask how things were going, and various Flat Rate representatives would promise they’d get back to me tomorrow, or a check was in the mail, or that they’d talk to claims for me. Unsurprisingly, none of this worked.

Long story short, after a bunch of claim forms, emails, letters, complaints to various regulators, requests for arbitration, and one hundred and sixty-one phone calls with Flat Rate directly, I have succeeded. I started calling their main line at (212) 988-9292, picking a three-digit extension, and asking that person for help. I did this ten times each morning, and within two weeks had multiple people offering to help. I can’t say for sure which person got things over the line, because I called a lot of them, but by April 2026, I had a check in the mail.

If you wind up in the same boat, I recommend two things. First, keep a log of everything you do. Flat Rate may tell you things like “we did not receive supporting documentation within sixty days” or “you did not file the claim in time”; it is extremely helpful to be able to say things like “Please see my email from 2023-05-06 at 15:32 in which I submitted the requested documentation.” Second, set up a spreadsheet to keep track of which extensions you’ve called and when. Here's mine—each row number is the three-digit extension.

Extended RUM in DocumentDB extension for PostgreSQL: Efficient ESR (Equality, Sort, Range) Queries

Last year, I examined RUM indexes within this series on multi-key indexing, demonstrating that they cannot substitute MongoDB's compound indexes for sorted queries. A year later, the DocumentDB extension has evolved to incorporate an Extended RUM index, which allows for an ordered scan rather than a bitmap scan. Let's revisit our pagination query to see how it performs now.

I start a container with the latest DocumentDB (version v0.112-0 from May 26, 2026):

docker run -d --name documentdb-local -p 10260:10260 -p 9712:9712 ghcr.io/documentdb/documentdb/documentdb-local:latest  --username franck --password franck --start-pg

I can connect to PostgreSQL on port 9712, where many extensions are installed, including the extended RUM index:

docker exec -it documentdb-local psql -p 9712 postgres

psql (17.10 (Debian 17.10-1.pgdg13+1))
Type "help" for help.

postgres=# \dx
                                        List of installed extensions
Name           | Version |   Schema   |                        Description                         
-------------------------+---------+------------+------------------------------------------------------------
 documentdb              | 0.112-0 | public     | API surface for DocumentDB for PostgreSQL
 documentdb_core         | 0.112-0 | public     | Core API surface for DocumentDB on PostgreSQL
 documentdb_extended_rum | 0.112-0 | public     | DocumentDB Extended RUM index access method
 pg_cron                 | 1.6     | pg_catalog | Job scheduler for PostgreSQL
 plpgsql                 | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgis                 | 3.6.3   | public     | PostGIS geometry and geography spatial types and functions
 tsm_system_rows         | 1.0     | public     | TABLESAMPLE method which accepts number of rows as a limit
 vector                  | 0.8.2   | public     | vector data type and ivfflat and hnsw access methods
(8 rows)

postgres=#

I can also connect to the MongoDB-compatible API:

docker exec -it documentdb-local mongosh -u franck -p franck 'mongodb://localhost:10260/?tls=true&tlsAllowInvalidCertificates=true'

Current Mongosh Log ID: 6a0b3b537d2a1c3471d1a7ba
Connecting to:          mongodb://<credentials>@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true&directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.8.3
Using MongoDB:          7.0.0
Using Mongosh:          2.8.3

For mongosh info see: https://www.mongodb.com/docs/mongodb-shell/

[direct: mongos] test>

Like in the previous post, I created a simple collection with 10,000 documents:

[direct: mongos] test>
 for (let i = 0; i < 10000; i++) {
  db.demo.insertOne({
    a: 1,
    b: Math.random(),
    ts: new Date()
  });
}

I create a compound index that follows the MongoDB Equality, Sort, Range rule—designed for queries with an equality filter on a and a sort on ts:

[direct: mongos] test>
 db.demo.createIndex({ "a": 1, "ts": -1 });

I run the same query as in the previous post, which, with the standard RUM indexes, produced a Bitmap Index Scan followed by a Sort of all documents matching a: 1 before returning the top 10:

[direct: mongos] test>
 db.demo.find(
 { a: 1 }
).sort(
 { ts: -1 }
).limit(10).explain("executionStats");

The good surprise is that with the current version of DocumentDB, the execution plan looks like MongoDB's native IXSCAN with no additional sort step:

[direct: mongos] test> db.demo.find(
  { a: 1 }).sort({ts:-1}).limit(10).explain("executionStats")
;

{
  explainVersion: 2,
...
  executionStats: {
    nReturned: Long('10'),
    executionTimeMillis: 0.286,
    executionStartAtTimeMillis: 0.256,
    totalDocsExamined: Long('10'),
    totalKeysExamined: Long('10'),
    executionStages: {
      stage: 'LIMIT',
      nReturned: Long('10'),
      executionTimeMillis: 0.286,
      executionStartAtTimeMillis: 0.256,
      totalDocsExamined: 10,
      totalKeysExamined: 10,
      numBlocksFromCache: 25,
      inputStage: {
        stage: 'FETCH',
        nReturned: Long('10'),
        executionTimeMillis: 0.267,
        executionStartAtTimeMillis: 0.253,
        totalKeysExamined: 10,
        numBlocksFromCache: 25,
        inputStage: {
          stage: 'IXSCAN',
          nReturned: Long('10'),
          executionTimeMillis: 0.267,
          executionStartAtTimeMillis: 0.253,
          indexName: 'a_1_ts_-1',
          totalKeysExamined: 10,
          numBlocksFromCache: 25
        }
      }
    }
  },
  ok: 1
}

It read only the minimum necessary to get the result: ten index entries (totalKeysExamined: 10) in the expected order and fetched only ten documents (totalDocsExamined: 10). This is the most efficient execution plan.

Comparing the Two RUM Index Definitions

I connect to PostgreSQL to describe the table that stores my collection documents (you will see later how I obtained the name):

postgres=# \d documentdb_data.documents_7

            Table "documentdb_data.documents_7"

     Column      |  Type  | Collation | Nullable | Default 
-----------------+--------+-----------+----------+---------
 shard_key_value | bigint |           | not null | 
 object_id       | bson   |           | not null | 
 document        | bson   |           | not null | 

Indexes:

    "collection_pk_7" PRIMARY KEY, btree (shard_key_value, object_id)

    "documents_rum_index_25" documentdb_extended_rum (document documentdb_extended_rum_catalog.bson_extended_rum_composite_path_ops (pathspec='[ "a", { "ts" : -1 } ]', tl='2691'))

Check constraints:

    "shard_key_value_check" CHECK (shard_key_value = '7'::bigint)

postgres=#

What was a standard RUM index in the previous post is now an extended RUM index:

Attribute Previous post Current test
Index Type documentdb_rum documentdb_extended_rum
Operator Class bson_rum_single_path_ops (×2) bson_extended_rum_composite_path_ops
Fields a (asc), ts (implicit asc) a (asc), ts (desc)
Sort Direction on ts Not specified / default ascending Explicitly -1 (descending)
Path Encoding Two separate path= entries Single JSON pathspec array

The extended RUM index acts as a sort-order-aware composite index, embedding the descending direction directly into the pathspec. Unlike the previous approach, which stored each path independently, this approach encodes all indexed fields as a single composite pathspec and generates a single composite index entry per document, preserving the relative ordering between fields. An index scan (RumOrderedScan) efficiently covers both filtering and sorting, eliminating the need for a separate Sort node in the PostgreSQL execution plan. This benefit is evident when executing the same query via the DocumentDB API in PostgreSQL:

postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "demo",
    "filter": { "a": 1 },
    "sort":   { "ts": -1 },
    "limit": 10
  }'::documentdb_core.bson
);
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.108..0.138 rows=10 loops=1)
   Output: document, (bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson))
   Buffers: shared hit=4
   ->  Index Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.104..0.117 rows=10 loops=1)
         Output: document, bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson)
         Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
         Order By: (collection.document |-<> '{ "ts" : { "$numberInt" : "-1" } }'::bson)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.453 ms
 Execution Time: 0.192 ms
(12 rows)

postgres=#

Note: I got the name of the internal table I described above from this execution plan, which uses the collection name in the query. The MongoDB API's explain() shows a MongoDB-compatible execution plan, and EXPLAIN in PostgreSQL shows the PostgreSQL version of it.

Comparison of Execution Plans

Here is how the new behavior with an ordered index scan compares to the previous bitmap scan.

Feature Ordered Index Scan Bitmap Index Scan
PostgreSQL Node Index Scan Bitmap Index Scan
Ordering Handled by sort direction in index Lost—requires a Sort node
Scan Type scanType: RumOrderedScan scanType: RumFastScan / RumRegularScan
Efficiency Supports early termination (LIMIT) Must scan all matching TIDs into bitmap
RUM Entry Point rumgettuple() rumgetbitmap()
Sort Step None — useSimpleScan = true rum_tuplesort_performsort() required
Memory Usage Low—one tuple at a time High—full TIDBitmap + sort state
Index Structure Used B-tree walk via orderStack Posting list / posting tree dump
Filter Evaluation Inline via ValidateIndexEntry() Post-collection in keyGetItem()
Seek Optimization Yes—advances queryKey as entries exhaust No
Multi-column Support Multi-column via composite pathspec Multi-column via separate entries
LIMIT benefit ✅ Full — stops after N rows ❌ None — bitmap built before LIMIT applies
Recheck Behavior xs_recheckorderby per tuple xs_recheck on bitmap result
Trigger Condition RumEnableOrderedOperatorScans + willSort + norderbys > 0 Default path

Index Only Scan

Other improvements are coming to Extended RUM, like Index Only Scan, currently supported for COUNT:

postgres=# explain (analyze, buffers, verbose, costs off)  
SELECT document FROM bson_aggregation_count(  
  'test',  
  '{  
    "count": "demo",  
    "query": { "a": 1 }  
  }'::documentdb_core.bson  
);
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=22.760..22.763 rows=1 loops=1)
   Output: documentdb_api_internal.bsoncommandcount(1)
   Buffers: shared hit=109
   ->  Index Only Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.089..14.292 rows=10000 loops=1)
         Output: collection.document
         Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
         Heap Fetches: 0
         Buffers: shared hit=109
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.441 ms
 Execution Time: 22.883 ms
(12 rows)

Index Only Scan will be supported in the future (see IsQueryValidForIndexOnlyScan)

Conclusion

A year ago, DocumentDB's RUM indexes had a significant limitation for pagination queries: even with the right compound index, the planner would fall back to a Bitmap Index Scan followed by a full Sort, meaning every matching document had to be collected and sorted before the first result could be returned. A LIMIT 10 query on 10,000 documents would examine all 10,000—defeating the purpose of the compound index.

With v0.112-0, this is fixed. The new documentdb_extended_rum index type, combined with the RumOrderedScan execution path, reduces the gap with native MongoDB behavior:

  • The index encodes sort direction directly in the pathspec ({ "ts": -1 })
  • The planner chooses an Index Scan instead of a Bitmap Index Scan
  • No Sort node appears in the plan
  • LIMIT 10 examines exactly 10 index entries and 10 documents

This is more than just a cosmetic change. In time-series queries—such as filtering on a low-cardinality field, sorting by timestamp descending, and retrieving the first page—the difference between the two plans ranges from O(result) to O(total size). For OLTP systems, pagination queries are common and need to be quick and reliable, since they show results to the user before the user takes any action, selects, refines filters, or moves to the next page.

This ordered scan is also essential for TTL indexes to efficiently identify expiration candidates.

The key ingredients that make this work together are visible from the DocumentDB open-source code:

  1. documentdb_extended_rum—composite pathspec with explicit sort direction
  2. bson_extended_rum_composite_path_ops — single operator class covering all fields
  3. RumOrderedScan — B-tree walk in index order via orderStack, bypassing rumgetbitmap()
  4. useSimpleScan — returns one tuple at a time, enabling true LIMIT pushdown
  5. RumAllowOrderByRawKeys — the GUC that enables this path, now on by default

This is all enabled by default:

postgres=# \dconfig *rum*order*
           List of configuration parameters
                  Parameter                   | Value
----------------------------------------------+-------
 documentdb_rum.enable_ordered_operator_scans | on
 documentdb_rum.forceRumOrderedIndexScan      | off
(2 rows)

In under a year, DocumentDB evolved from "RUM instead of GIN, but with the same pagination limitations" to "RUM with ordered scan, aligning more with MongoDB's IXSCAN behavior for ESR-pattern indexes". For developers implementing cursor-based pagination or queries with a selective filter and sorting on a time or sequence field, this marks the version at which it begins to function as expected.

June 05, 2026

Getting Started with pg_durable: Durable Workflows Inside PostgreSQL

Modern applications demand long‑lasting workflows that can withstand crashes, restarts, and failures, often relying on external platforms. pg_durable integrates durable workflow orchestration directly into PostgreSQL, enabling workflows to be created in SQL and reliably executed, monitored, inspected, and recovered from disruptions.

This tool is particularly useful for ETL processes, data pipelines, background tasks, scheduled jobs, long‑running business procedures, and internal workflow management.

To truly grasp its capabilities, the best approach is to try it yourself. I did so by building and running it locally, which provided practical insight.

Installing pg_durable

I installed Rust to compile the extension:

curl https://sh.rustup.rs -sSf | sh

. "$HOME/.cargo/env"

I installed the PostgreSQL build dependencies.

sudo dnf install -y openssl-devel pkg-config libicu-devel readline-devel flex

I installed pgrx for PostgreSQL 17:

cargo install cargo-pgrx --version 0.16.1 --locked

cargo pgrx init --pg17=download

export PG_VERSION=pg17

I enabled the extension preload library:

sed -e "/shared_preload_libraries/s/^.*$/shared_preload_libraries='pg_durable'/" \
    -i ~/.pgrx/data-17/postgresql.conf

I declared the superuser that I'll create to run the background worker:

cat >> ~/.pgrx/data-17/postgresql.conf <<<"pg_durable.worker_role='superworker'"

I cloned pg_durable from the Microsoft repository:

git clone https://github.com/microsoft/pg_durable.git

cd pg_durable

I built the extension, started PostgreSQL, and got a psql prompt:


cargo pgrx run

I checked the pg_durable configuration, connected to the pg_durable.database, created the pg_durable.worker_role user, and installed the extension:

pg_durable=# \dconfig pg_durable.*

          List of configuration parameters
               Parameter               |    Value
---------------------------------------+-------------
 pg_durable.database                   | postgres
 pg_durable.enable_superuser_instances | off
 pg_durable.execution_acquire_timeout  | 30
 pg_durable.max_duroxide_connections   | 10
 pg_durable.max_management_connections | 6
 pg_durable.max_user_connections       | 10
 pg_durable.worker_role                | superworker
(7 rows)

\c postgres

CREATE ROLE superworker SUPERUSER LOGIN;

CREATE EXTENSION pg_durable;

My PostgreSQL instance is now ready to use durable functions.

Hello World

I create a user that will execute workflows and grant access to pg_durable:

CREATE USER franck;

GRANT CREATE ON DATABASE postgres TO franck;

SELECT df.grant_usage('franck');

I reconnect as that user:

postgres=# \c postgres franck

You are now connected to the database "postgres" as user "franck".

postgres=>

I start a durable workflow:

postgres=> SELECT df.start(
           $$SELECT ''Hello, durable world!'' AS message$$
          );

  start
----------
 393f09e0
(1 row)

One second later, it is completed:

postgres=> SELECT df.status('393f09e0');

  status
-----------
 completed

(1 row)

The result can be fetched as a document with a list of rows:

postgres=> SELECT df.result('393f09e0');

                              result
------------------------------------------------------------------
 {"rows": [{"message": "Hello, durable world!"}], "row_count": 1}

(1 row)

I used the same "Hello, durable world!" example as in the HorizonDB durable functions documentation, because this is where pg_durable comes from: Microsoft provides it to PostgreSQL as an open-source extension.

Here is more information about this execution:

postgres=> \x
Expanded display is on.

postgres=> SELECT * FROM df.list_instances();

-[ RECORD 1 ]---+-------------------------------------------------------------
instance_id     | 393f09e0
label           |
function_name   | pg_durable::orchestration::execute-function-graph
status          | completed
execution_count | 1
output          | {"rows":[{"message":"Hello, durable world!"}],"row_count":1}

postgres=> \x
Expanded display is off.
postgres=>

The execution graph is basic, as there's a single operation:

postgres=> SELECT df.explain('393f09e0');

                                explain
------------------------------------------------------------------------
 Instance: 393f09e0                                                    +
 Status:    Completed                                                 +
 Output:   {"rows":[{"message":"Hello, durable world!"}],"row_count":1}+
                                                                       +
 SQL: SELECT 'Hello, durable world!' AS message 
(1 row)

There is a single node:

postgres=> \x
Expanded display is on.

postgres=> SELECT * from df.instance_nodes('393f09e0');

-[ RECORD 1 ]+-----------------------------------------------------------------
execution_id | 1
node_id      | 60e433bb
node_type    | SQL
query        | SELECT 'Hello, durable world!' AS message
result_name  |
left_node    |
right_node   |
status       | completed
result       | {"rows": [{"message": "Hello, durable world!"}], "row_count": 1}
updated_at   | 2026-06-04 16:36:38.594906+00

postgres=> \x
Expanded display is off.

It was executed once:


postgres=> SELECT * from df.instance_executions('393f09e0');

 execution_id |  status   | event_count | duration_ms |                            output
--------------+-----------+-------------+-------------+--------------------------------------------------------------
            1 | Completed |          14 |         692 | {"rows":[{"message":"Hello, durable world!"}],"row_count":1}
(1 row)

Here are the global metrics:

postgres=> SELECT * FROM df.metrics();

 total_instances | running_instances | completed_instances | failed_instances | total_executions | total_events
-----------------+-------------------+---------------------+------------------+------------------+--------------
               1 |                 0 |                   1 |                0 |                1 |           14
(1 row)

This simply looks like ADBA—Asynchronous Database Access, but it can do more. We will see that it can define complex workflows with a declarative syntax. The examples start the workflow immediately, but it can also wait for a signal or a specific time using a cron syntax:

Beyond a Single Query

While a simple SQL statement demonstrates the basics, pg_durable becomes much more powerful when orchestrating multiple operations.

The SQL workflow DSL supports:

Feature Operator
Variable capture =>
Sequential execution ~>
Parallel execution (wait) &
Parallel execution (race) `
Conditional branching {% raw %}?> and !>
Infinite loops @>

Here are simple table-free examples demonstrating pg_durable's key features. Let's start with a sequential execution (~>):

postgres=> SELECT df.start(
       'SELECT now() as step1' |=> 't1'
    ~> 'SELECT pg_sleep(5)'
    ~> 'SELECT now() as step2' |=> 't2'
    ~> 'SELECT now() as step3' |=> 't3',
    'sequential-timing'
) as i \gset

-- Check the results after five seconds
postgres=> SELECT df.result(:'i');

                                  result
---------------------------------------------------------------------------
 {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1}

(1 row)

postgres=> SELECT node_type, result_name, result, node_id, left_node, right_node FROM df.instance_nodes(:'i');

 node_type | result_name |                                  result                                   | node_id  | left_node | right_node
-----------+-------------+---------------------------------------------------------------------------+----------+-----------+------------
 THEN      | t3          | {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1} | 49c9c8b4 | 48fcec43  | 9f818758
 THEN      | t2          | {"rows": [{"step2": "2026-06-04T21:33:59.507283+00:00"}], "row_count": 1} | 48fcec43 | 348c1948  | c78f2200
 THEN      |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                            | 348c1948 | c7317ea8  | 200f8a52
 SQL       | t1          | {"rows": [{"step1": "2026-06-04T21:33:53.746030+00:00"}], "row_count": 1} | c7317ea8 |           |
 SQL       |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                            | 200f8a52 |           |
 SQL       |             | {"rows": [{"step2": "2026-06-04T21:33:59.507283+00:00"}], "row_count": 1} | c78f2200 |           |
 SQL       |             | {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1} | 9f818758 |           |

(7 rows)

I displayed the timestamps to show that steps 2 and 3 occur 5 seconds after step 1 because I used pg_sleep(5) in the sequential execution.

The node_type column indicates the type of operation each node represents in the function graph. Here, THEN denotes sequential execution: running the left child, then the right child, and SQL is the leaf node with no children, executing the SQL query.

The complete list of valid node types in pg_durable is:

node_type Description Operator/Function
SQL Execute SQL query Plain string or df.sql()
THEN Sequential execution ~> operator or df.seq()
IF Conditional branch ?> !> operators or df.if()
JOIN Parallel execution (wait all) & operator or df.join()
RACE Parallel execution (first wins) \ operator or df.race()
LOOP Infinite loop @> operator or df.loop()
BREAK Exit loop df.break()
SLEEP Pause execution df.sleep()
WAIT_SCHEDULE Wait for cron schedule df.wait_for_schedule()
HTTP Make HTTP request df.http()
SIGNAL Wait for external signal df.wait_for_signal()

We can execute steps in parallel (&):

SELECT df.start(
    ('SELECT now() as branch1' |=> 'b1' ~> 'SELECT pg_sleep(20)')
    &
    ('SELECT now() as branch2' |=> 'b2' ~> 'SELECT pg_sleep(10)')
    ~> 'SELECT now() as after_join' |=> 'final',
    'parallel-sleep'
) as i \gset

postgres=> SELECT node_type, result_name, result, node_id, left_node, right_node FROM df.instance_nodes(:'i');

 node_type | result_name |                                              result                                              | node_id  | left_node | right_node
-----------+-------------+--------------------------------------------------------------------------------------------------+----------+-----------+------------
 SQL       | b1          | {"rows": [{"branch1": "2026-06-04T21:53:47.870539+00:00"}], "row_count": 1}                      | 9d479c96 |           |
 SQL       | b2          | {"rows": [{"branch2": "2026-06-04T21:53:47.870606+00:00"}], "row_count": 1}                      | cb844cd8 |           |
 SQL       |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | ad8ec143 |           |
 THEN      |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | 1ff3adc8 | cb844cd8  | ad8ec143
 SQL       |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | 9db57c15 |           |
 THEN      |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | 5e4d8069 | 9d479c96  | 9db57c15
 JOIN      |             | [{"rows": [{"pg_sleep": null}], "row_count": 1}, {"rows": [{"pg_sleep": null}], "row_count": 1}] | 6f95c84d | 5e4d8069  | 1ff3adc8
 SQL       |             | {"rows": [{"after_join": "2026-06-04T21:54:08.752247+00:00"}], "row_count": 1}                   | d4be28af |           |
 THEN      | final       | {"rows": [{"after_join": "2026-06-04T21:54:08.752247+00:00"}], "row_count": 1}                   | df59e250 | 6f95c84d  | d4be28af

(9 rows)

Here, b1 and b2 started at the same time, in parallel. One branch completed after 20 seconds and the other after 10 seconds, so the final step started after 20 seconds, and the total duration was 20 seconds—it would have been more than 30 seconds if executed serially:

postgres=> SELECT * from df.instance_executions(:'i');

 execution_id |  status   | event_count | duration_ms |                                   output
--------------+-----------+-------------+-------------+----------------------------------------------------------------------------
            1 | Completed |          26 |       22107 | {"rows":[{"after_join":"2026-06-04T21:54:08.752247+00:00"}],"row_count":1}

(1 row)

Parallel nodes are used when we need the result from all operations. The next step awaits all. If we want to continue as soon as one branch gets a result, we can run a race, where the first one wins (|):

SELECT df.start(
    'SELECT pg_sleep(30)' | 'SELECT pg_sleep(90)' ~> 'SELECT ''fast'' as winner',
    'race-test'
) as i \gset

select usename,backend_start,query_start,wait_event,state,query,application_name from pg_stat_activity where usename=user and pid!=pg_backend_pid(
)
\watch 10


                                              Fri 05 Jun 2026 04:08:30 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep    | active | SELECT pg_sleep(30) |
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(2 rows)

                                              Fri 05 Jun 2026 04:08:40 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep    | active | SELECT pg_sleep(30) |
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(2 rows)

                                              Fri 05 Jun 2026 04:08:50 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep    | active | SELECT pg_sleep(30) |
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(2 rows)

                                              Fri 05 Jun 2026 04:09:00 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(1 row)

                                              Fri 05 Jun 2026 04:09:10 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(1 row)

                                              Fri 05 Jun 2026 04:09:20 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(1 row)

                                              Fri... (truncated)
                                    

June 04, 2026

Understanding how backups work in Amazon Aurora

In this post, we dive deep into the Aurora backup architecture, how it differs from Amazon RDS backups, and the Amazon CloudWatch metrics available to monitor your backup storage usage. Through detailed scenarios and visualizations, we demonstrate how workload patterns and retention periods impact backup costs. We also explore cross-Region backup options and share recommended practices to optimize your backup storage consumption.

Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication

Percona ClusterSync for MongoDB (PCSM) replicates data between MongoDB clusters to keep migrations with near-zero downtime. Prior to version 0.9.0 it required the source and target to run the same major version, which ruled out the lift-and-shift move most migrations want: going from an older major like 6.0 straight onto a newer one like 8.0. … Continued

The post Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication appeared first on Percona.

Supabase Series F

Supabase has raised a $500M Series F at a $10B pre-money valuation, led by GIC.

June 03, 2026

Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL using extensions (Bloom, pg_trgm, and pg_bigm)

In Part 1, Part 2, and Part 3 of this series, we explored PostgreSQL’s native indexes (B-tree, GIN, GiST, HASH, BRIN) and specialized extension-based index types (SP-GiST, btree_gin, btree_gist). In this post, we dive into three additional extensions: Bloom (for space-efficient multi-column equality filtering), pg_trgm (for fuzzy text matching and similarity searches), and pg_bigm (for full-text search optimized for Asian languages)

Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL using extensions (SP-GiST, Btree_Gin and Btree_Gist)

In this post, the third in the series, we dive into three extension-based index types: SP-GiST, btree_gin, and btree_gist. These are available in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL. PostgreSQL’s index infrastructure is extensible. Operator classes define how indexes behave for specific data types and operations. The SP-GiST, btree_gin, and btree_gist extensions take advantage of this extensibility to give you additional indexing strategies beyond the native options. We walk through when to use each extension, the data types they support, and practical examples that demonstrate their performance benefits.

Migrating data from Oracle to Amazon Aurora DSQL

This post walks through migrating data from an Oracle source to Amazon Aurora DSQL, using AWS DMS, Amazon S3, AWS Glue, and AWS Step Functions to create an automated, cost-effective migration pipeline suitable for enterprise-scale deployments.

Implementing real-time change data capture with Debezium for Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL

In this post, we demonstrate how to implement a production-ready CDC solution by using Amazon Aurora for PostgreSQL, Debezium connectors, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This solution captures database changes in real time and streams them to Kafka topics so that downstream consumers can process the same data for different business purposes.

June 02, 2026

ACM CAIS: Conference on AI and Agentic Systems

Last week, I traveled to San Jose to attend the ACM CAIS conference. On Day 0, I gave a short talk at the Supporting our AI Overlords (SAO) workshop. And yes, I promise to write a summary of our paper, "A Case for Simulation-Driven Resilience in Agent-First Data Systems" soon! 

To start with an overall impression of the conference: much of the work presented felt exploratory and anecdotal. Since the compound AI space is still so new, many work seemed to share on-the-ground best practices that worked for them rather than principled results. Some talks really leaned into the "agent, act like a senior engineer and don't make mistakes" vibe. This was especially apparent in the "Agent Skills Workshop". I am not saying this is a bad thing, I learned some valuable lessons from that workshop, which I'll share below.

CAIS defines the conference's scope broadly as "research on compound AI architectures, optimization, and deployment". Unfortunately, this broadness seemed to work against the main track. Attendance at the primary conference talks was low, and it often felt like attendees were there solely to present their own work rather than engage with others, likely because the subject matter was spread too thin.

In contrast, the workshops were highly focused, which led to much better engagement and active listening from the audience. Moving forward, I think CAIS would benefit greatly from narrowing its focus, maybe specifically focusing more on data systems and infrastructure in support of AI.

On that note, what happened to our collective attention span? CAIS limited paper presentations to 7 minutes with just 2 minutes for questions. This year, SIGMOD also shifted to 9-minute talks. Our own paper, "LeaseGuard: Raft Leases Done Right!", got a mere 9 minutes in the spotlight after Jesse traveled all the way to Bangalore to present it. I've even heard that USENIX Security is down to 3-minute talks now. Should we maybe consider slowing down? After all, isn't attention all we need?


Agent Skills workshop (Day 0)

In the first talk, Graham Neubig discussed OpenHands, their open-source AI developer agent platform that's getting a lot of traction in highly regulated fields like finance and healthcare to speed up software development. A big theme of his talk was skill induction: "the process of inducing/verifying programmatic or prompt-based capabilities through testing/evaluation to enable single-agent systems complete complex long-horizon tasks". Through leveraging offline human-annotated examples or online user feedback, skill induction kicks off a rapid learning phase. In web navigation tests like WebArena, an agent's success rate can ramp up dramatically over a small number of trials before settling into a robust repeatable skill set.

Later on, Kanav Garg (co-founder of Core Automation) walked us through the lifecycle of a Reinforcement Learning (RL) environment. He defined it as a continuous loop made up of an actionable prompt, a starting state, a runtime environment (like a Docker container), configuration, and a reward system. The main takeaway here was that successful RL needs careful difficulty calibration and precise reward shaping to keep agents from hacking the reward system. To get this right, engineers have to actually look at the agent's traces instead of blindly trusting that the numbers on a chart are going up. Kanav also said that data environments are living projects with a shelf life of at most two months, and this means continuous learning from task failures and automated data pipelines are far more important/effective than relying on static expensive human data.


SAO: Supporting our AI overlords (Day 0)

The core theme of the SAO workshop was that agents are rapidly becoming both the primary users and the builders of data systems, and this shift is creating a vast new design space demanding entirely new abstractions. The workshop featured three keynote speakers and was incredibly well attended. With no seating left, people were standing in the doorway just to watch the talks.

Aaron Katz from Clickhouse gave a nice talk on this transition from human to agent users. He said that agents aren't just querying data anymore; they are actively provisioning services, so they need their own identities and budgets. Because agents drive such massive concurrency and require interactive latencies, traditional per-query pricing models are becoming way too punitive. To keep up, platforms have to adapt to headless API-first experiences. He said that Clickhouse is launching the ability to build agents directly inside the database for lower latency and for blending structured and full-text search.

Next, Andy Pavlo talked about databases being the "final boss" for agents. Check out his slides. It is classic Andy humor and style, though a Wu-Tang Clan reference was sorely missing this time. Andy focused heavily on automated database tuning and development, comparing their Proto-X tuning agent with ChatGPT's Lambda-tune. While Proto-X gets better optimization results, it takes 12 hours to train per database, whereas ChatGPT is fast (14 minutes) but performs terribly. To bridge this gap, they adopted LLMs to boost automatic tuning algorithms by leveraging prior history. In the second (and shorter) part of the talk, Andy also noted that while coding agents are making progress, they still completely fail at building complex database components like query optimizers, which require much more support. Although they have successfully "vibed" and manually verified a couple of optimization passes (like DPHyp and Unnesting v2), blindly accepting an LLM's output is fraught with problems because verifying query plan equivalence is notoriously difficult (despite solver efforts from UW, Berkeley, and Microsoft). It seems like coding agents love to add special-case code when an optimizer actually needs to be as general as possible.

Finally, Nikita Shamgunov (ex-Neon, now Databricks) discussed the infrastructure needed for agents, categorizing this down into three pillars: state, compute, and middleware. He argued that because agents speed up the dev loop by 1000x, true serverless architectures are now absolutely critical to avoid the insane costs of overprovisioning. He also described Neon's architecture, highlighting how separating compute from storage allows for instant database branching for agents using microVMs. While he shared some genuinely interesting technical points, the presentation itself was pretty dry and felt like it was missing a clear focus.

The workshop ended with a panel. The organizers tried to spark a debate, but there wasn't much disagreement. The general consensus was that while traditional OLTP and OLAP boundaries will remain, agents will increasingly be the ones conducting tasks that span seamlessly across them. After the panel, we headed to a MongoDB-sponsored happy hour, where the workshop audience had more time to have relaxed conversations about the breakneck hellscape transition our industry is currently going through.


Wednesday (Day 1)

Since CAIS is an AI systems conference, the focus is more on building the systems that surround AI models rather training individual AI models. The dominant theme this year seemed to be [multi-]agent architectures, coordination protocols, and workflow design. The emphasis is on composition: how do we organize these agents, manage their contexts, coordinate their interactions, and handle their lifecycles?

A second major trend is the growing focus on day-to-day operations. Entire sessions were dedicated to evaluation, trace analysis, failure detection, routing, and cost optimization. With so many papers covering efficiency, scheduling, and economic tradeoffs, it is clear that the industry is shifting from just maximizing capability to maximizing capability per dollar.

My shortlist from Day 1:

Here a brief overview of the TraceFix paper, which is most relevant to my interests.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix tackles the coordination failures that happen when multiple LLM agents try to work together concurrently on shared tasks. It is crucial to understand that this paper is not about generating classical distributed computing algorithms; rather, it focuses on scaffolding the formal "rules of engagement" for multi-agent LLM systems. When agents collaborate on domain tasks requiring fine-grained mutual exclusion over shared mutable resources (such as editing the same codebase or scheduling access to a simulated lab instrument) they naturally run into interleaving-sensitive bugs like deadlocks, missed handshakes, and race conditions. TraceFix solves this by isolating the coordination layer, formally verifying the protocol for how agents use shared locks and message channels, and then allowing the agents to remain completely autonomous in executing their actual domain-specific work.

To achieve this, TraceFix introduces a verification-first pipeline where an orchestration agent first synthesizes a declarative protocol topology and writes the behavioral logic in PlusCal. Before any agent takes action, the TLA+ model checker (TLC) exhaustively searches this proposed protocol for safety violations and feeds concrete counterexample traces back to the agent for iterative repair until the code is fully verified. At runtime, these verified process bodies are compiled into agent prompts, and a monitor strictly enforces the approved topology by rejecting any invalid or out-of-bounds coordination attempts. TraceFix is evaluated across a benchmark of 48 complex tasks. The system achieved a 100% verification success rate within four repair iterations and significantly improved runtime task completion, proving that formal model-checker feedback can effectively eliminate the deadlocks and resource clashes in multi-agent workflows.

Counterintuitively, the introduction of this formal coordination scaffolding did not bog the system down with unnecessary overhead, but rather, it significantly accelerated execution. By structurally preventing agents from colliding over resources (a problem that caused a massive 61.2% contention rate and endless retry loops in unstructured chat-only setups) TraceFix eliminated wasted trial-and-error steps. As a result, the verified topology-monitored protocol executed in an average of just 93 seconds using 62 tool-call steps. This vastly outperformed both the chaotic chat-only baseline (229 seconds and 203 steps) and a sequential single-agent baseline (~304 seconds).


Thursday (Day 2)

On Thursday, the papers were more about making agent systems manageable. Many of these papers are about memory, planning, governance, verification, security, and control. Researchers are increasingly treating agents as long-running software systems that require architecture, interfaces, observability, safety mechanisms, and operational discipline. This resembles the evolution of distributed systems from clever protocols toward operational concerns such as consistency models, monitoring, fault tolerance, and standards.

A second trend was the emergence of agents as compound systems rather than monolithic models. Instead of expecting a single model to solve everything, people arenow  building ecosystems of interacting components with explicit roles. The vibe here is more like systems engineering for AI. 

My shortlist:


Friday (Day 3)

In the Friday program, multiple papers study agent societies, debate, persuasion, socialization, consensus formation, and safety in multi-agent environments. Also, several papers focus on practical concerns such as routing requests across models, optimizing energy consumption, serving multi-agent systems, evaluating agent frameworks, and integrating AI into production workflows. 

My shortlist:

Extend MySQL Using Rust

Discover how the VillageSQL Rust SDK brings native, memory-safe Rust extension development to MySQL, bypassing C++ entirely.

June 01, 2026

Accelerating developer productivity in the agentic AI era with Amazon Aurora PostgreSQL

In this post, you learn how Amazon Aurora PostgreSQL-Compatible Edition accelerates developer productivity in the agentic AI era. We explore three core design convictions: meet developers where they work, absorb workload variability, and grow with the application from prototype to global scale.

Filter on Children, Sort by Parent: One-to-Many Compound Index Strategies in PostgreSQL

Before looking at PostgreSQL, I'll first introduce the problem by showing how MongoDB's document model and multi-key indexes handle compound indexes across a one-to-many relationship. Then we will see how to approximate this in PostgreSQL by denormalizing on the "many" side or the "one" side, how to maintain consistency with cascade foreign keys or triggers, and how to accelerate filtering with sorted pagination using B-tree, GIN, or RUM indexes.

Multi-key indexes in MongoDB allow compound indexes on one-to-many relationships. For example, the following index covers fields from both the children (child_value) and the parent (parent_value) when children are embedded as an array (in a children field) within the parent document:

db.parent.createIndex(
 { "children.child_value": 1, parent_value: 1 }
);

Such an index can efficiently support an equality filter on child_value combined with a sort and pagination on parent_value. For instance, finding the top 10 parents, ordered by parent_value, where at least one child has a child_value of 0.9:

db.parent.find(
 { "children.child_value": 0.9 }
).sort({ "parent_value": 1 }).limit(10);

It is important to understand the semantics of predicates on embedded arrays. This query does not return the top 10 children with this value — a parent may have multiple matching children, but we get one result per parent. This differs from a relational join (or $unwind in a MongoDB aggregation pipeline), which would produce one row per child. Here, the predicate tests only for the existence of at least one child with the specified value. The result set contains distinct parents, not (parent, child) pairs.

A multi-key index stores multiple index entries per parent document — one for each distinct child_value in the array — paired with the parent's parent_value. The number of index entries per parent equals the number of distinct child values, which can be less than the total number of children when multiple children share the same value. For example, given these documents:

{
  _id: 24,
  parent_value: "Y",
  children: [
    { child_num: 1, child_value: 0.9 },
    { child_num: 2, child_value: 0.8 }
  ]
},
{
  _id: 42,
  parent_value: "X",
  children: [
    { child_num: 1, child_value: 0.5 },
    { child_num: 2, child_value: 0.9 },
    { child_num: 3, child_value: 0.9 },
    { child_num: 4, child_value: 0.7 }
  ]
}

The index contains three entries for the second document, because the two children with child_value: 0.9 produce only one index entry:

  (0.5, "X", 42) → record ID of { _id: 42 }
  (0.7, "X", 42) → record ID of { _id: 42 }
  (0.8, "Y", 24) → record ID of { _id: 24 }
  (0.9, "X", 42) → record ID of { _id: 42 }
  (0.9, "Y", 24) → record ID of { _id: 24 }

The index scan for {child_value: 0.9} finds two entries, already ordered by the second field of the compound key, parent_value. MongoDB can read them in order — "X" before "Y" — and fetch the corresponding documents without an additional sort. With a .limit(10), the scan stops after 10 distinct parents are found, making the query efficient regardless of how many total documents exist in the collection.

There is no direct equivalent of this in PostgreSQL. If the one-to-many relationship is stored as JSONB, GIN indexes can locate documents containing {child_value: 0.9} but do not preserve ordering on other fields. The query must fetch all matching documents, sort them on parent_value, and only then return the top 10. If the one-to-many relationship is normalized into two tables, no single index can span columns from both tables — a join is required first, and the sort optimization depends on the join strategy chosen by the planner.

Normalized One-to-Many

Now that the problem is defined, let's see how we can work around this limitation in a normalized model:

drop table if exists parent, child cascade;

create table parent (
 parent_id    bigserial primary key,
 parent_value float
);

create table child (
 child_id     bigserial primary key,
 parent_id    int references parent (parent_id),
 child_value  float
);

insert into parent (parent_value)
 select random() from generate_series(1, 1000)
;

insert into child (parent_id, child_value)
 select parent_id, round(random()::numeric, 4) from parent, generate_series(1, 1000)
;

create index on parent (parent_value);
create index on child (child_value);

vacuum analyze parent, child;

The following query finds the top 10 parents, by parent_value, that have at least one child with a child_value of 0.9999:

-- explain (analyze, buffers, verbose, costs off)
select *
from parent p
where exists (
  select 1 from child c
  where c.parent_id = p.parent_id
  and c.child_value = 0.9999
)
order by parent_value
limit 10;

 parent_id |     parent_value
-----------+----------------------
       956 | 0.013518349069161273
       789 | 0.022813950892476287
       307 | 0.024740344416860793
       833 | 0.041118650516225985
       493 |  0.07587347477845374
       402 |  0.08400127026866477
       256 |   0.0980166832909124
       342 |  0.11175505348300807
       846 |  0.12047992766758941
       590 |  0.16840580135419425
(10 rows)

The execution plan shows that:

  1. The index on child_value was scanned to find all index entries with the expected value — here rows=99.0 — which are then used to locate the corresponding blocks in the child table (this is a Bitmap Index Scan).
  2. The Bitmap Heap Scan on the child table returned all children with child_value = 0.9999 (there are 95 if we count the actual matches — a bitmap scan can have some false positives). The results were stored in memory (Materialize).
  3. The index on parent_value was scanned to retrieve parents in the desired order, checking for each whether its parent_id appears in the materialized buffer. It had to read rows=169.00 parents before finding rows=10.00 that satisfied the condition, with lookups on the materialized result (loops=169).
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.366..2.272 rows=10.00 loops=1)
   Output: p.parent_id, p.parent_value
   Buffers: shared hit=245
   ->  Nested Loop Semi Join (actual time=0.365..2.270 rows=10.00 loops=1)
         Output: p.parent_id, p.parent_value
         Join Filter: (p.parent_id = c.parent_id)
         Rows Removed by Join Filter: 16246
         Buffers: shared hit=245
         ->  Index Scan using parent_parent_value_idx on public.parent p (actual time=0.012..0.080 rows=169.00 loops=1)
               Output: p.parent_id, p.parent_value
               Index Searches: 1
               Buffers: shared hit=142
         ->  Materialize (actual time=0.000..0.006 rows=96.19 loops=169)
               Output: c.parent_id
               Storage: Memory  Maximum Storage: 20kB
               Buffers: shared hit=103
               ->  Bitmap Heap Scan on public.child c (actual time=0.032..0.151 rows=99.00 loops=1)
                     Output: c.parent_id
                     Recheck Cond: (c.child_value = '0.9999'::double precision)
                     Heap Blocks: exact=99
                     Buffers: shared hit=103
                     ->  Bitmap Index Scan on child_child_value_idx (actual time=0.016..0.016 rows=99.00 loops=1)
                           Index Cond: (c.child_value = '0.9999'::double precision)
                           Index Searches: 1
                           Buffers: shared hit=4

The other index on parent_value can serve the sort but not the filter on child_value. In both cases, many more rows must be read to combine the two conditions after a Nested Loop Join. That is the price of normalization.

In a SQL database, if we want a single index to cover both fields, they must exist in the same table.

Denormalize into the "Many" Side (Children)

A common solution is to duplicate the parent value into the child table and create a compound index:

alter table child add parent_value float;

update child
  set parent_value = parent.parent_value
  from parent
  where child.parent_id = parent.parent_id;

alter table child alter parent_value set not null;

create index on child (child_value, parent_value);

The query can now operate on the child table only:

-- explain (analyze, buffers, verbose, costs off)
select distinct parent_id, parent_value
 from child
 where child_value = 0.9999
 order by parent_value
 limit 10;

 parent_id |     parent_value
-----------+----------------------
       956 | 0.013518349069161273
       789 | 0.022813950892476287
       307 | 0.024740344416860793
       833 | 0.041118650516225985
       493 |  0.07587347477845374
       402 |  0.08400127026866477
       256 |   0.0980166832909124
       342 |  0.11175505348300807
       846 |  0.12047992766758941
       590 |  0.16840580135419425
(10 rows)

The execution plan shows a single index scan which covers:

  • the filter: Index Cond: child_value = 0.9999
  • the order: Presorted Key: child.parent_value
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.067..0.072 rows=10.00 loops=1)
   Output: parent_id, parent_value
   Buffers: shared hit=36
   ->  Unique (actual time=0.066..0.070 rows=10.00 loops=1)
         Output: parent_id, parent_value
         Buffers: shared hit=36
         ->  Incremental Sort (actual time=0.066..0.066 rows=11.00 loops=1)
               Output: parent_id, parent_value
               Sort Key: child.parent_value, child.parent_id
               Presorted Key: child.parent_value
               Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
               Buffers: shared hit=36
               ->  Index Scan using child_child_value_parent_value_idx on public.child (actual time=0.014..0.054 rows=33.00 loops=1)
                     Output: parent_id, parent_value
                     Index Cond: (child.child_value = '0.9999'::double precision)
                     Index Searches: 1
                     Buffers: shared hit=36
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.132 ms
 Execution Time: 0.088 ms

There is an additional sort because the result may contain duplicate parents when they have multiple children with the same value, but it is an incremental sort so it does not have to read all children — here only rows=33.00 before producing rows=10.00. This gets the top ten parents by parent_value without reading more child rows than necessary.

This is nearly optimal. The new index resembles a multi-key index, except that it may contain duplicate entries for the same parent — unnecessary for this query, but potentially useful for others that need the detail of individual children. Like the multi-key index, it must be maintained when the parent value changes. We have two solutions: referential integrity constraint and triggers.

Update with Cascade Constraint (Recommended)

One way to keep this column automatically updated is declarative. We add an additional foreign key that enforces that the parent_value in the child table matches the one in the parent, and declare it on update cascade:

alter table parent add unique (parent_id, parent_value);

alter table child
 add constraint fk_child_parent
 foreign key (parent_id, parent_value)
 references parent (parent_id, parent_value)
 on update cascade;

Unfortunately, PostgreSQL cannot use the same index for two unique constraints that share a prefix, so two indexes are maintained on the parent table. This solution is intended for cases where updates are not frequent.

I test the performance overhead when updating one parent:

\timing on
explain (analyze, buffers, costs off, verbose on)
 update parent
 set parent_value = parent_value - 0.000000000000000042
 where parent_id = 789
;

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Update on public.parent (actual time=0.063..0.064 rows=0.00 loops=1)
   Buffers: shared hit=16
   ->  Index Scan using parent_parent_id_parent_value_key on public.parent (actual time=0.022..0.025 rows=1.00 loops=1)
         Output: (parent_value - '4.2e-17'::double precision), ctid
         Index Cond: (parent.parent_id = 789)
         Index Searches: 1
         Buffers: shared hit=5
 Planning Time: 0.080 ms
 Trigger RI_ConstraintTrigger_a_18059 for constraint fk_child_parent on parent: time=37.950 calls=1
 Trigger RI_ConstraintTrigger_c_18061 for constraint fk_child_parent on child: time=6.767 calls=1000
 Execution Time: 44.874 ms

There is an overhead: 37.9 milliseconds to cascade the update to a thousand of children (parent action), and 6.7 milliseconds to validate each _c_hild row (child check).

This is not problematic for occasional updates. Of course, if we update all parents, the row-by-row cascading constraints can be slow:

\timing on
explain (analyze, buffers, costs off, verbose on)
 update parent
 set parent_value = parent_value - 0.000000000000000042
;
vacuum parent, child;

                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Update on public.parent (actual time=5.944..5.945 rows=0.00 loops=1)
   Buffers: shared hit=10975 dirtied=1
   ->  Seq Scan on public.parent (actual time=0.010..0.156 rows=1000.00 loops=1)
         Output: (parent_value - '4.2e-17'::double precision), ctid
         Buffers: shared hit=11
 Planning:
   Buffers: shared hit=86
 Planning Time: 0.235 ms
 Trigger RI_ConstraintTrigger_a_18059 for constraint fk_child_parent on parent: time=42659.655 calls=506
 Trigger RI_ConstraintTrigger_c_18061 for constraint fk_child_parent on child: time=3398.155 calls=506000
 Execution Time: 46103.577 ms
(11 rows)

Time: 46105.778 ms (00:46.106)

Obviously this is not a solution for frequently updated columns. However, the parent side of a one-to-many relationship often consists of static references, slowly changing dimensions, or immutable event headers, making this overhead acceptable. I also tested with the constraint set as DEFERRABLE. The elapsed time is similar, but you do not see it in the EXPLAIN output because the validation occurs at commit time.

Update with Trigger (Not Recommended)

The cascade constraint has the advantage of being declarative, but it executes internally as a trigger. We can instead create a custom trigger that cascades the update without re-validating the foreign key — assuming that nobody updates the denormalized column directly, and that it is set correctly on insert.

-- function to update the parent value in the children
create or replace function sync_parent_value() returns trigger as $$
begin
  update child set parent_value = new.parent_value
  where parent_id = new.parent_id;
  return new;
end;
$$ language plpgsql;

-- trigger raised for each update
create trigger trg_sync_parent_value
after update of parent_value on parent
for each row
when (old.parent_value is distinct from new.parent_value)
execute function sync_parent_value();

-- drop the previous constraint
alter table child drop constraint fk_child_parent;
alter table parent drop constraint parent_parent_id_parent_value_key;

I test the update of all parents:

\timing on
explain (analyze, buffers, costs off, verbose on)
 update parent
 set parent_value = parent_value - 0.000000000000000042
;
vacuum parent, child;

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Update on public.parent (actual time=4.547..4.547 rows=0.00 loops=1)
   Buffers: shared hit=8453 dirtied=26
   ->  Seq Scan on public.parent (actual time=0.009..0.124 rows=1000.00 loops=1)
         Output: (parent_value - '4.2e-17'::double precision), ctid
         Buffers: shared hit=11
 Planning:
   Buffers: shared hit=8 dirtied=2
 Planning Time: 0.108 ms
 Trigger trg_sync_parent_value: time=42719.154 calls=506
 Execution Time: 42724.042 ms
(10 rows)

Time: 42729.150 ms (00:42.729)

postgres=# vacuum parent, child;

VACUUM
Time: 176.585 ms

The performance advantage over the cascade constraint is minimal, and the risk is higher because parent_value in the child table could be updated independently without any check that it still matches the parent. We would also need an insert trigger to set the correct value, and an update trigger to prevent direct modifications. Ultimately, the cascade constraint solution is strongly preferable.

Denormalize into the "One" Side (Parent)

If we want to more closely mimic a MongoDB multi-key index, denormalizing into the child table is not an exact equivalent because there is no deduplication. It has its advantages — it can serve queries that need individual child rows — but a structure analogous to a multi-key index should be stored on the parent side, where each parent holds a summary of its children's values.

One approach is to add an array of child values to the parent table so that we can filter on it:

-- add array column
alter table parent add child_values float[];

-- populate from child
update parent p set child_values = (
  select array_agg(distinct child_value)
  from child where parent_id = p.parent_id
);

A trigger must maintain the array when children change:

create or replace function sync_child_values_array() returns trigger as $$
begin
  update parent set child_values = (
    select array_agg(distinct child_value)
    from child where parent_id = coalesce(new.parent_id, old.parent_id)
  ) where parent_id = coalesce(new.parent_id, old.parent_id);