a curated list of database news from authoritative sources

June 13, 2026

Our MongoDB TLA+ Workshop

Shortly after I joined MongoDB research, we ran a TLA+ workshop. It was a two-day ordeal. We had a 1.5 days of instruction on TLA+ and syntax, after which we tried to help people get started with modeling. People liked learning about TLA+ on the first day, but except for a person or two, we didn't get anyone onboarded with TLA+ modeling. It was too much to offload on people and ask them to level up in a short time frame.

Well, two years after that first workshop, on May 11th, 2026, we ran a second edition of this workshop with one very big difference. What is that big difference, you ask?

AI!

AI makes formal methods not only necessary, but also more feasible and easier!

Jesse, Will, and I planned this workshop to be aggressively short. We provide under two hours of instruction, then everyone starts modeling hands-on. We act as TAs and help people as they go. The AI takes care of the syntax problem for TLA+, and also helps with modeling. We just need to teach people enough to read and make sense of a model, and Claude Code would take care of the rest. You have a problem? Ask Claude. Still stuck? Ask us.


Take away: Green Path to 100 true fans

Ok, I will cut to the chase. The workshop was a great success!

My biggest takeaway from it is this: I finally see a green path to 100 true fans. We had an order of magnitude greater conversion rate compared to the first workshop. Almost all of the attendees got started with modeling, and almost all of those modeled a part of an actual work project they are currently on.

As I've written before, with AI tools it is easy to be "technically fearless." People who had been heavy Claude users did a great job prompting and interacting with Claude, and hit the ground running. Some people were still hesitant to take the first step, but when one of us sat down with them, they quickly got over that initial friction and were off and modeling within minutes.


Workshop feedback report

We asked attendees to fill out a feedback survey afterward, and the results were encouraging. Across the board, attendees rated the workshop a 5 out of 5 and strongly agreed that TLA+ is a valuable tool for ensuring system correctness. Several attendees reported modeling real, current work projects during the hands-on session (including a concurrent rewrite of a core subsystem, a data materialization pipeline, and a distributed storage component) and found the exercise revealed liveness issues and design ambiguities they hadn't previously articulated. One attendee described how modeling gave them confidence in a design they would otherwise have had to validate the hard way through implementation.

On the question of AI integration, the feedback was uniformly positive. Multiple respondents noted that Claude Code removed syntax as a barrier entirely, letting them focus on the modeling itself rather than fighting the language. One participant said: "I have tried to write TLA+ pre-LLMs, and it was prohibitively difficult." Likelihood-to-use scores ranged from 3 to 5, with most landing at 4 or 5. The most common suggestions for improvement centered on adoption strategy: how to make TLA+ part of the default design workflow for engineers who aren't yet in the habit, perhaps through a champion system similar to what MongoDB has done with AI tooling.


TLA+ skills file

At the workshop, we brainstormed what a best-practices guide for AI-assisted TLA+ modeling should look like. Here is what we landed on.

No PlusCal! Stick to TLA+ directly. Keep it simple, easy to read, guarded commands. This is something I changed my mind on about three years ago, and I'm now happy to see that with AI assistance, there is no need to detour through PlusCal. As a mnemonic: /\ is like ∩ (intersection), so it means "and"; \/ is like ∪ (union), so it means "or."

Always include a .cfg file with the same name as the spec.

Before writing a single line, interview the user:

  • Do you have a design doc?
  • Do you have source code?
  • Which property do you care about?
  • Are you after correctness (safety), liveness, or both?

(This is a bit tongue-in-cheek.) Pretend you are Lamport. Be brutally minimalist, and only add things if you can show they are relevant.

Start from the most abstract model possible. Only go more concrete if you need to, and consult the user when you do.

Initially aim to stay within 50–200 lines. A model that stays small forces you to think clearly about what matters.

Prevent cognitive debt: explain each action with a short comment, so the reader can follow the intent without reverse-engineering it.


Finally, this is what we taught


Motivational overview

I opened with a motivational overview of modeling, drawing on my previous posts. The core message was this: as AI models write more of our code, formal modeling for concurrent and distributed systems is becoming more valuable than ever.

TLA+ is not just a verification tool. It is a design accelerator. It shifts your perspective: forces you to think about the system globally, lets you explore alternative designs quickly and safely before any code exists. It is faster to fix a conceptual model than to whack-a-mole corner cases in implementation. And thanks to LLMs, learning TLA+ syntax is no longer the bottleneck. AI and TLA+ amplify each other.

The workflow I described:

  1. Design first in TLA+ to produce a verified unambiguous specification.
  2. Hand the verified spec to an LLM to generate the corresponding code.
  3. Use TLA+ execution traces to derive model-based conformance tests.
  4. Use TLA+ invariants to derive property-based tests for the generated code.

I reinforced the message that AI + TLA+ is a massive win for the future of software development.

I also introduced a distinction I find underappreciated: the difference between modularity abstraction and modeling abstraction. Modularity abstraction hides implementation details through encapsulation and layering (the traditional approach taught in CS curricula). Modeling abstraction does something different: it reduces a system to its essential behavioral elements by stripping away everything irrelevant to the properties you want to study. While modularity abstraction aspires to conceal concurrency, modeling abstraction deliberately exposes it to maximize safe parallelism. The key skill in formal methods is knowing how to model abstractly (deciding on what to throw away) which is fundamentally different from the modularity thinking that programmers already have. 


Jesse's talk

Jesse gave an introduction to TLA+ grounded in its computational model pitched for the LLM era. He started from the transition model: a TLA+ spec defines a state machine through three pieces: an initial-state predicate describing valid starting configurations, a set of actions (state-transition rules) that define legal moves, and fairness constraints that ensure the system makes progress rather than stalling forever. Each state is a complete assignment of values to the variables, and a behavior is a sequence of such states. Jesse walked through the coffee beans example to show how guards (preconditions) and next-state assignments work together.

He then explained how TLC, the TLA+ model checker, verifies specs by doing breadth-first search through the reachable state graph, exhaustively exploring all possible behaviors that the spec allows. This is what makes TLA+ different from testing: you are not sampling executions, you are covering all of them (up to the model bounds). Temporal operators like □ ("always") and ◇ ("eventually") let you express properties about entire infinite behaviors, not just individual states. The punchline Jesse made was again this: In the LLM era, you don't need to memorize the syntax to do this. You only need to understand what you are asking the model to check. 


Will's talk

Will Schultz walked through a realistic data cloning specification, defining key properties formally and showing how to think about correctness in a domain most of the attendees work in daily. He then demoed Spectacle, his tool for interactive spec exploration and visualization — letting attendees see the state space as a navigable graph rather than a wall of output. He also covered the current state of AI-assisted modeling workflows with TLA+, grounding the "Claude can help" claim in actual practice.


Hands-on modeling

Then we did the hands-on portion. Attendees picked a current work project, and we worked with them as TAs. That's where the real learning happened of course!


Closing: TLA+ mental models

I closed the workshop by pointing people at where to go from here, and tried to give them a map of the conceptual territory for leveling up.

Effective TLA+ modeling rests on seven interconnected mental models. The foundation is aggressive abstraction: knowing what to discard rather than what to include. Paired with that is embracing TLA+'s global shared-memory fiction as a powerful reasoning tool rather than fighting it as unrealistic. From there, three refinement techniques: localizing guards to stable predicates (which makes slow models fast), deriving tight invariants that reflect genuine understanding of the system, and splitting atomic actions deliberately to expose real-world concurrency. These help you deepen a model until it tells you something you didn't already know.

Beyond correctness verification, these mental models serve a critical communication function. A well-written TLA+ spec is precise and executable documentation. It keeps distributed teams aligned on design intent in a way that English prose cannot. By exploring protocol alternatives through stepwise refinement and varying atomicity levels, teams can discover optimizations they wouldn't find through implementation alone.

HammerDB tproc-c on a large server, Postgres 14 to 19 beta1

This has results for HammerDB tproc-c on a large server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

tl;dr

    • There are small regressions in versions 16, 17 and 18
    • NOPM usually improves a small amount in 19 beta1 relative to 18
    Builds, configuration and hardware

    I compiled Postgres versions from source: 14.22, 14.23, 15.17, 15.18, 16.13, 16.14, 17.9, 17.10, 18.0, 18.1, 18.2, 18.3, 18.4 and 19 beta1.

    I used a 48-core server from Hetzner
    • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 24.04
    Postgres configuration files:
    • prior to version 18 the config file is named conf.diff.cx10a50g_c32r128 (x10a_c32r128) and is here for versions 14, 15, 16 and 17.
    • for Postgres 18 and 19 I used conf.diff.cx10b_c32r128 (x10b_c32r128) with io_method=sync to be similar to the config used for versions 14 through 17.
    Benchmark

    The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

    The benchmark was run for several workloads:
    • vu=10, wh=1000 - 10 virtual users, 1000 warehouses
    • vu=20, wh=1000 - 20 virtual users, 1000 warehouses
    • vu=40, wh=1000 - 40 virtual users, 1000 warehouses
    • vu=10, wh=2000 - 10 virtual users, 2000 warehouses
    • vu=20, wh=2000 - 20 virtual users, 2000 warehouses
    • vu=40, wh=2000 - 40 virtual users, 2000 warehouses
    • vu=10, wh=4000 - 10 virtual users, 4000 warehouses
    • vu=20, wh=4000 - 20 virtual users, 4000 warehouses
    • vu=40, wh=4000 - 40 virtual users, 4000 warehouses
    The wh=1000 workloads are less heavy on IO. The wh=4000 workloads are more heavy on IO.

    The benchmark for Postgres is run by a variant of this script which depends on scripts here.
    • stored procedures are enabled
    • partitioning is used because the warehouse count is >= 1000
    • a 5 minute rampup is used
    • then performance is measured for 60 minutes
    Results

    My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

    On the charts that follow y-axis does not start at 0 to improve readability at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
    (NOPM for some-version / NOPM for base-version)

    The base version is Postgres 14.22.

    A spreadsheet with absolute and relative values for NOPM is here.

    Results: vu=10, wh=1000

    Summary:

    • There are small regressions in versions 16, 17 and 18 while NOPM improves is 19 beta1

    Results: vu=20, wh=1000

    Summary:

    • There are small regressions in versions 16, 17 and 18 while NOPM improves is 19 beta1

    Results: vu=40, wh=1000

    Summary:

    • There are small regressions in versions 17 and 18 while NOPM improves is 19 beta1

    Results: vu=10, wh=2000

    Summary:

    • There are small regressions in version 18 while NOPM improves is 19 beta1

    Results: vu=20, wh=2000

    Summary:

    • There are small regressions in versions 16, 17 and 18 while NOPM improves is 19 beta1

    Results: vu=40, wh=2000

    Summary:

    • There are small regressions in versions 16, 17 and 18 while NOPM improves is 19 beta1
    • There is no result for 18.1 because of a bug in my test scripts

    Results: vu=10, wh=4000

    Summary:

    • There are small regressions in versions 16, 17 and 18 while NOPM improves is 19 beta1

    Results: vu=20, wh=4000

    Summary:

    • There are small regressions in versions 16, 17 and 18

    Results: vu=40, wh=4000

    Summary:

    • There are small regressions in versions 16, 17 and 18 while NOPM improves is 19 beta1


    June 12, 2026

    HorizonDB preview: automate a reproducible lab with ARM

    Azure HorizonDB is a new database service on Azure. It is PostgreSQL-compatible from a developer perspective, but its storage layer differs to support built-in high availability and scalability.

    Since it's currently in preview, APIs might evolve and defaults could change. This is a great opportunity to explore how well PostgreSQL works with your existing applications.

    If you're setting up a lab for those tests, you don’t want to navigate the UI each time. You need something that can be recreated quickly and dismantled once finished.

    Azure Resource Manager (ARM) is well suited for that.

    Login to Azure

    I installed the Azure CLI and logged in on my laptop:

    
    az login
    
    

    This opens the browser, lets me sign in interactively, and lets me select the tenant and subscription.

    If I'm already logged in, I can get my subscription ID:

    
    az account show --query id -o tsv
    
    

    From there, everything can be automated via the CLI.

    ARM template for HorizonDB preview

    I created a parameterized template, in hdb-template.json:

    
    {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "apiVersion":         { "type": "string", "defaultValue": "2026-01-20-preview" },
            "clusterName":        { "type": "string" },
            "location":           { "type": "string" },
            "poolName":           { "defaultValue": "DefaultPool", "type": "string" },
            "administratorLogin": { "type": "string" },
            "administratorLoginPassword": { "type": "securestring" },
            "tags":               { "type": "object", "defaultValue": {} },
            "firewallRules":      { "type": "array", "defaultValue": [] },
            "guid":               { "type": "string", "defaultValue": "[newGuid()]" }
        },
        "resources": [
            {
                "type": "Microsoft.HorizonDB/clusters",
                "apiVersion": "[parameters('apiVersion')]",
                "name": "[parameters('clusterName')]",
                "location": "[parameters('location')]",
                "properties": {
                    "createMode":                 "Default",
                    "version":                    "17",
                    "zonePlacementPolicy":        "BestEffort",
                    "replicaCount":               2,
                    "vCores":                     2,
                    "administratorLogin":         "[parameters('administratorLogin')]",
                    "administratorLoginPassword": "[parameters('administratorLoginPassword')]",
                    "network":                    { "publicNetworkAccess": "Enabled" },
                    "aiModelManagement":          1
                },
                "tags":                           "[parameters('tags')]"
            },
            {
                "condition": "[greater(length(parameters('firewallRules')), 0)]",
                "type": "Microsoft.Resources/deployments",
                "apiVersion": "2019-08-01",
                "name": "[concat('firewallRules-', parameters('guid'), '-', copyIndex())]",
                "copy": {
                    "count": "[if(greater(length(parameters('firewallRules')), 0), length(parameters('firewallRules')), 1)]",
                    "mode": "Serial",
                    "name": "firewallRulesIterator"
                },
                "dependsOn": [
                    "[concat('Microsoft.HorizonDB/clusters/', parameters('clusterName'))]"
                ],
                "properties": {
                    "mode": "Incremental",
                    "template": {
                        "$schema": "http://schema.management.azure.com/schemas/2014-04-01-preview/deploymentTemplate.json#",
                        "contentVersion": "1.0.0.0",
                        "resources": [
                            {
                                "type": "Microsoft.HorizonDB/clusters/pools/firewallRules",
                                "name": "[concat(parameters('clusterName'),'/',parameters('poolName'),'/',parameters('firewallRules')[copyIndex()].name)]",
                                "apiVersion": "[parameters('apiVersion')]",
                                "properties": {
                                    "Description": "[parameters('firewallRules')[copyIndex()].name]",
                                    "StartIpAddress": "[parameters('firewallRules')[copyIndex()].startIpAddress]",
                                    "EndIpAddress": "[parameters('firewallRules')[copyIndex()].endIpAddress]"
                                }
                            }
                        ]
                    }
                }
            }
        ]
    }
    

    This template defines the HorizonDB cluster and its configuration. I generated it the first time I manually created a cluster via the UI portal. I include the firewall rules in a loop to define multiple client IPs.

    Keeping firewall rules in the template makes the lab fully reproducible, including network access, without requiring additional CLI commands.

    HorizonDB parameters

    In hdb-parameters.json I put all my parameters used by the ARM templates:

    {
      "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
      "contentVersion": "1.0.0.0",
      "parameters": {
        "clusterName":                { "value": "franckpachot-hdb" },
        "location":                   { "value": "swedencentral" },
        "administratorLogin":         { "value": "franck" },
        "tags":          { "value":   { "Owner": "franckpachot" } },
        "firewallRules": {
          "value": [
            { "name": "home",  "startIpAddress": "192.0.2.10",    "endIpAddress": "192.0.2.20" },
            { "name": "shell", "startIpAddress": "52.178.13.135", "endIpAddress": "52.178.13.135" }
          ]
        }
      }
    }
    

    This file contains all environment-specific values: cluster name, region, allowed IP ranges, and admin user. You must update them. The regions where HorizonDB preview is available are listed here: https://learn.microsoft.com/en-us/azure/horizondb/overview#azure-regions

    I do not store the password in the parameters file and pass it at deployment time.

    Resource group

    I create a resource group that will contain everything:

    
    az group create --name franckpachot-rg --location swedencentral
    
    

    This makes it easy to identify and delete everything related to the lab.

    You will change the location to the region you chose above, and the resource group name will be used in the next command.

    Deploy the cluster

    A single command deploys the cluster from the template:

    
    az deployment group create --resource-group franckpachot-rg --template-file hdb-template.json --parameters @hdb-parameters.json --parameters administratorLoginPassword=PostgreSQL@Azure
    
    

    The template defines the infrastructure. The parameters define the environment, and the CLI command combines both.

    The deployment takes about 15 minutes.

    Retrieve the endpoint and connect

    The endpoint is visible in the Azure portal, but it can also be retrieved directly from the CLI:

    
    az resource show --resource-type Microsoft.HorizonDB/clusters --name franckpachot-hdb --resource-group franckpachot-rg --query properties.fullyQualifiedDomainName -o tsv
    
    

    I set my PostgreSQL connection variables using the credentials used when creating it, and the hostname of the endpoint:

    export PGPORT=5432
    export PGUSER=franck
    export PGPASSWORD=PostgreSQL@Azure
    export PGDATABASE=postgres
    export PGHOST=
    
    

    You can then connect with psql or any PostgreSQL client, including the PostgreSQL extension for Visual Studio Code

    Let's check the version banner and a few parameters:

    
    $ psql
    psql (16.2, server 17.9 (Azure HorizonDB (70f3b593ec7)(release)))
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
    Type "help" for help.
    
    postgres=> select version();
    
                                                            version
    -----------------------------------------------------------------------------------------------------------------------
     PostgreSQL 17.9 (Azure HorizonDB (70f3b593ec7)(release)) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit
    
    (1 row)
    
    postgres=> \dconfig azure*
    
                    List of configuration parameters
    
                     Parameter                  |       Value
    --------------------------------------------+-------------------
     azure.accepted_password_auth_method        | md5,scram-sha-256
     azure.enable_temp_tablespaces_on_local_ssd | on
     azure.extensions                           |
     azure.fabric_mirror_enabled                | off
     azure.service_principal_id                 |
     azure.service_principal_tenant_id          |
    (6 rows)
    
    postgres=> \dconfig shared*
    
                                            List of configuration parameters
    
                Parameter             |                                    Value
    ----------------------------------+------------------------------------------------------------------------------
     shared_buffers                   | 11241MB
     shared_memory_size               | 11776MB
     shared_memory_size_in_huge_pages | 5888
     shared_memory_type               | mmap
     shared_preload_libraries         | azure, orion_storage, pg_availability, pg_qs, pgms_stats, pgms_wait_sampling
    
    (5 rows)
    
    postgres=> show effective_cache_size;
     effective_cache_size
    ----------------------
     11241MB
    (1 row)
    
    postgres=> show full_page_writes;
     full_page_writes
    ------------------
     off
    (1 row)
    
    

    An interesting aspect is the memory setup. On a 2 vCore, 16 GiB RAM instance, shared_buffers is allocated about 11 GB, which exceeds typical PostgreSQL recommendations.

    Traditionally, PostgreSQL utilizes both its buffer cache and the OS filesystem cache. In HorizonDB with disaggregated storage, the PostgreSQL fork avoids double buffering, and shared_buffers and effective_cache_size are closely aligned.

    Additionally, full_page_writes is turned off. This is uncommon in PostgreSQL, but justified here because the storage layer already ensures data consistency without torn pages, and has a huge impact on performance.

    To use extensions, create a parameter group and set azure.extensions. Managed services must control PostgreSQL extensibility because extensions can gain elevated system privileges.

    Cleanup

    The advantage of the resource group is that you can delete everything with a single command:

    
    az group delete --name franckpachot-rg --yes
    
    

    This removes the cluster and all associated resources.

    Final thoughts

    HorizonDB is currently in preview, and this is a perfect chance to test it!

    With ARM and Azure CLI:

    • You can set up a lab in just 15 minutes
    • Test how PostgreSQL applications work with it
    • Explore various extensions and features
    • And easily clean everything up when you're done

    This way, you can focus on what really matters: understanding how this PostgreSQL‑compatible platform performs with your applications and workloads. Please share your comments and feedback.

    More Fake Mastodon Signups

    On 2026-05-30, the Mastodon server I run started getting a flood of implausible signup requests. We think these are likely bots because our server rules and signup form clearly tell users to talk about their interest in gay leather. Most people sign up with something like “42 yo levis enthusiast into piss” or “Hi! I’m Snoofles the pup! I love bones and being a good girl for Madam!”. Instead these bios are either friendly, generic text…

    I value online communities that are moderated with care and that encourage constructive interaction. I plan to use my account to share interesting stuff, connect with others who share similar interests, and contribute to discussions following the server’s rules and culture.

    Or vaguely machine-flavored weirdness:

    Pick the Antarctic path and commit. Custodian inspired by Brendan Fraser, active with Axe Throwing in Indianapolis.

    Or text taken directly from existing Fediverse bios, like this one:

    Ukraineophile, Save Western Culture and Civilization from russian kleptocracy Trying to make Putin’s bum as irritable as possible.

    Often there’s nonsensical usernames (e.g. “ypy_asi”) or email addresses that don’t line up, or we get similar language from multiple email domains. I suspect there’s a few actors here, because the tactics don’t line up exactly, but it’s weird that they all started in the last few weeks.

    A few of the email domains involved:

    • forge45k.io.vn
    • shortweb.live
    • vtx.pbhak.dev
    • funnyfail.app
    • cyberlinkhub.com
    • ptncereio.com
    • sugarloafstudios.net
    • a6nc1sl.jejes.de
    • tiksofi.uk
    • nowtopzen.com
    • nanopools.info
    • tmail.lt
    • tmail.mx
    • phugruphy.com
    • sphinx.launders.money
    • datamzone.com
    • deisgn-ai.work.gd
    • haibabon.com
    • ai46boh.jejes.de
    • initwag.com
    • mailba.uk
    • compservmail.com
    • ff.zero34.qd.je

    Many look to be transient domains that no longer resolve. Two (tmail.mx and tmail.lt) belong to a service which provides temporary email addresses to bypass email verification. For pbhak.dev, I think it might be someone’s personal machine which… maybe got compromised? Others, like jejes.de, look like abandoned wordpress installs; I assume they’re compromised too.

    I’m seeing a fair number of European IPs for browsers–looks like VPNs and cloud providers.

    Last year most of our spammers were from a single company. I’ve approved one of these inauthentic accounts, and am going to keep an eye on it to see what happens.

    Write-heavy sysbench tests, a large server, modern Postgres and MySQL

    This has results for modern Postgres and MySQL using write-heavy tests from sysbench and a large server. I think there are regressions in Postgres that arrive in some of versions 16, 17, 18 and 19 beta1 but I am far from certain and this blog post is just another step in my journey to figure that out.

    tl;dr

    • Postgres suffers a lot from throughput variation while MySQL+InnoDB does not
    • InnoDB gets much better average throughput on 6 of 10 tests, similar throughput one one and then Postgres does better on 3 of 10 tests
    • For tests from which I provided vmstat and iostat results, Postgres does more write IO per operation. In some cases InnoDB uses more CPU, in other cases it does not.

    Builds, configuration and hardware

    I compiled:
    • Postgres from source for versions 15.17, 16.13, 17.9 and 18.3.
    • MySQL from source for version 8.4.7
    I used a 48-core server from Hetzner
    • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
    • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
    • 128G RAM
    • Ubuntu 24.04
    Configuration files for Postgres:
    • the config file is named conf.diff.cx10a_c32r128 (x10a_c32r128) and is here for versions 15, 16 and 17.
    • for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) which is as close as possible to the Postgres 17 config and uses io_method=sync
    Benchmark

    I used sysbench and my usage is explained here. Normally I run 32 of the 42 microbenchmarks listed in that blog post using tables small enough to be cached by the DBMS. Most test only one type of SQL statement.

    The tests can be called microbenchmarks. They are very synthetic. But microbenchmarks also make it easy to understand which types of SQL statements have great or lousy performance. Performance testing benefits from a variety of workloads -- both more and less synthetic.

    But I did things differently here:
    • I only run the write-heavy tests (to save time)
    • The tables are larger than memory and cannot be cached
    • Each test (microbenchmark) is run for 2 hours when I normally run each for 15 minutes
    • After each test a vacuum is done
    The purpose is to search for regressions from new CPU overhead and mutex contention related to MVCC GC (vacuum for Postgres, purge for InnoDB).

    Results

    I provide charts below with relative QPS. The relative QPS is the following:
    (QPS for some version) / (QPS for Postgres 15.17)
    When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than base version.

    The per-test results from vmstat and iostat can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

    Results: writes

    The table below has relative QPS for Postgres 16 to 19 and then InnoDB all relative to the throughput for Postgres 15.17. Columns 1 to 4 have results for Postgres and the numbers in yellow highlight the tests where there is a regression in Postgres. For column 5 (MySQL with InnoDB) the numbers in yellow and red indicate tests where InnoDB's throughput is less than Postgres. And then the numbers in green indicate tests where InnoDB's throughput is much larger than Postgres.

    Note that when relative QPS (rQPS) is 0.90 then throughput dropped by ~10%.

    Summary:
    • throughput for Postgres drops after version 15.17. I don't know yet whether this is a regression.
    • throughput for InnoDB is much better than Postgres in 6 of 10 tests, similar in one test, and much worse in 3 of 10 tests.
    The sections that follow this one have more detail on results from the update-index, update-zipf tests and insert tests.

    Relative to: Postgres 15.17
    col-1 : Postgres 16.13
    col-2 : Postgres 17.9
    col-3 : Postgres 18.3
    col-4 : Postgres 19 beta1
    col-5 : MySQL 8.4.7

    col-1   col-2   col-3   col-4   col-5
    0.94    0.97    0.98    1.02    1.88    update-inlist
    0.94    0.90    0.88    0.92    1.43    update-index
    0.91    0.86    0.87    0.92    1.19    update-nonindex
    0.96    0.99    0.98    0.98    0.71    update-one
    0.92    0.83    0.81    0.85    0.93    update-zipf
    0.95    0.93    0.84    0.81    1.71    write-only
    0.94    0.94    0.90    0.92    1.14    read-write_range=10
    0.95    0.96    0.95    0.95    1.93    read-write_range=100
    0.89    0.82    0.80    0.84    1.01    delete
    1.05    1.05    1.01    1.10    0.53    insert

    Results: update-index

    Summary:
    • Postgres suffers from too much variance
    • Average throughput is ~1.55X larger for InnoDB than for Postgres
    • Per operation, Postgres does ~1.20X more write IO (KB written) to storage than InnoDB
    • Per operation, InnoDB uses more CPU and does more context switches. While autovacuum was enabled and was likely running during the test, my measurements exclude the manual vacuum done at the end of each test.
    iostat, vmstat normalized by operation rate
    r/s     rMB/s   w/s     wMB/s   r/o     rKB/o   wKB/o   o/s     dbms
    35503.0 373.7   58795.7 1345.1  1.375   14.824  53.351  25817   PG 19b1
    33140.6 517.8   53449.6 1735.3  0.827   13.226  44.326  40090   MySQL 8.4.7

    cs/s    cpu/s   cs/o    cpu/o   dbms
    176167  14.4     6.824  .000557 PG 19b1
    661395  41.9    16.498  .001046 MySQL 8.4.7

    Results: update-zipf

    Summary:
    • Postgres suffers from too much variance
    • Average throughput is ~1.09X larger for InnoDB than for Postgres
    • Per operation, Postgres does ~1.30X more write IO (KB written) to storage than InnoDB
    • Per operation, InnoDB uses more CPU and does more context switches. While autovacuum was enabled and was likely running during the test, my measurements exclude the manual vacuum done at the end of each test.
    iostat, vmstat normalized by operation rate
    r/s     rMB/s   w/s     wMB/s   r/o     rKB/o   wKB/o   o/s     dbms
    55595.5 620.7   64264.4 1352.3  0.622   7.110   15.490  89396   PG 19b1
    27405.9 428.2   37465.1 1133.6  0.282   4.508   11.933  97270   MySQL 8.4.7

    cs/s    cpu/s   cs/o    cpu/o   dbms
    424392  27.2     4.747  .000304 PG 19b1
    1213054 44.5    12.471  .000458 MySQL 8.4.7

    Results: insert

    Summary:
    • Postgres suffers from too much variance
    • Average throughput is ~2.06X larger for Postgres than for InnoDB
    • Per operation, Postgres does ~1.67X more write IO (KB written) to storage than InnoDB
    • Per operation, Postgres uses more CPU and does more context switches. This is the opposite of what happens above for update-index and update-zipf.

    iostat, vmstat normalized by operation rate
    r/s     rMB/s   w/s     wMB/s   r/o     rKB/o   wKB/o   o/s     dbms
    1615.5  56.0    15321.7 1170.9  0.007   0.242   5.059   237009  PG 19b1
    3.6     0.1     8275.4  340.7   0.000   0.000   3.029   115155  MySQL 8.4.7

    cs/s    cpu/s   cs/o    cpu/o   dbms
    1214563 46.0    10.547  .000399 PG 19b1
    800827  50.5     3.379  .000213 MySQL 8.4.7













    June 11, 2026

    The insert benchmark on a small server, cached workload : Postgres 19 beta1

    This has results for Postgres versions 19 beta1, 18.4 and 17.10 with the Insert Benchmark on a small server using a cached and CPU-bound workload.

    Postgres continues to be boring in a good way. It is hard to find performance regressions.

     tl;dr

    • I don't see regressions here in 19 beta1
    • I see some improvements here in 19 beta1
      • index create (l.x) is faster but the step is short-running so I don't assume much from this
      • the write-heavy steps (l.i1, l.i2) are faster and CPU overhead is lower in 19 beta1, I hope to explain why the CPU overhead is lower, but that waits for another day.

    Builds, configuration and hardware

    I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 19 beta1, 18.4 and 17.10.

    The server is an Beelink SER7 with a Ryzen 7 7840HS CPU with 8 cores and AMD SMT disabled, 32G of RAM. Storage is one SSD for the OS and an NVMe SSD for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.

    For 17.10 the config file is named conf.diff.cx10a_c8r32 (cx10a) and is here.

    For Postgres 18 and 19 the config file is conf.diff.cx10b_c8r32 (cx10b) which is as similar as possible to the config for version 17.

    The Benchmark

    The benchmark is explained here and is run with 1 client.

    The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 3600 seconds each.

    The benchmark steps are:

    • l.i0
      • insert 30M rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts 40M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted per table.
      • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results

    The performance summary with charts is here.

    This table lists relative QPS per benchmark step and relative QPS is:
        (QPS for my version / QPS for Postgres 17.10)

    The background in the table cells is blue for big improvements and yellow for regressions. There are no regressions here. 

    The index create (l.x) step is much faster in 19.10. I usually ignore results on this step but I am curious if something was done in 19.10 to improve index create. But this step takes between 1 and 2 minutes and I am reluctant to assume too much from a short running step.

    For the write-heavy steps (l.i1, l.i2)
    • there are small improvements in 18.4
    • there are large improvements in 19 beta1. The CPU overhead is lower in 19 beta1 compared to 17.10, ~15% lower for l.i1 and ~10% lower for l.i2. Hopefully I can explain why. But the lower CPU overhead might explain the improved performance in 19 beta1. Some of the metrics from iostat and vmstat are here.
    dbmsl.i0l.xl.i1l.i2qr100qp100qr500qp500qr1000qp1000
    17.101.001.001.001.001.001.001.001.001.001.00
    18.41.001.031.021.070.991.001.001.001.011.00
    19 beta11.011.161.231.220.991.000.990.991.001.00

    Prompts Don't Enforce Business Rules. Databases Do.

    CedarDB co-founder Lukas Vogel on why AI agents are forcing a long-overdue conversation about where enforcement actually lives. AI agents can now write and execute SQL. That changes everything about how you think about data security.

    For decades, permission checks lived in the application layer. That worked fine when applications were deterministic. But agents generate queries dynamically, decide what actions to take, and move fast. If enforcement lives above the database, eventually something gets through that shouldn’t.

    The answer isn’t a better prompt. It’s moving enforcement back to where it belongs: the database itself.

    Lukas sat down with Shane Snider at Data Center Knowledge to talk through why this problem is real, why it’s not actually new, and what it means for teams building agentic applications today.

    “Prompts don’t enforce business rules. Databases do.”

    “If you want to give an agent access to the bottom of the stack, permissions have to live at the bottom of the stack.”

    CedarDB was built from scratch for exactly this. No legacy assumptions, no workarounds. Row-level security and role-based access control enforced at the database layer, transactions and analytics in one engine, and a Postgres interface your agents already know how to use.

    Read the full interview at Data Center Knowledge.

    Want to see how CedarDB handles enforcement for agentic workloads? Book a demo.

    June 10, 2026

    How to migrate from Oracle to Amazon Aurora PostgreSQL using AWS CloudFormation (Part 1)

    In this post, you learn how to use AWS DMS Schema Conversion to migrate Oracle schemas to PostgreSQL. AWS DMS Schema Conversion converts database schemas and code objects to formats compatible with your target database. You also learn how to use AWS DMS to migrate data to Amazon Aurora PostgreSQL-Compatible Edition.

    Postgres 19 beta1 vs sysbench on a small server

    This has results from sysbench on a small server with Postgres 19 beta1, 18.4 and 17.10. Sysbench is run with low concurrency (1 thread) and a cached database. The purpose is to search for changes in performance, often from new CPU overheads.

    tl;dr

    • 19beta1, 18.4 and 17.10 have mostly similar performance
    • There might be small regressions (about 2%) from 17.10 to 19beta1 but my tests are not good at spotting that.
    • 19beta1 is much faster on one test (read-only-count) thanks to a new query plan

    Builds, configuration and hardware

    I compiled Postgres from source. 

    The server is a Beelink SER7 7840HS with an AMD Ryzen 7 7840HS CPU and 32G RAM. Storage uses an NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. 

    The config files are here for 17.10, 18.4 and 19 beta1.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks 
    and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

    The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

    The benchmark is run with 1 client, 1 table and 50M rows. The purpose is to search for CPU regressions.

    Results

    The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

    I provide charts below with relative QPS (rQPS). The relative QPS is the following:
    (QPS for some version) / (QPS for base version)
    When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because it shows how much HW is used per request.

    Here, base version is Postgres 17.10 and some version is either 18.4 or 19 beta1.

    I describe performance changes (changes to relative QPS) in terms of basis points. Performance changes by one basis point when the difference in rQPS is 0.01. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.

    Results: point queries

    Summary:
    • 19beta1 is better than 17.10 by ~3 basis points for most tests
    • 19beta1 is slightly better than 18.4
    Relative to Postgres 17.10
    col-1 : Postgres 18.4
    col-2 : Postgres 19 beta1

    col-1   col-2
    1.01    1.00    hot-points
    0.98    0.97    point-query
    1.01    1.03    points-covered-pk
    1.00    1.04    points-covered-si
    1.00    1.02    points-notcovered-pk
    1.00    1.03    points-notcovered-si
    0.99    0.99    random-points_range=10
    1.00    1.03    random-points_range=100
    1.01    1.03    random-points_range=1000

    Results: range queries without aggregation

    Summary:
    • 19beta1 is worse than 17.10 by ~3 basis points in 4 of 5 tests
    • 19beta1 is better than 17.10 by 5 basis points in the scan test
    • 19beta1 and 18.4 are similar except for the scan test where 19beta1 did better
    Relative to Postgres 17.10
    col-1 : Postgres 18.4
    col-2 : Postgres 19 beta1

    col-1   col-2
    0.98    0.97    range-covered-pk
    0.96    0.96    range-covered-si
    0.98    0.98    range-notcovered-pk
    0.99    0.99    range-notcovered-si
    0.95    1.05    scan

    Results: range queries with aggregation

    Summary:
    • 19beta1 is worse than than 17.10 on two tests
    • 19beta1 is better than 17.10 on five tests
    • 19beta1 and 17.10 are the same on one test
    • 19beta1 is ~2.5X better than 17.10 on the read-only-count test
    • 19beta1 and 18.4 have similar results except for the read-only-count test
    The query for the read-only-count test appears to have a different plan in 19beta1 and that might explain the ~2.5X speedup. In 17.10 and 18.4 it gets Index Scan while in 19beta1 it gets Index Only Scan.

    Query plans for the read-only-count test ...

    For 17.10
    explain SELECT count(c) FROM sbtest1 WHERE id BETWEEN 17704460 AND 17705459
            Aggregate  (cost=1424.42..1424.43 rows=1 width=8)
              ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.56..1421.93 rows=996 width=121)
                    Index Cond: ((id >= 17704460) AND (id <= 17705459))

    For 18.4
    explain SELECT count(c) FROM sbtest1 WHERE id BETWEEN 11575278 AND 11576277
            Aggregate  (cost=1310.09..1310.10 rows=1 width=8)
              ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.56..1307.89 rows=882 width=121)
                    Index Cond: ((id >= 11575278) AND (id <= 11576277))

    For 19beta1
    explain SELECT count(c) FROM sbtest1 WHERE id BETWEEN 11686801 AND 11687800
            Aggregate  (cost=32.32..32.33 rows=1 width=8)
              ->  Index Only Scan using sbtest1_pkey on sbtest1  (cost=0.56..30.13 rows=878 width=0)
                    Index Cond: ((id >= 11686801) AND (id <= 11687800))

    Relative to Postgres 17.10
    col-1 : Postgres 18.4
    col-2 : Postgres 19 beta1

    col-1   col-2
    1.04    2.47    read-only-count
    1.00    0.99    read-only-distinct
    1.02    1.01    read-only-order
    0.98    0.97    read-only_range=10
    1.00    1.00    read-only_range=100
    1.02    1.03    read-only_range=10000
    1.09    1.09    read-only-simple
    1.01    1.01    read-only-sum

    Results: writes

    Summary:
    • 19beta1 is worse than 17.10 by 2 to 5 basis points
    • 18.4 is worse than 17.10 by 2 to 3 basis points
    Relative to Postgres 17.10
    col-1 : Postgres 18.4
    col-2 : Postgres 19 beta1

    col-1   col-2
    0.97    0.97    delete
    0.99    0.96    insert
    0.98    0.97    read-write_range=10
    0.98    0.98    read-write_range=100
    0.96    0.95    update-index
    0.99    0.97    update-inlist
    0.97    0.96    update-nonindex
    0.97    0.95    update-one
    0.97    0.95    update-zipf
    0.98    0.97    write-only

    June 09, 2026

    Building type-safe applications with Drizzle ORM in Aurora DSQL

    In this post, you’ll build a working veterinary clinic CLI application that demonstrates production-ready patterns for connecting Drizzle ORM to Aurora DSQL. By the end, you’ll have a running app with one-to-many and many-to-many relationships, and the patterns you learn (UUID primary keys, application-level relationships, and a custom migration runner) work with other TypeScript ORMs on Aurora DSQL too.

    Pagination patterns in Amazon Aurora DSQL

    In this post, you learn three pagination techniques for Aurora DSQL: OFFSET/LIMIT, cursor-based (keyset), and temporal. You implement keyset pagination in SQL and Python, build it into an API layer, optimize with composite indexes, handle batch processing within the 3,000-row transaction limit, and avoid five common anti-patterns. By the end, you can choose the right pagination method for your workload and implement it with confidence.

    Your MySQL Schema is Already a REST API

    Instantly expose your MySQL schema as a REST API with vsql-rest. Get zero-code CRUD, automatic filtering, and JWT auth directly in the database.

    Percona Operator for MySQL (PXC) 1.20.0: Automatic Storage Resizing, TLS Certificate Rotation, and ARM64 Support

    Percona Operator for MySQL PXC 1.20.0 is out today, and it addresses three long-requested operational headaches: storage that grows on its own before it fills up, TLS certificates that rotate without cluster downtime, and images that run natively on ARM64. Disk-full incidents on PXC clusters often arrive at 2 AM when monitoring alerts fire, and … Continued

    The post Percona Operator for MySQL (PXC) 1.20.0: Automatic Storage Resizing, TLS Certificate Rotation, and ARM64 Support appeared first on Percona.

    RUM—Storing More in the Index

    This series of posts traces the evolution from GIN to RUM to Extended RUM, showing how a single architectural idea—store more in the index to do less work at query time—unlocks major performance improvements at each step.

    RUM was created by Alexander Korotkov, Oleg Bartunov, and Teodor Sigaev at Postgres Professional. It started as a direct response to the GIN limitations that users kept hitting in production — particularly in full-text search scenarios where ranking and ordering dominate query time.

    The name is a nod: GIN (the drink) → RUM (a stronger drink). The index is stronger, too.

    RUM's innovation is exactly one architectural change: each entry in a posting list can carry an additional datum alongside the TID:

    GIN posting list entry:  [TID]
    RUM posting list entry:  [TID, addInfo]
    

    A 2016 thread on pgsql-general captures the exact pain point. Andreas Krogh was building a web-based email system — millions of messages, users expecting millisecond response — and wanted a single index that could:

    1. Match full-text search terms (fts_all @@ to_tsquery(...))
    2. Filter by folder (folder_id = ANY(ARRAY[2,3]))
    3. Sort by timestamp (ORDER BY sent DESC)
    4. Stop early (LIMIT 101)

    With GIN, steps 1 and 2 worked (via btree_gin), but step 3 always required a separate Sort node — the index couldn't deliver results in timestamp order. The planner had to scan all matching TIDs, fetch every heap tuple, and sort them. For a mailbox with 100K matches, that's unacceptable.

    Oleg Bartunov's response was direct: "We are [working] hard on our internal version of rum." RUM was designed to solve exactly this class of problem.

    RUM extends the GIN framework with an order_by_attach option:

    CREATE INDEX idx ON documents USING rum (tsv rum_tsvector_addon_ops, created_at)
      WITH (attach = 'created_at', to = 'tsv', order_by_attach = true);
    
    

    This tells RUM: "for every TID in the tsv posting list, also store the corresponding created_at value." The posting list entries become:

    
    "postgresql" → [(0,1, 2024-01-15), (3,7, 2024-02-20), (5,2, 2024-03-01)]
    
    

    Now the posting list is sorted by addInfo (the timestamp), not by the physical order of TIDs.

    What RUM Enables Over GIN:

    1. Ordered results without sorting. The index walks the posting list in addInfo order and stops after N entries. No Sort node, no full scan.
    2. Distance-based ordering with <=>. RUM introduces distance operators (<=>, <=|, |=>). The <=> computes ABS(a - b) — nearest-first retrieval. The <=| and |=> variants restrict to one direction. For full-text ranking, RUM's <=> operator has a built-in ranking function that combines ts_rank and ts_rank_cd semantics and handles OR queries better than either function alone.
    3. Depth-first traversal: first results immediately. Unlike GIN's bitmap approach (collect all TIDs, then access the heap), RUM performs a depth-first traversal. It can return first results immediately — critical for LIMIT queries.
    4. Phrase search without recheck. RUM's rum_tsvector_ops stores word positions as addInfo. The index verifies phrase adjacency during the scan — no heap rechecks needed.

    Here are the operator classes:

    • rum_tsvector_ops stores lexemes with positional information, supports ordering (<=>) by relevance
    • rum_tsvector_hash_ops stores hashed lexemes with positions, supports ordering (<=>) with no prefix search
    • rum_tsvector_addon_ops stores lexemes + any attached column, supports ordering (<=>) on attached column
    • rum_anyarray_ops stores array elements + array length, supports ordering (<=>) by similarity
    • rum_anyarray_addon_ops stores array elements + attached column, supports ordering (<=>) on attached column
    • rum_timestamp_ops stores scalar values, supports ordering (<=>)

    Using the same articles table from the previous post, I add a RUM index that adds the published timestamp to tsv:

    postgres=# CREATE EXTENSION IF NOT EXISTS rum;
    
    postgres=# CREATE INDEX idx_rum_addon ON articles
        USING rum (tsv rum_tsvector_addon_ops, published)
        WITH (attach = 'published', to = 'tsv');
    
    

    I searched for the five most recent articles by publication date using the same text query as in the previous post.

    postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
    SELECT id, title, published,
           published <=> '2020-06-01'::timestamp AS distance
    FROM articles
    WHERE tsv @@ to_tsquery('simple', 'postgresql & article')
    ORDER BY published <=> '2020-06-01'::timestamp
    LIMIT 5;
    
    QUERY PLAN                                                  
    --------------------------------------------------------------------------------------------------------------
     Limit (actual time=89.101..89.552 rows=5 loops=1)
       Output: id, title, published, ((published <=> '2020-06-01 00:00:00'::timestamp without time zone))
       Buffers: shared hit=1233 read=1514, temp read=1303 written=1303
       ->  Index Scan using idx_rum_addon on documentdb_core.articles (actual time=89.097..89.542 rows=5 loops=1)
             Output: id, title, published, (published <=> '2020-06-01 00:00:00'::timestamp without time zone)
             Index Cond: (articles.tsv @@ '''postgresql'' & ''article'''::tsquery)
             Order By: (articles.published <=> '2020-06-01 00:00:00'::timestamp without time zone)
             Buffers: shared hit=1233 read=1514, temp read=1303 written=1303
     Planning:
       Buffers: shared hit=3
     Planning Time: 0.401 ms
     Execution Time: 89.747 ms
    (12 rows)
    

    No Sort node. No Bitmap. A true Index Scan with Order By pushed into the index. Stops after 5 results.

    I add a RUM index on tsv that adds the position, for phrase search without recheck:

    postgres=# CREATE INDEX idx_rum_pos ON articles USING rum (tsv rum_tsvector_ops);
    
    postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
    SELECT id, title
    FROM articles
    WHERE tsv @@ to_tsquery('simple', 'postgresql <-> article')
    LIMIT 5;
    
    QUERY PLAN                                                  
    --------------------------------------------------------------------------------------------------------------
     Limit (actual time=116.196..116.198 rows=0 loops=1)
       Output: id, title
       Buffers: shared hit=661
       ->  Index Scan using idx_rum_pos on documentdb_core.articles (actual time=116.192..116.193 rows=0 loops=1)
             Output: id, title
             Index Cond: (articles.tsv @@ '''postgresql'' <-> ''article'''::tsquery)
             Buffers: shared hit=661
     Planning:
       Buffers: shared hit=3
     Planning Time: 0.223 ms
     Execution Time: 117.838 ms
    (11 rows)
    

    No "Rows Removed by Index Recheck". Positions stored in the index enable direct verification of adjacency. The same idea applies beyond full‑text search, for example with JSONB, where RUM can store element positions to reduce rechecks and improve performance.

    The same index is used for relevance ranking:

    postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
    SELECT id, title,
           tsv <=> to_tsquery('english', 'postgresql | optimization') AS rank
    FROM articles
    WHERE tsv @@ to_tsquery('simple', 'postgresql | article')
    ORDER BY tsv <=> to_tsquery('english', 'postgresql | optimization')
    LIMIT 10;
    
    QUERY PLAN                                                   
    ---------------------------------------------------------------------------------------------------------------
     Limit (actual time=543.958..544.006 rows=10 loops=1)
       Output: id, title, ((tsv <=> '''postgresql'' | ''optim'''::tsquery))
       Buffers: shared hit=783, temp read=13289 written=15244
       ->  Index Scan using idx_rum_pos on documentdb_core.articles (actual time=543.954..543.977 rows=10 loops=1)
             Output: id, title, (tsv <=> '''postgresql'' | ''optim'''::tsquery)
             Index Cond: (articles.tsv @@ '''postgresql'' | ''article'''::tsquery)
             Order By: (articles.tsv <=> '''postgresql'' | ''optim'''::tsquery)
             Buffers: shared hit=783, temp read=13289 written=15244
     Planning:
       Buffers: shared hit=3
     Planning Time: 0.198 ms
     Execution Time: 548.884 ms
    (12 rows)
    

    Ranking is computed directly during the index scan. No Sort node, no heap access for computing the score, enabling early termination with LIMIT. This contrasts with GIN, where ranking is computed after fetching all matching rows from the heap and sorting them.

    Early versions of RUM allowed multi-column indexes, but only the main inverted column participated in the ordered scan (2017 thread). Additional columns were applied as post-filters, which limited the benefit of LIMIT queries. Currently filter columns can participate in the index scan itself, allowing pruning during traversal of the ordered posting list — but ordering is still driven by a single attached column. Here is an example:

    postgres=# CREATE INDEX idx_rum_multi ON articles
        USING rum (tsv rum_tsvector_addon_ops, category, published)
        WITH (attach = 'published', to = 'tsv');
    
    postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS, VERBOSE)
    SELECT id, title, published
    FROM articles
    WHERE tsv @@ to_tsquery('english', 'postgresql')
      AND category = 'tech'
    ORDER BY published <=> '2020-06-01'::timestamp
    LIMIT 5;
    
    QUERY PLAN                                                  
    --------------------------------------------------------------------------------------------------------------
     Limit (actual time=50.212..51.269 rows=5 loops=1)
       Output: id, title, published, ((published <=> '2020-06-01 00:00:00'::timestamp without time zone))
       Buffers: shared hit=117 read=112, temp read=550 written=550
       ->  Index Scan using idx_rum_multi on documentdb_core.articles (actual time=50.209..51.258 rows=5 loops=1)
             Output: id, title, published, (published <=> '2020-06-01 00:00:00'::timestamp without time zone)
             Index Cond: ((articles.tsv @@ '''postgresql'''::tsquery) AND (articles.category = 'tech'::text))
             Order By: (articles.published <=> '2020-06-01 00:00:00'::timestamp without time zone)
             Buffers: shared hit=117 read=112, temp read=550 written=550
     Planning:
       Buffers: shared hit=37 read=9 dirtied=3
     Planning Time: 7.805 ms
     Execution Time: 51.667 ms
    (12 rows)
    

    Here, category = 'tech' is in Index Cond, not a Filter, so pre-filtering on multiple columns is possible. RUM’s <=> operator uses a ranking model (ts_score) that combines ts_rank and ts_rank_cd, addressing their respective limitations (logical operators and OR queries).

    The RUM Data Structure includes addInfo and may be sorted on it instead of TID:

    RUM Posting List (leaf page):
    ┌─────────────────────────────────────────────┐
    │ [TID₁, addInfo₁] [TID₂, addInfo₂] ...       │
    │ sorted by addInfo (when order_by_attach=on) │
    │ or by TID (when order_by_attach=off)        │
    └─────────────────────────────────────────────┘
    

    One drawback of the RUM index is that it has slower build and insert times than GIN. This is because RUM stores additional information alongside keys and utilizes generic Write-Ahead Logging (WAL) records, which can significantly increase WAL volume compared to GIN. It fits workloads where data is written once, like append-only or event-driven use cases, and queried many times later. Unlike GIN, RUM does not implement a pending list for fast updates, which contributes to its higher write and build costs. RUM performs best when dealing with highly repetitive keys, such as those found in natural language text, or denormalized documents. However, for high-cardinality unique keys, like UUIDs, the inverted index structure offers little advantage over a B-tree.

    The next post will cover Extended RUM, which enables additional indexing and ordering capabilities in DocumentDB.

    GIN: Understanding PostgreSQL's Inverted Index and Its Limitations

    This series of posts traces the evolution from GIN to RUM to Extended RUM, showing how a single architectural idea—to store more in the index to do less work at query time—unlocks major performance improvements at each step.

    Generalized Inverted Index (GIN) is PostgreSQL's answer to the question "how do I index composite values?" — including arrays, JSONB documents, tsvector, and any data type that decomposes into multiple keys per row.

    The structure is conceptually simple:

    Entry Tree (B-tree of keys)
    ├── "alice"  → Posting List: [(0,1), (3,7), (5,2)]
    ├── "bob"    → Posting List: [(0,2), (4,1)]
    ├── "carol"  → Posting Tree (when list becomes too large for in-line storage)"
    │             └── B-tree of ItemPointers
    └── ...
    

    For each indexed value, GIN calls extractValue to decompose it into keys. Each key gets an entry in the B-tree. Each entry points to a posting list — a sorted array of heap TIDs (tuple identifiers) that contain that key.

    At query time, GIN calls extractQuery to decompose the search condition into keys, then intersects or unions the posting lists using a consistent function.

    What GIN Gets Right:

    1. Space efficiency: posting lists are compressed (variable-byte encoding), and duplicate keys are stored once.
    2. Fast containment queries: @>, <@, @@ — any "does this value contain these keys?" pattern is natural.
    3. Extensible operator classes: anyone can plug in new data types via extractValue/extractQuery/consistent.

    Limitations of GIN:

    1. Posting lists lack ordering info. GIN's lists only contain TIDs sorted by physical location, with no space for extra data.
    2. Cannot perform ordered scans. GIN doesn’t support ORDER BY pushdown, and queries with sorting require an explicit sort after the index scan.
    3. No extra payload per TID. Posting lists hold only 6-byte TIDs, without timestamps, scores, or additional info.
    4. Only bitmap scans are supported, not true index scans. GIN generates and processes bitmaps of matching TIDs.

    Implications:

    • Ranking needs heap access. For ranked searches like ts_rank(), GIN finds TIDs but lacks term frequency or position info, so it must fetch and sort matching heap tuples.
    • Phrase searches need reprocessing. GIN can locate tuples containing specific words, but it cannot confirm if these words are adjacent based only on the index. It identifies possible matches and then rechecks each heap tuple to ensure the words are in sequence. As clarified by Artur Zakirov (Postgres Professional), "GIN supports it, but it requires additional bitmap heap scan and so it is slower."
    • No LIMIT pushdown — must scan the entire posting list even for a small number of results.
    • Cursor-based iteration isn't supported.

    I created the following tables that I'll use for the examples in this blog post and the next ones:

    CREATE EXTENSION IF NOT EXISTS btree_gin;
    
    DROP TABLE IF EXISTS articles;
    CREATE TABLE articles (
        id          serial PRIMARY KEY,
        title       text NOT NULL,
        body        text NOT NULL,
        category    text NOT NULL,
        published   timestamp NOT NULL,
        score       int NOT NULL,
        tsv         tsvector GENERATED ALWAYS AS (
                        to_tsvector('simple', title || ' ' || body)
                    ) STORED
    );
    
    INSERT INTO articles (title, body, category, published, score)
    SELECT
        'Article about ' || words[1 + (i % 10)] || ' and ' || words[1 + ((i*7) % 10)],
        'This is the body discussing ' || words[1 + ((i*3) % 10)] ||
        ' in the context of ' || words[1 + ((i*11) % 10)] ||
        '. We also mention ' || words[1 + ((i*13) % 10)] || ' here.',
        categories[1 + (i % 4)],
        '2020-01-01'::timestamp + (i || ' hours')::interval,
        (i * 17) % 100
    FROM generate_series(1, 1e6) AS s(i),
    LATERAL (SELECT ARRAY['postgresql','indexing','performance','database',
                           'query','optimization','storage','replication',
                           'vacuum','analytics'] AS words) w,
    LATERAL (SELECT ARRAY['tech','science','blog','news'] AS categories) c;
    
    ANALYZE articles;
    
    CREATE INDEX idx_gin_tsv ON articles USING gin (tsv);
    
    

    Here is a Full-Text Search — Bitmap Only:

    postgres=# EXPLAIN (COSTS OFF, ANALYZE, BUFFERS)
    SELECT id, title, score
    FROM articles
    WHERE tsv @@ to_tsquery('simple', 'postgresql & article')
    LIMIT 10
    ;
    
    QUERY PLAN                                           
    -----------------------------------------------------------------------------------------------
     Limit (actual time=24.049..24.104 rows=10 loops=1)
       Buffers: shared hit=299 read=6
       ->  Bitmap Heap Scan on articles (actual time=24.045..24.087 rows=10 loops=1)
             Recheck Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
             Heap Blocks: exact=6
             Buffers: shared hit=299 read=6
             ->  Bitmap Index Scan on idx_gin_tsv (actual time=16.165..16.165 rows=100000 loops=1)
                   Index Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
                   Buffers: shared hit=299
     Planning:
       Buffers: shared hit=33 read=2 dirtied=7
     Planning Time: 0.314 ms
     Execution Time: 24.138 ms
    (13 rows)
    

    GIN uses Bitmap Index ScanBitmap Heap Scan. It never produces a true Index Scan. Even with LIMIT 10, it builds the full bitmap first on one million rows before discarding most of them to return only ten.

    An ORDER BY requires a Sort operation:

    postgres=# EXPLAIN (COSTS OFF, ANALYZE)
    SELECT id, title, published
    FROM articles
    WHERE tsv @@ to_tsquery('simple', 'postgresql & article')
    ORDER BY published DESC
    LIMIT 5;
    
    QUERY PLAN                                                  
    -------------------------------------------------------------------------------------------------------------
     Limit (actual time=1838.940..1853.064 rows=5 loops=1)
       ->  Gather Merge (actual time=1838.937..1853.054 rows=5 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort (actual time=1812.430..1812.435 rows=4 loops=3)
                   Sort Key: published DESC
                   Sort Method: top-N heapsort  Memory: 26kB
                   Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
                   Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
                   ->  Parallel Bitmap Heap Scan on articles (actual time=113.247..1784.517 rows=33333 loops=3)
                         Recheck Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
                         Heap Blocks: exact=19968
                         ->  Bitmap Index Scan on idx_gin_tsv (actual time=127.531..127.532 rows=100000 loops=1)
                               Index Cond: (tsv @@ '''postgresql'' & ''article'''::tsquery)
     Planning Time: 1.290 ms
     Execution Time: 1853.445 ms
    (16 rows)
    

    The Sort node is unavoidable. GIN collected all matching TIDs, accessed the heap, sorted them, and retained the top 5.

    Phrase search with distance — Recheck Required:

    postgres=# EXPLAIN (COSTS OFF, ANALYZE)
    SELECT id, title
    FROM articles
    WHERE tsv @@ to_tsquery('simple', 'postgresql <-> article')
    LIMIT 5;
    
    QUERY PLAN                                           
    -----------------------------------------------------------------------------------------------
     Limit (actual time=357.799..357.803 rows=0 loops=1)
       ->  Bitmap Heap Scan on articles (actual time=357.795..357.797 rows=0 loops=1)
             Recheck Cond: (tsv @@ '''postgresql'' <-> ''article'''::tsquery)
             Rows Removed by Index Recheck: 100000
             Heap Blocks: exact=55556
             ->  Bitmap Index Scan on idx_gin_tsv (actual time=48.745..48.746 rows=100000 loops=1)
                   Index Cond: (tsv @@ '''postgresql'' <-> ''article'''::tsquery)
     Planning Time: 0.561 ms
     Execution Time: 357.873 ms
    (9 rows)
    

    With Rows Removed by Index Recheck, GIN found candidates containing both words but had to access the heap to verify adjacency. No positional information in the index.

    What's Next

    These three limitations — no ordering, no positions, bitmap-only scans — are exactly what RUM was designed to solve. In the next post of this series, we'll see how a single architectural change (adding addInfo to each posting list entry) eliminates all three.

    Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools

    The transformative power of LLMs in coding has been irrefutable, and it feels like we are living through a magical computing renaissance. On the socials, we hear impressive numbers of lines of code generated, features delivered, and bugs fixed. But, the macroeconomic indicators seem to be still lagging. Heck, if you talk with an engineering manager, you find that their product shipping dates haven't miraculously compressed by a factor of five, either.

    This paper just landed 10 days ago. It is from MIT and Wharton by Mert Demirer, Leon Musolff, and Liyuan Yang. Their study attempts to provide a structured economic model for evaluating actual productivity obtained from AI coding tools. By pairing confidential Microsoft telemetry with the public footprints of over 100,000 GitHub developers (tracking everything from open-source utilities to web app repositories), the authors show significant systemic friction downstream of AI code generation.

    Of course, I do my usual skeptical critic of the paper. In this case, this is especially heightened because these are economists peeking into the messy non-linear world of software engineering and trying to impose a "production hierarchy" abstraction onto it. But if we reconsider their analysis from a different perspective, it becomes possible to translate their complex production functions into Amdahl's Law terms, and then we can start doing our own evaluations and draw our own conclusions as I discuss below.


    The Monotonic Decay

    The core of the paper rests on this monotonic decay argument. The sheer task-level velocity gains we see from AI coding tools start to bleed out as they move up the production hierarchy. The authors break down AI tool adoption into three distinct generational tiers:

    • Autocomplete (intelligent text prediction),
    • Synchronous (Sync) Agents (interactive, real-time code modifiers like Claude Code or Cursor),
    • Asynchronous (Async) Agents (autonomous async agents).

    When we look at the task-level velocity of these tools, we see impressive numbers. The paper's abstract claims that autocomplete, interactive sync agents, and autonomous async agents increase overall commit activity by cumulative totals of 40%, 140%, and 180% respectively. But as that work climbs toward an official production milestone, the improvements decay significantly.

    For Autocomplete, the +228.2% explosion in raw lines of code bleeds out layer-by-layer until becomes a meager +10.2% increase in actual shipped software releases. For Sync agents, a gigantic +741.3% surge in code syntax reduces down to a modest +20.3% final weekly releases.


    My immediate reaction to this vertical hierarchy (Lines -> Files -> Commits -> PRs -> Repositories -> Releases) is skepticism. Treating code production like a neat production line feels superficial. Software engineering is not a linear conveyor belt, as coding is highly nonlinear, and a single commit routinely alters fifty files. However, giving the authors the benefit of doubt and reading onwards, I find that there is still value to this naive abstraction, as it points to human gatekeeping and coordination overhead at higher levels of the CI/CD pipeline.  AI can write lines of code instantly, not being bogged down by the code syntax at the lower layers. But as that work climbs toward an official production milestone, the structural constraints of the system and human bottlenecks take over, and the massive improvements at the task level decay down to nearly nothing. 

    Let's dive deeper on the mathematical modeling behind this. By taking only the performance of Autocomplete into account (because it operates exclusively at the code-writing level), the authors chose parameters that minimized the differences between their model's predictions and actual developer behavior. Through this exercise, they extracted a local Upstream Output Elasticity ($\theta$) of approximately 0.75. In this layered production model, $\theta = 0.75$ acts as a vertical pass-through metric. It means that at any given stage (say, turning raw commits into clean pull requests), 75% of that layer's success leans entirely on the upstream technical assets flowing into it from the layer below, while a remaining fraction represents the local human effort added at that layer. Because they model software development as a vertically sequential aggregation process, that human intervention operates like a compounding efficiency tax. A massive initial code productivity surge at the bottom layer gets relentlessly multiplied by 0.75 over and over again as it attempts to climb the hierarchy, mathematically forcing the steep vertical attenuation we see in the empirical data.

    Now, the sync and async agents aren't just typing lines inside an editor. They operate at a level where they directly manage files, commits, and pull requests. This expanded layer coverage allows agentic workflows to drop their productivity contributions closer to the finish line. By short-circuiting the early stages of the vertical decay chain, agents handle the work more efficiently, doubling the final impact on shipped software compared to autocomplete as seen in Figure 1.


    Translating the Economics formulas to Amdahl’s Law

    To model what happens inside each of these individual layers, the paper transitions to a nested Constant Elasticity of Substitution (CES) production function. Here, they extract an Elasticity of Substitution ($\sigma$) between AI-generated code and human review effort of a rigid 0.25. In economic lingo, an elasticity of substitution well below $1.0$ means the inputs are "strong complements". That means they are tied/dependent together like a car chassis and tires. It doesn't matter if an automated factory line can manufacture tires 10,000% faster; if you don't speed up the production of the chassis, you don't get more cars. 

    This of course looks a whole lot like Amdahl’s Law, which dictates that the overall speedup of a system is strictly limited by its sequential un-parallelizable bottlenecks:

    $S_{\text{total}} = \frac{1}{(1-P) + \frac{P}{S_{\text{task}}}}$

    where $S_{\text{task}}$ is the speedup achieved at the automated task level, and $P$ is the "Global Parallelizable Fraction" of the entire system workload.

    When the elasticity of substitution ($\sigma$) between machine output and human validation drops to 0.25, the economic model behaves almost exactly like a Leontief production function. (A Leontief production function describes a strict, zero-flexibility production process where inputs must be combined in exact, unalterable proportions, meaning an excess of one ingredient cannot substitute for a shortage of another.) This dictates that human code review is a non-negotiable strictly sequential bottleneck ($1 - P$). If $\sigma$ were infinite, you could completely substitute human verification with raw AI text volume effectively parallelizing the entire layer. But because $\sigma = 0.25$, throwing an infinite mountain of automated code ($S_{\text{task}} \to \infty$) at the problem does nothing to diminish the fixed, sequential time investment required for a human to review it.

    If we run the paper's real-world empirical findings through this equation (isolating the global parallel fraction $P$ for commits against final releases), we find the following:

    • Autocomplete: $S_{\text{task}} = 1.359\times$, $S_{\text{total}} = 1.102\times \implies \mathbf{P \approx 35.0\%}$ 
    • Sync Agents: $S_{\text{task}} = 2.091\times$, $S_{\text{total}} = 1.202\times \implies \mathbf{P \approx 32.2\%}$ 
    • Async Agents: $S_{\text{task}} = 2.800\times$, $S_{\text{total}} = 1.300\times \implies \mathbf{P \approx 35.9\%}$ 

    Huh! Even going from a simple inline autocomplete tool to fully autonomous agents that clone repositories and run test suites out-of-band, the global parallelizable fraction ($P$) refuses to budge and hovers around 35% across all three generations of AI!

    Then how come Sync and Async agents manage to squeeze out 2-3x more final software releases than Autocomplete? While Autocomplete notches a modest 10.2% release expansion, Sync agents push it to 20.3%, and the cumulative stack of Async tools lifts the final output baseline up by 30%. If the global parallelizable envelope ($P$) is locked at 35% from the formulas above, how is the system actually accelerating? This is because according to Amdahl's Law, a system has two entirely separate levers for optimization. You can either increase $P$, or you can aggressively push harder and increase $S_{\text{task}}$. Autocomplete achieved a commit-level speedup ($S_{\text{task}}$) of just $1.359\times$ relative to releases. But Sync agents drive that localized task speedup to $2.091\times$, and Async agents push $S_{\text{task}}$ to $2.800\times$. But this hits a wall of diminishing returns quickly. When the parallelizable footprint ($P$) is pinned to 35%, scaling the localized task speedup ($S_{\text{task}}$) toward infinity yields a hard asymptotic ceiling. Mathematically, the absolute maximum total speedup this configuration can ever achieve is $1 / (1 - 0.35)$, which works out to a hard cap of a 53% overall increase ($1.53\times$) in shipped software. So, no matter how fast an autonomous bot can process a commit, the remaining 65% human sequential bottleneck ($1 - P$) acts as a hard stop.

    Is P=0.35 sensible? I think that 35% parallel fraction passes the smell test. If you ask any developer what percentage of their week is spent actually writing code, they'll give you a number right in this ballpark, around 20% to 40%. The remaining 65% of the developers' time is consumed by finding/defining the task, planning, and paying the inevitable human communication tax of meetings and team alignment.

    This is why that flatlining 35% profile across all three tool generations makes  sense to me. Generative AI can supercharge the coding sandbox by churning out code at high speed (maxing out $S_{\text{task}}$), but it can't parallelize the systemic reasoning, organizational consensus, and deep problem/model comprehension that dominates the rest of the job. Until we find a way to automate those parts as well, shipping software will remain an inherently human-throttled process.

    June 08, 2026

    Automate Oracle PL/SQL to PostgreSQL migration with Amazon Bedrock and Strands Agents

    In this post, you learn how to build a generative AI–powered migration assistant that helps automate portions of the last mile of code conversion. Using Anthropic’s Claude Sonnet 4.6 on Amazon Bedrock, the Strands Agents framework, and the AWS Knowledge MCP Server, you can automate the conversion and validation of PL/SQL objects against Amazon Aurora PostgreSQL-Compatible Edition. The assistant reads the AWS DMS SC assessment CSV, fetches live PL/SQL source from Oracle, converts each object, deploys the result to Aurora PostgreSQL through AWS Lambda, and runs automated tests, in a single pipeline.

    Building Python applications with SQLAlchemy and Aurora DSQL

    In this post, you’ll build a working veterinary clinic command line interface (CLI) application that demonstrates production-ready patterns for connecting SQLAlchemy to Aurora DSQL. The patterns you implement (UUID primary keys, application-level relationships, and AUTOCOMMIT engine configuration) apply to other Python ORMs on Aurora DSQL.

    A Case for Simulation-Driven Resilience in Agentic Data Systems

    As I mentioned in my previous post, I traveled to San Jose at the end of May for the ACM CAIS conference. On Day 0, I gave a very short talk at the Supporting our AI Overlords (SAO) workshop. This post is the promised summary of our paper, "A Case for Simulation-Driven Resilience in Agentic Data Systems", joint work with Aleksey Charapko (University of New Hampshire) and Akshat Vig (MongoDB).


    Metastability is critical for building the next generation of distributed systems

    Our story starts with metastability. Metastability is the failure mode where the mechanisms built to protect the system (retries, queues, timeouts, load shedding) turn into amplifiers. Even after the trigger that caused the overload goes away, the system stays behind, churning through busy work, perpetually trying to catch up with the remnants of failed and behind-schedule tasks. It's a bit like missing some foundational math in high school. You spend so long backfilling the old gaps that you never keep up with the new material piling on top, so you stay permanently behind. The catching up work is what keeps you behind, which requires catching up work later, which keeps you behind. (This is presumably why I'm not a rich machine learning scientist today at Deep Mind.)

    Avoiding and tolerating metastable failures is critical to building the next generation of reliable distributed systems. Aleksey's Metastable Failures in the Wild study (OSDI'22) cataloged these failures from real production incidents. They matter so much because they are the hard faults that remain. We have largely learned to deal with the straightforward ones (crashes, corruptions, dropped packets), which leaves emergent performance failures as the final boss. Metastable failures are responsible for a disproportionate share of critical cloud unavailability incidents, with no single broken part to point at and no obvious way to fix/reset them when they emerge. The cloud economics make the problem worse. Providers and operators have every incentive to run with the absolute minimum of excess capacity and to trim slack and "waste". But this thin margin is exactly where metastability thrives.


    Agents supercharge the feedback loops and put metastability on steroids

    As AI agents are replacing humans as the primary clients of modern data systems, they are bringing a qualitative shift in workload characteristics. Agents retry aggressively while mutating the query on each attempt. They fan out into bursty parallel sub-tasks. They hold transactions open while they wait on an external LLM to provide the next step. The "Supporting our AI overlords" paper showed that agents create ~20x more branches and perform ~50x more rollbacks than humans do. These behaviors violate assumptions baked into every layer of a modern data system. Execution control assumes stationary arrivals. Caching assumes temporal locality. Concurrency control assumes bounded hold times. Agents break all three at once!


    We propose simulation-driven resilience to address this problem

    Simulation can enable us to systematically explore the agent-database boundary, and discover/prevent metastability failures before a production incident forces a reactive (and nonworking) patch. We propose a simulation based approach because only simulation is cheap enough to sweep an enormous trigger space, and deterministic enough to replay and dissect every failure it finds.

    • Benchmarks measure steady state. But, metastability is transient and emergent: it lives in the sequence of rare events, not in the average.
    • Queueing theory assumes mostly stationary independent arrivals. Agents break these assumptions.
    • Testing with the production system is hopeless for design sweeps and gives you almost no observability. When your database falls over under load, you still don't know what went wrong, or how it went wrong, and you can't explore the design space because you have no feedback to explore it with.


    MESSI finds failures in the seams

    Metastability scurries in the seams/interaction of the composition of subsystems,  so we need a tool that lets us look there. Aleksey developed MESSI (MEtaStability SImulator), a discrete-event simulation framework for exploring metastability dynamics in distributed systems. MESSI enables modeling any (sub)system as a directed graph and composition of (sub)systems. Logic Nodes implement routing and state policy. Processors model the physical resource constraints (queues, I/O delays, network latency). Individual work items, QItems, carry state as they traverse the graph. There is a clear separation of roles: policy lives in the nodes, resource contention lives in the edges, and you can vary each independently.

    Because this is a simulator, it is deterministic and replayable, and it exposes the full internal state of every component at every tick. A metastable trigger you discover once can then be re-run against alternate designs to see which ones survive.


    Our findings from the Execution Control System (ECS) simulations

    Using MESSI, we performed an analysis of the Execution Control System (ECS), because it is the critical first domino: when the ECS fails, every subsystem downstream of it (caches, buffer pools, lock managers) fails after it. The ECS sits between admitted requests and the execution engine and decides who runs, in what order, with how much. It is the component that mediates resource contention at the backend. The usual design hands out a bounded pool of execution tickets (one ticket buys one worker thread), sorts admitted requests into a few priority queues with different ticket budgets, and dispatches them to worker threads and I/O slots. But, unlike an OS scheduler, which aims for fairness and wants every thread to eventually run, the ECS has to make decisive choices. It needs to prioritize the latency-sensitive short queries and shed the excess, as it aims the cost of waiting off the server and back onto the client. This is, of course, exactly what closes the feedback loop with a retrying agent.

    We found two interesting results in our analysis.


    Two reasonable policies may compose into a metastable loop

    A natural ECS design uses two queues, a high-priority one for short tasks and a low-priority one for long tasks, with a probing policy on each that nudges its ticket count up or down to chase a performance metric. Each policy is sensible in isolation. But when you compose them, they interfere with each other on occasion in a metastable manner. The long queue probes for more tickets to improve its own throughput. More long-task tickets means more threads contending for the same cores, which steals CPU from the short queue, which then escalates its ticket count to keep up. Now both queues are inflating until they slam into their hard limits. The trap is that ticket acquisition rate, the metric each policy is optimizing, stops predicting actual progress under overload. A queue full of waiting tasks can churn tickets at a furious rate (grab a ticket, do 1 ms of work, yield, repeat) while getting almost nothing done. The metric looks healthy while the system performance collapses.


    Admission control and the ECS destructively interfere

    Putting an admission controller in front of the ECS sounds like defense in depth, but this can also backfire when done naively. Admission control drops requests indiscriminately. It sits at the network edge, and when it sees elevated latency, it starts rejecting a fraction of everything, short and long task alike. In simulations, we found that a workload spike can trigger admission control immediately, even though the ECS, left alone, would have rebalanced its tickets and absorbed the load after a brief adjustment. But the drops from the admission control prevent the ECS from rebalancing. With work being shed out from under it, the ECS never gets the signal it needs to adjust allocations and priorities, and the system stays parked in reduced-goodput mode until the workload subsides on its own. Under agents this gets worse, because admission control can't tell an agent's first attempt from its fifth retry, and a rejection would usually cause the agent to escalate things.

    The full ECS design study is in our companion paper, "Towards Designing an Execution Control System with Metastability Resilience" (to appear at IEEE ICCCN 2026).

    June 07, 2026

    Getting Paid by Flat Rate Movers

    Back in 2023, I hired Flat Rate Movers (A.K.A. Flat Rate Moving) for an interstate move. They subcontracted to a third party who showed up under-staffed, under-equipped, and very confused; the whole mess wound up causing a good deal of damage to my belongings and home. I filed a claim with Flat Rate and also requested they issue a partial refund. They refused to talk to me or their insurance company, and after six months of calling, emailing, and writing, I got overwhelmed by other life struggles and gave up. I did manage to file a claim with their backing insurance company, but got nothing for the damage to the house, or failure to deliver goods and services.

    In October 2025 Flat Rate started sending me e-mails asking me to use Flat Rate for a move again. I called, explained the situation, and asked how they were planning to resolve it. They invited me to write to customer service, and the whole process resumed, with essentially the same result. I’d call every few days to ask how things were going, and various Flat Rate representatives would promise they’d get back to me tomorrow, or a check was in the mail, or that they’d talk to claims for me. Unsurprisingly, none of this worked.

    Long story short, after a bunch of claim forms, emails, letters, complaints to various regulators, requests for arbitration, and one hundred and sixty-one phone calls with Flat Rate directly, I have succeeded. I started calling their main line at (212) 988-9292, picking a three-digit extension, and asking that person for help. I did this ten times each morning, and within two weeks had multiple people offering to help. I can’t say for sure which person got things over the line, because I called a lot of them, but by April 2026, I had a check in the mail.

    If you wind up in the same boat, I recommend two things. First, keep a log of everything you do. Flat Rate may tell you things like “we did not receive supporting documentation within sixty days” or “you did not file the claim in time”; it is extremely helpful to be able to say things like “Please see my email from 2023-05-06 at 15:32 in which I submitted the requested documentation.” Second, set up a spreadsheet to keep track of which extensions you’ve called and when. Here's mine—each row number is the three-digit extension.

    Extended RUM in DocumentDB extension for PostgreSQL: Efficient ESR (Equality, Sort, Range) Queries

    Last year, I examined RUM indexes within this series on multi-key indexing, demonstrating that they cannot substitute MongoDB's compound indexes for sorted queries. A year later, the DocumentDB extension has evolved to incorporate an Extended RUM index, which allows for an ordered scan rather than a bitmap scan. Let's revisit our pagination query to see how it performs now.

    I start a container with the latest DocumentDB (version v0.112-0 from May 26, 2026):

    docker run -d --name documentdb-local -p 10260:10260 -p 9712:9712 ghcr.io/documentdb/documentdb/documentdb-local:latest  --username franck --password franck --start-pg
    
    

    I can connect to PostgreSQL on port 9712, where many extensions are installed, including the extended RUM index:

    docker exec -it documentdb-local psql -p 9712 postgres
    
    psql (17.10 (Debian 17.10-1.pgdg13+1))
    Type "help" for help.
    
    postgres=# \dx
                                            List of installed extensions
    Name           | Version |   Schema   |                        Description                         
    -------------------------+---------+------------+------------------------------------------------------------
     documentdb              | 0.112-0 | public     | API surface for DocumentDB for PostgreSQL
     documentdb_core         | 0.112-0 | public     | Core API surface for DocumentDB on PostgreSQL
     documentdb_extended_rum | 0.112-0 | public     | DocumentDB Extended RUM index access method
     pg_cron                 | 1.6     | pg_catalog | Job scheduler for PostgreSQL
     plpgsql                 | 1.0     | pg_catalog | PL/pgSQL procedural language
     postgis                 | 3.6.3   | public     | PostGIS geometry and geography spatial types and functions
     tsm_system_rows         | 1.0     | public     | TABLESAMPLE method which accepts number of rows as a limit
     vector                  | 0.8.2   | public     | vector data type and ivfflat and hnsw access methods
    (8 rows)
    
    postgres=#
    

    I can also connect to the MongoDB-compatible API:

    docker exec -it documentdb-local mongosh -u franck -p franck 'mongodb://localhost:10260/?tls=true&tlsAllowInvalidCertificates=true'
    
    Current Mongosh Log ID: 6a0b3b537d2a1c3471d1a7ba
    Connecting to:          mongodb://<credentials>@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true&directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.8.3
    Using MongoDB:          7.0.0
    Using Mongosh:          2.8.3
    
    For mongosh info see: https://www.mongodb.com/docs/mongodb-shell/
    
    [direct: mongos] test>
    

    Like in the previous post, I created a simple collection with 10,000 documents:

    [direct: mongos] test>
     for (let i = 0; i < 10000; i++) {
      db.demo.insertOne({
        a: 1,
        b: Math.random(),
        ts: new Date()
      });
    }
    
    

    I create a compound index that follows the MongoDB Equality, Sort, Range rule—designed for queries with an equality filter on a and a sort on ts:

    [direct: mongos] test>
     db.demo.createIndex({ "a": 1, "ts": -1 });
    
    

    I run the same query as in the previous post, which, with the standard RUM indexes, produced a Bitmap Index Scan followed by a Sort of all documents matching a: 1 before returning the top 10:

    [direct: mongos] test>
     db.demo.find(
     { a: 1 }
    ).sort(
     { ts: -1 }
    ).limit(10).explain("executionStats");
    
    

    The good surprise is that with the current version of DocumentDB, the execution plan looks like MongoDB's native IXSCAN with no additional sort step:

    [direct: mongos] test> db.demo.find(
      { a: 1 }).sort({ts:-1}).limit(10).explain("executionStats")
    ;
    
    {
      explainVersion: 2,
    ...
      executionStats: {
        nReturned: Long('10'),
        executionTimeMillis: 0.286,
        executionStartAtTimeMillis: 0.256,
        totalDocsExamined: Long('10'),
        totalKeysExamined: Long('10'),
        executionStages: {
          stage: 'LIMIT',
          nReturned: Long('10'),
          executionTimeMillis: 0.286,
          executionStartAtTimeMillis: 0.256,
          totalDocsExamined: 10,
          totalKeysExamined: 10,
          numBlocksFromCache: 25,
          inputStage: {
            stage: 'FETCH',
            nReturned: Long('10'),
            executionTimeMillis: 0.267,
            executionStartAtTimeMillis: 0.253,
            totalKeysExamined: 10,
            numBlocksFromCache: 25,
            inputStage: {
              stage: 'IXSCAN',
              nReturned: Long('10'),
              executionTimeMillis: 0.267,
              executionStartAtTimeMillis: 0.253,
              indexName: 'a_1_ts_-1',
              totalKeysExamined: 10,
              numBlocksFromCache: 25
            }
          }
        }
      },
      ok: 1
    }
    

    It read only the minimum necessary to get the result: ten index entries (totalKeysExamined: 10) in the expected order and fetched only ten documents (totalDocsExamined: 10). This is the most efficient execution plan.

    Comparing the Two RUM Index Definitions

    I connect to PostgreSQL to describe the table that stores my collection documents (you will see later how I obtained the name):

    postgres=# \d documentdb_data.documents_7
    
                Table "documentdb_data.documents_7"
    
         Column      |  Type  | Collation | Nullable | Default 
    -----------------+--------+-----------+----------+---------
     shard_key_value | bigint |           | not null | 
     object_id       | bson   |           | not null | 
     document        | bson   |           | not null | 
    
    Indexes:
    
        "collection_pk_7" PRIMARY KEY, btree (shard_key_value, object_id)
    
        "documents_rum_index_25" documentdb_extended_rum (document documentdb_extended_rum_catalog.bson_extended_rum_composite_path_ops (pathspec='[ "a", { "ts" : -1 } ]', tl='2691'))
    
    Check constraints:
    
        "shard_key_value_check" CHECK (shard_key_value = '7'::bigint)
    
    postgres=#
    

    What was a standard RUM index in the previous post is now an extended RUM index:

    Attribute Previous post Current test
    Index Type documentdb_rum documentdb_extended_rum
    Operator Class bson_rum_single_path_ops (×2) bson_extended_rum_composite_path_ops
    Fields a (asc), ts (implicit asc) a (asc), ts (desc)
    Sort Direction on ts Not specified / default ascending Explicitly -1 (descending)
    Path Encoding Two separate path= entries Single JSON pathspec array

    The extended RUM index acts as a sort-order-aware composite index, embedding the descending direction directly into the pathspec. Unlike the previous approach, which stored each path independently, this approach encodes all indexed fields as a single composite pathspec and generates a single composite index entry per document, preserving the relative ordering between fields. An index scan (RumOrderedScan) efficiently covers both filtering and sorting, eliminating the need for a separate Sort node in the PostgreSQL execution plan. This benefit is evident when executing the same query via the DocumentDB API in PostgreSQL:

    postgres=# explain (analyze, buffers, verbose, costs off)
    select document from bson_aggregation_find(
      'test',
      '{
        "find": "demo",
        "filter": { "a": 1 },
        "sort":   { "ts": -1 },
        "limit": 10
      }'::documentdb_core.bson
    );
                                                           QUERY PLAN                                                        
    -------------------------------------------------------------------------------------------------------------------------
     Limit (actual time=0.108..0.138 rows=10 loops=1)
       Output: document, (bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson))
       Buffers: shared hit=4
       ->  Index Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.104..0.117 rows=10 loops=1)
             Output: document, bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson)
             Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
             Order By: (collection.document |-<> '{ "ts" : { "$numberInt" : "-1" } }'::bson)
             Buffers: shared hit=4
     Planning:
       Buffers: shared hit=2
     Planning Time: 0.453 ms
     Execution Time: 0.192 ms
    (12 rows)
    
    postgres=#
    

    Note: I got the name of the internal table I described above from this execution plan, which uses the collection name in the query. The MongoDB API's explain() shows a MongoDB-compatible execution plan, and EXPLAIN in PostgreSQL shows the PostgreSQL version of it.

    Comparison of Execution Plans

    Here is how the new behavior with an ordered index scan compares to the previous bitmap scan.

    Feature Ordered Index Scan Bitmap Index Scan
    PostgreSQL Node Index Scan Bitmap Index Scan
    Ordering Handled by sort direction in index Lost—requires a Sort node
    Scan Type scanType: RumOrderedScan scanType: RumFastScan / RumRegularScan
    Efficiency Supports early termination (LIMIT) Must scan all matching TIDs into bitmap
    RUM Entry Point rumgettuple() rumgetbitmap()
    Sort Step None — useSimpleScan = true rum_tuplesort_performsort() required
    Memory Usage Low—one tuple at a time High—full TIDBitmap + sort state
    Index Structure Used B-tree walk via orderStack Posting list / posting tree dump
    Filter Evaluation Inline via ValidateIndexEntry() Post-collection in keyGetItem()
    Seek Optimization Yes—advances queryKey as entries exhaust No
    Multi-column Support Multi-column via composite pathspec Multi-column via separate entries
    LIMIT benefit ✅ Full — stops after N rows ❌ None — bitmap built before LIMIT applies
    Recheck Behavior xs_recheckorderby per tuple xs_recheck on bitmap result
    Trigger Condition RumEnableOrderedOperatorScans + willSort + norderbys > 0 Default path

    Index Only Scan

    Other improvements are coming to Extended RUM, like Index Only Scan, currently supported for COUNT:

    postgres=# explain (analyze, buffers, verbose, costs off)  
    SELECT document FROM bson_aggregation_count(  
      'test',  
      '{  
        "count": "demo",  
        "query": { "a": 1 }  
      }'::documentdb_core.bson  
    );
                                                                QUERY PLAN                                                            
    ----------------------------------------------------------------------------------------------------------------------------------
     Aggregate (actual time=22.760..22.763 rows=1 loops=1)
       Output: documentdb_api_internal.bsoncommandcount(1)
       Buffers: shared hit=109
       ->  Index Only Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.089..14.292 rows=10000 loops=1)
             Output: collection.document
             Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
             Heap Fetches: 0
             Buffers: shared hit=109
     Planning:
       Buffers: shared hit=4
     Planning Time: 0.441 ms
     Execution Time: 22.883 ms
    (12 rows)
    

    Index Only Scan will be supported in the future (see IsQueryValidForIndexOnlyScan)

    Conclusion

    A year ago, DocumentDB's RUM indexes had a significant limitation for pagination queries: even with the right compound index, the planner would fall back to a Bitmap Index Scan followed by a full Sort, meaning every matching document had to be collected and sorted before the first result could be returned. A LIMIT 10 query on 10,000 documents would examine all 10,000—defeating the purpose of the compound index.

    With v0.112-0, this is fixed. The new documentdb_extended_rum index type, combined with the RumOrderedScan execution path, reduces the gap with native MongoDB behavior:

    • The index encodes sort direction directly in the pathspec ({ "ts": -1 })
    • The planner chooses an Index Scan instead of a Bitmap Index Scan
    • No Sort node appears in the plan
    • LIMIT 10 examines exactly 10 index entries and 10 documents

    This is more than just a cosmetic change. In time-series queries—such as filtering on a low-cardinality field, sorting by timestamp descending, and retrieving the first page—the difference between the two plans ranges from O(result) to O(total size). For OLTP systems, pagination queries are common and need to be quick and reliable, since they show results to the user before the user takes any action, selects, refines filters, or moves to the next page.

    This ordered scan is also essential for TTL indexes to efficiently identify expiration candidates.

    The key ingredients that make this work together are visible from the DocumentDB open-source code:

    1. documentdb_extended_rum—composite pathspec with explicit sort direction
    2. bson_extended_rum_composite_path_ops — single operator class covering all fields
    3. RumOrderedScan — B-tree walk in index order via orderStack, bypassing rumgetbitmap()
    4. useSimpleScan — returns one tuple at a time, enabling true LIMIT pushdown
    5. RumAllowOrderByRawKeys — the GUC that enables this path, now on by default

    This is all enabled by default:

    postgres=# \dconfig *rum*order*
               List of configuration parameters
                      Parameter                   | Value
    ----------------------------------------------+-------
     documentdb_rum.enable_ordered_operator_scans | on
     documentdb_rum.forceRumOrderedIndexScan      | off
    (2 rows)
    

    In under a year, DocumentDB evolved from "RUM instead of GIN, but with the same pagination limitations" to "RUM with ordered scan, aligning more with MongoDB's IXSCAN behavior for ESR-pattern indexes". For developers implementing cursor-based pagination or queries with a selective filter and sorting on a time or sequence field, this marks the version at which it begins to function as expected.