a curated list of database news from authoritative sources

March 29, 2026

Selecting a character set for MySQL and MariaDB clients

 MySQL and MariaDB have many character-set-related options, perhaps too many:

  1. character_set_client
  2. character_set_connection
  3. character_set_database
  4. character_set_filesystem
  5. character_set_results
  6. character_set_server
  7. character_set_system
This is a topic I don't know much about, and I am still far from an expert; my focus has been on other DBMS topics. But I spent time on it recently while explaining what looked like a performance regression but was really just a new release of MySQL using a charset that is less CPU-efficient than the one used previously.

Debugging

The initial sequence to understand what was going on was:
  1. mysql -e 'SHOW GLOBAL VARIABLES like "character_set_%"'
  2. mysql -e 'SHOW SESSION VARIABLES like "character_set_%"'
  3. run "SHOW SESSION VARIABLES" from my benchmark client
Note:
  • the output from steps 1 and 2 was different
    • with SHOW GLOBAL VARIABLES I got character_set_client = latin1 but with SHOW SESSION VARIABLES I got character_set_client = utf8mb3. This happens. One reason is that some MySQL client binaries autodetect the charset based on the value of LANG or LC_CTYPE in your Linux environment. Another reason is that when autodetection isn't done, clients can use the default charset that was set at compile time. That charset is then passed to the server during the connection handshake (see thd_init_client_charset). So it is likely that character_set_client as displayed by SHOW GLOBAL VARIABLES isn't what your client will use.
  • the output from steps 2 and 3 was different
    • autodetection is only done when mysql_options() is called with a certain flag (see below). And that is not done by the MySQL driver in sysbench, nor is it done by Python's MySQLdb. So my benchmark clients are likely selecting the default charset and don't do autodetection. And that default is determined by the version of the MySQL client library, meaning that default can change over the years. For the source that implements this, search for MYSQL_AUTODETECT_CHARSET_NAME and read sql-common/client.c.
The following enables autodetection and should be called before calling mysql_real_connect():
    mysql_options(..., 
                  MYSQL_SET_CHARSET_NAME,
                  MYSQL_AUTODETECT_CHARSET_NAME);
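As a rough sketch of what autodetection amounts to, here is an illustration in Python. The real logic lives in libmysqlclient (search for MYSQL_AUTODETECT_CHARSET_NAME in sql-common/client.c); the locale-to-charset mapping below is a simplified assumption, not the actual table.

```python
# Simplified illustration of client charset autodetection.
# This mapping is an assumption for the example, not libmysqlclient's table.
CODESET_TO_CHARSET = {
    "UTF-8": "utf8mb4",
    "ISO-8859-1": "latin1",
}

def autodetect_charset(env):
    """Pick a charset from LC_CTYPE/LANG, e.g. 'en_US.UTF-8' -> utf8mb4."""
    locale = env.get("LC_CTYPE") or env.get("LANG") or ""
    codeset = locale.split(".", 1)[1] if "." in locale else ""
    # When detection fails, fall back to a compiled-in default,
    # which is what changes across client library versions.
    return CODESET_TO_CHARSET.get(codeset, "utf8mb4")

print(autodetect_charset({"LANG": "en_US.UTF-8"}))       # utf8mb4
print(autodetect_charset({"LANG": "de_DE.ISO-8859-1"}))  # latin1
```

The point is that the result depends on the client's environment and library version, which is why the server-side globals can disagree with what a session actually uses.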

Note that adding the following to my.cnf isn't a workaround for clients that don't do autodetection.
    [client]
    default-character-set=...

Notes

These are from my usage of MySQL 5.7.44, 8.0.45 and 8.4.8 along with MariaDB 10.6.25, 10.11.16 and 11.4.10. All were compiled from source as was sysbench. I installed MySQLdb and the MySQL client library via apt for Ubuntu 24.04.

The values for character_set_client, character_set_results and character_set_connection were measured via the MySQL command-line client running SHOW GLOBAL VARIABLES and SHOW SESSION VARIABLES and then the benchmark clients running SHOW SESSION VARIABLES.

The reason for sharing this is to explain the many possible values your session might use for character_set_client, character_set_results and character_set_connection. And using the wrong value might waste CPU.

What per-session values are used for character_set_client|results|connection?
* my.cnf has character_set_server=latin1
* per SHOW GLOBAL VARIABLES each is set to latin1
* values below measured via SHOW SESSION VARIABLES
Values for character_set_client|results|connection
... with the "mysql" command line client
... this is easy to change with the --default-character-set command line option or the equivalent option in my.cnf

dbms       charset
5.7.44     utf8
8.0.45     utf8mb4
8.4.8      utf8mb4
10.6.25    utf8mb3
10.11.16   utf8mb3
11.4.10    utf8mb3
Values for character_set_client|results|connection
... with sysbench

                     client library version
dbms       5.7     8.0      8.4      10.6     10.11    11.4
5.7.44     latin1  latin1   latin1   NA       NA       NA
8.0.45     latin1  utf8mb4  utf8mb4  NA       NA       NA
8.4.8      latin1  utf8mb4  utf8mb4  NA       NA       NA
10.6.25    latin1  latin1   latin1   utf8mb4  utf8mb4  utf8mb4
10.11.16   latin1  latin1   latin1   utf8mb4  utf8mb4  utf8mb4
11.4.10    latin1  utf8mb4  utf8mb4  utf8mb4  utf8mb4  utf8mb4
Values for character_set_client|results|connection
... with the insert benchmark (Python MySQLdb and /lib/x86_64-linux-gnu/libmysqlclient.so.21)
... I am not sure which MySQL version libmysqlclient.so.21 corresponds to; this is on Ubuntu 24.04

dbms       charset
5.7.44     latin1
8.0.45     utf8mb4
8.4.8      utf8mb4
10.6.25    latin1
10.11.16   latin1
11.4.10    utf8mb4

March 28, 2026

Measuring AI Ability to Complete Long Software Tasks

This paper from METR (Model Evaluation & Threat Research) introduces a new metric for tracking AI progress: the "50%-task-completion time horizon". This denotes the length of a software engineering task (measured by how long a skilled human developer takes to complete it) that an AI model can finish with a 50% success rate. The researchers evaluated 12 frontier AI models on 170 tasks across three benchmarks: HCAST (97 diverse software tasks ranging from 1 minute to 30 hours), RE-Bench (7 difficult ML research engineering tasks, each 8 hours), and SWAA (66 short software actions taking 1-30 seconds). To calibrate task difficulty, they collected over 800 human baselines from professional developers, totaling 2,529 hours of work.

The headline finding is that this time horizon has been doubling every 7 months since 2019. GPT-2 could handle tasks that take humans about 2 seconds. The o3 model reached 110 minutes. Extrapolating this trend, AI reaches a one-month time horizon (167 working hours) between mid-2028 and mid-2031, with a central estimate of mid-2029. To be precise about what this means: the "167 working hours" refers to task complexity as measured by human effort, not AI execution time. An AI agent with a one-month time horizon could complete a task that takes a human developer a month of work within hours; it works continuously, doesn't context-switch, and executes steps faster. What the one-month time horizon means is that AI would be able to autonomously build a SaaS MVP, migrate a 200k-line legacy codebase to microservices, or implement complex features end-to-end.
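The extrapolation itself is simple compounding. A back-of-the-envelope check, using the paper's 110-minute o3 horizon and 7-month doubling time (treating the o3 measurement as early 2025 is my assumption for the arithmetic), lands near the paper's central estimate:

```python
import math

horizon_min = 110            # o3's 50% time horizon, in minutes
target_min = 167 * 60        # one working month = 167 hours
doubling_months = 7

doublings = math.log2(target_min / horizon_min)   # ~6.5 doublings needed
months = doublings * doubling_months              # ~46 months
print(f"~{months / 12:.1f} years from early 2025 -> around {2025 + months / 12:.0f}")
```

About 6.5 doublings at 7 months each is roughly 3.8 years, which is how the mid-2029 central estimate falls out of the trend line.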

That said, "50% success rate" is doing the heavy lifting in this claim. There is a significant reliability gap: the 80% success-rate time horizon is 4-6x shorter than the 50% horizon. A model that can sometimes complete a month-long task can only reliably complete week-long ones. Models also struggle in messy environments that lack clear feedback loops (like automated tests) and require proactive information gathering. This means well-structured, well-tested codebases will benefit far more than legacy systems with poor documentation. That is, greenfield work will see more gains than deep maintenance of complex existing systems.

Another caveat is that experiments with pull requests showed that AI performance aligns more closely with low-context contractors (who take 5-18x longer than expert maintainers) than with experienced developers who know the codebase. This matters because the extrapolation is calibrated against general-purpose developers, not domain experts. When the paper says "one-month time horizon", it means one month of contractor-level work, not one month of work by someone who built the system. For tasks that require deep institutional knowledge (debugging subtle production issues, evolving tightly coupled legacy systems) the effective time horizon is much shorter.

I think the external validity checks were a strong part of the paper. The paper shows that the trend holds on SWE-bench (with an even faster 70-day doubling time). Again, messy tasks show lower absolute success but the same rate of improvement. The methodology is clean and reproducible. The main weakness is the task distribution. These 170 tasks are mostly software tasks with high inter-model task correlation (r=0.73), and they largely represent isolated coding challenges rather than the full breadth of real-world software engineering. True full-stack ownership (gathering specifications, architectural decision-making, deployment, and live operations) remains untested here. Furthermore, while AI excels at isolated prototyping, engineering end-to-end production-grade distributed systems requires a high level of fault tolerance/reliability that remains incredibly challenging to automate.


Discussion: What does a one-month AI time horizon world look like?

If the extrapolation holds and AI systems can automate month-long software tasks by ~2029, what are the implications?

This changes the unit economics of software fundamentally. Tasks that currently require a team of developers working for weeks (building integrations, standing up new services, migrating databases, writing and testing features) become something you can delegate to an AI agent for a fraction of the cost and time. The AI agent operates at the level of a competent junior-to-mid engineer working on a well-scoped project. Software goes from expensive and slow to cheap and fast.

But something tells me we will find ways to squander this. When hardware kept getting faster through Moore's law, we produced bloated software that felt just as slow. Wirth's law ate Moore's law. When cloud computing made infrastructure cheap, we built sprawling microservice architectures that ended up like a ball of yarn: tangled, sprawling, and hard to unravel. Every time we reduce one cost, we expand scope until the new bottleneck hurts just as much as the old one. I expect AI-generated software to follow the same pattern: more software will mean more complexity, and new categories of problems we don't yet have names for. 50 shades of metastability?


What does the month-long task horizon mean for big tech, enterprise software, and startups? How the hell should I know? But let me speculate anyway, since you are still reading (or at least your LLMs are). 

Big tech companies employ tens of thousands of engineers specifically for the month-long projects this extrapolation targets: internal tools, service migrations, API implementations, data pipelines. AI will accelerate all of this. But these companies are structured around large engineering orgs. Promotion ladders, team hierarchies, planning processes all assume that software requires massive human headcount. If one engineer with AI agents does the work of five, the traditional organizational model breaks. Middle management's role becomes muddled. We may see flattening hierarchies and small high-leverage teams. Whether the behemoth organizations can handle such a rapid restructuring is unclear.

For infrastructure and database companies, the Jevons Paradox applies. Cheaper software means more of it: more applications, more databases, more demand for managed services. The total addressable market could grow substantially. But the flip side is that AI agents that can build month-long projects can also evaluate and switch infrastructure. Vendor lock-in weakens when migration is cheap, and customers become more price-sensitive. There's also the question of whether AI agents will develop strong default preferences shaped by their training data or their creator companies. This is a brand new market force, which we didn't have to contend with before.

I think the most dramatic impact is on software startups. Today, the primary cost of a startup is engineering talent. If AI handles month-long implementation tasks, that cost drops by an order of magnitude. The barrier to entry collapses. Many more products get built.  That may mean that differentiation shifts to domain expertise, data moats, and taste. I forecasted a similar trend earlier for academic papers.

The reliability gap tells us the world of 2029-2031 is probably not "AI replaces developers" but "developers who use AI effectively are 5-10x more productive". The scarce resource shifts from the ability to write code to the ability to specify what to build, evaluate whether it is correct, and manage the complexity of systems that grow much faster than before. What worries me more is whether organizations can restructure fast enough to keep up.

March 27, 2026

PostgreSQL: Bye-Bye MD5 Authentication. What’s Next?

Introduction MD5 has been the most popular algorithm for encoding passwords in PostgreSQL and other database systems. It is simple to implement and adds little overhead or latency to authentication, which is why it has been the preferred method. However, a discussion thread in the PG community has given a signal […]

March 26, 2026

OSTEP Chapter 13: The Abstraction of Address Spaces

Chapter 13 of OSTEP provides a primer on how and why modern operating systems abstract physical hardware.

This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.


Multiprogramming and Time Sharing

In the early days of computing, machines didn't provide much of a memory abstraction to users. The operating system was essentially a library of routines sitting at the bottom of physical memory, while a single running process occupied the rest of the available space. 

I am not that old, but I got to experience this in 2002 through programming sensor nodes with TinyOS. These "motes" ran on extremely constrained hardware with just 64KB of memory, so there was no complex virtualization. The entire OS and the application were compiled together as a single process, and all memory was statically preallocated. Ah, the joy of debugging a physically distributed multi-node deployment using three LEDs per node, with everyone shouting "red fast-blinking, yellow on". Here’s a snapshot from two mote deployments we ran at UB. Most of my pre-2010 work was on sensor networks. I realize I never wrote about that period. Maybe there’s a reason.

The shift away from the simple early computer systems began with multiprogramming. The OS switched among ready processes to increase CPU utilization on expensive machines. Later, the demand for interactivity gave rise to time sharing. Initially, implementing time sharing involved saving a process's entire memory state to disk and loading another, but this had terrible overhead. Thus, operating systems began keeping multiple processes resident in memory, enabling fast context switches.



The Address Space and Virtualization

To solve the protection problems caused by sharing memory, the OS introduced the address space. This abstraction is a running program's view of memory, containing three components. 

  • Program Code: The static instructions, placed at the top of the address space.
  • Heap: Dynamically allocated, user-managed memory that grows downward.
  • Stack: Used for local variables and function calls, growing upward from the bottom.

The crux of virtual memory is how the OS translates virtual addresses into physical addresses. The OS aims for three primary goals. Transparency: The program should act as if it has its own private physical memory. Efficiency: The OS must minimize time and space overhead. Protection: The OS must isolate processes, ensuring a program cannot affect the memory of the OS or other applications.
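The simplest hardware mechanism that serves these three goals, which OSTEP develops in a later chapter, is base-and-bounds relocation. A minimal sketch of the translation step:

```python
class SegmentationFault(Exception):
    pass

def translate(virtual_addr, base, bounds):
    """Base-and-bounds: check the virtual address against the bounds
    register (protection), then add the base register to relocate it
    (transparency); the check-and-add is cheap (efficiency)."""
    if not (0 <= virtual_addr < bounds):
        raise SegmentationFault(f"address {virtual_addr} out of bounds")
    return virtual_addr + base

# A 16 KB address space loaded at physical address 32 KB:
print(translate(1000, base=32 * 1024, bounds=16 * 1024))  # 33768
```

The process sees only virtual addresses starting at 0; the OS and hardware decide where they land physically, and any out-of-bounds access is trapped.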


GPUs and Beyond

So, do these classic principles still apply to modern hardware like GPUs? Yes, but the scale and complexity are larger. Early CPUs handled a few processes; GPUs handle thousands of threads. Yet, the foundational abstractions (virtual address spaces, hardware-assisted translation, and memory isolation) mostly stay the same (for now). CUDA Unified Memory pushes things further with a single address space across system RAM and GPU VRAM.

A major challenge in modern machine learning is the so-called memory wall. AI accelerators perform computations so fast that they waste most of their time waiting for data to arrive. Conventional system RAM cannot deliver the required bandwidth, and this forced the adoption of High Bandwidth Memory (HBM): stacks of memory chips placed adjacent to compute cores, connected with ultra-fine wiring capable of moving terabytes of data per second.

The memory wall is so severe that the main bottleneck in scaling LLMs today is memory. Generating text relies on the attention mechanism, which keeps Key-Value (KV) representations of all previous tokens to avoid recomputing context. When we present an AI with a huge codebase, the GPU’s ultra-fast but limited HBM is quickly exhausted. To avoid running out of memory, engineers have come up with new designs. They use tiered memory, keeping hot data in HBM while spilling overflow to cheaper CXL-attached system RAM. Cache offloading is another key technique: it moves inactive users' KV caches to the host CPU or fast NVMe SSDs to free GPU space.
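To get a feel for how fast HBM fills up, here is a rough KV-cache size estimate. The model dimensions below are illustrative assumptions for a 70B-class model, not figures for any specific LLM:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: a K and a V tensor per layer, fp16 elements."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed dimensions: 80 layers, 8 KV heads, head dimension 128,
# and a 128k-token context (roughly a large codebase prompt).
gib = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000) / 2**30
print(f"~{gib:.0f} GiB of HBM for a single 128k-token context")
```

Tens of GiB for one long-context sequence, before weights and activations, is why tiering and KV-cache offloading became necessary.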

2026 – MySQL Ecosystem Performance Benchmark Report

By Percona Lab Results  ·  2026  ·  MySQL, MariaDB, Percona, Benchmark, Database. MySQL Ecosystem Performance Benchmark Report 2026: Comparative Analysis of InnoDB-Compatible Engines (Percona Lab Results). Repository: github.com/Percona-Lab-results/2026-interactive-metrics. Interactive graphs available: explore the full dataset dynamically; click any graph to open the interactive version. OLTP Read-Write Local […]

Navigating Regional Network Blocks

A look at recent ISP and government-directed blocks of Supabase domains in three regions—what triggered them, how we worked with authorities and customers to restore access, and what multi-tenant platforms can do to prepare.

March 25, 2026

Non-First Normal Form (¬1NF) and MongoDB: an alternative to 4NF to address 3NF anomalies

SQL databases are grounded in relational algebra, but they are not the only databases with a strong theoretical basis. MongoDB, designed as a document database to solve practical engineering problems and improve developer experience, also builds on theory. It supports non–first-normal-form schemas and extends relational operators to work with them. This foundation is described in a 1986 paper, published 23 years before MongoDB: Theory of Non-First Normal Form Relational Databases

MongoDB's aggregation pipeline operators ($unwind, $group, $lookup, set operations) are concrete implementations of the paper's abstract algebraic operators.

I built the following examples while reading the paper to illustrate its concepts in practice.

The Basic ¬1NF Data Model

The paper defines nested relations where attributes can be atomic or relation-valued. Its example structure records facts about employees: some information about their children, their skills, and the exams they passed.

This maps directly to a MongoDB document schema:

// This IS the paper's example, expressed as MongoDB documents
db.employees.insertMany([
  {
    ename: "Smith",
    Children: [
      { name: "Sam", dob: "2/10/84" },
      { name: "Sue", dob: "1/20/85" }
    ],
    Skills: [
      {
        type: "typing",
        Exams: [
          { year: 1984, city: "Atlanta" },
          { year: 1985, city: "Dallas" }
        ]
      },
      {
        type: "dictation",
        Exams: [
          { year: 1984, city: "Atlanta" }
        ]
      }
    ]
  },
  {
    ename: "Watson",
    Children: [
      { name: "Sam", dob: "3/12/78" }
    ],
    Skills: [
      {
        type: "filing",
        Exams: [
          { year: 1984, city: "Atlanta" },
          { year: 1975, city: "Austin" },
          { year: 1971, city: "Austin" }
        ]
      },
      {
        type: "typing",
        Exams: [
          { year: 1962, city: "Waco" }
        ]
      }
    ]
  }
])

MongoDB's document model is essentially what Roth called a "database scheme" — a collection of rules where attributes can be zero-order (scalar fields) or higher-order (embedded arrays of documents). The paper's Figure 3-1 is literally a MongoDB collection.

From an object-oriented programming perspective this model looks natural, but the paper is from 1986, so the motivation lay elsewhere.

Motivation: ¬1NF as an alternative to 4NF decomposition

The paper's example (Figure 1-1) shows that an employee relation in 1NF requires 10 rows with massive redundancy:

employee | child | skill
-----------------------------
Smith    | Sam   | typing
Smith    | Sue   | typing
Smith    | Sam   | filing
Smith    | Sue   | filing
Jones    | Joe   | typing
Jones    | Mike  | typing
Jones    | Joe   | dictation
Jones    | Mike  | dictation
Jones    | Joe   | data entry
Jones    | Mike  | data entry

The ¬1NF version needs only 2 tuples. The paper notes three problems with 1NF:

  • Insert anomaly: Adding a child for Jones requires adding 3 tuples (one per skill)
  • Update anomaly: Changing Smith's "typing" to "word processing" requires updating multiple rows
  • Decomposition cost: The 1NF solution requires splitting into two tables and joining them back

Today, these anomalies are not a major problem because a single updateMany() or insertMany() operation in a MongoDB transaction is atomic, just like an UPDATE or INSERT in SQL databases. More importantly, the document model makes multi-document operations rare; still, the Cartesian explosion of redundant data remains undesirable.

The first four normal forms (1NF, 2NF, 3NF, BCNF) do not address this, because there are no non-trivial functional dependencies in this relation: all three attributes together form the only candidate key. This relation is therefore already in BCNF. This is precisely why Fagin introduced 4NF in 1977.

This example illustrates a multivalued dependency (MVD) violation: children and skills are independent multivalued attributes of an employee, but they’re stored together. This creates a redundant Cartesian product (10 rows for 2 employees). Fourth Normal Form (4NF) addresses this by splitting the independent dependencies into separate tables to remove these anomalies, but at the cost of more tables touched per transaction and more joins per query.
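To see the 4NF tradeoff concretely, here is a sketch of the decomposition in Python, with the rows taken from the paper's Figure 1-1 table above:

```python
# The Figure 1-1 relation, flattened to 1NF rows: (employee, child, skill).
rows = [
    ("Smith", "Sam", "typing"), ("Smith", "Sue", "typing"),
    ("Smith", "Sam", "filing"), ("Smith", "Sue", "filing"),
    ("Jones", "Joe", "typing"), ("Jones", "Mike", "typing"),
    ("Jones", "Joe", "dictation"), ("Jones", "Mike", "dictation"),
    ("Jones", "Joe", "data entry"), ("Jones", "Mike", "data entry"),
]

# 4NF splits the two independent MVDs into separate tables.
emp_child = sorted({(e, c) for e, c, _ in rows})
emp_skill = sorted({(e, s) for e, _, s in rows})
print(len(rows), "rows ->", len(emp_child), "+", len(emp_skill))  # 10 rows -> 4 + 5

# The natural join losslessly reconstructs the original relation;
# that join is the cost 4NF pays to remove the redundancy.
rejoined = {(e, c, s) for e, c in emp_child for e2, s in emp_skill if e == e2}
assert rejoined == set(rows)
```

The Cartesian redundancy (2 children x 2-3 skills per employee) disappears, but every query that needs both facts must now perform the join.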

Even a Third Normal Form (3NF) schema can still suffer from insert and update anomalies and data redundancy. MongoDB’s document model can offer many of 4NF’s benefits without incurring the join overhead.

Here is the MongoDB Equivalent:

// The ¬1NF representation — exactly what MongoDB does naturally
db.employees.insertMany([
  {
    employee: "Smith",
    Children: [{ child: "Sam" }, { child: "Sue" }],
    Skills: [{ skill: "typing" }, { skill: "filing" }]
  },
  {
    employee: "Jones",
    Children: [{ child: "Joe" }, { child: "Mike" }],
    Skills: [
      { skill: "typing" },
      { skill: "dictation" },
      { skill: "data entry" }
    ]
  }
])

Adding a new child for Jones is just one operation, no anomaly:


db.employees.updateOne(
  { employee: "Jones" },
  { $push: { Children: { child: "Sara" } } }
)

Changing Smith's "typing" to "word processing" — one operation:


db.employees.updateOne(
  { employee: "Smith", "Skills.skill": "typing" },
  { $set: { "Skills.$.skill": "word processing" } }
)

Querying everything about an employee doesn't need a join:


db.employees.findOne({ employee: "Smith" })

MongoDB exists for a fundamental reason: its document model eliminates the need for decomposition and joins, as well as the update anomalies introduced by the 1NF constraint. Most workloads rely on single-document operations, which MongoDB physically optimizes as single-shard transactions, a single disk read or write, and a single replication operation. These inserts and updates lock the document, which serves as the consistency boundary, but modify only the fields whose values change and update only the corresponding index entries.

Nest and Unnest Operators

The paper defines nest (ν) to aggregate flat rows into nested structures, and unnest (μ) to flatten nested structures back:

  • ν_{B=(C,D)}(r): nest attributes C,D into a new nested relation B
  • μ_{B}(r): unnest nested relation B
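To make the semantics concrete, both operators can be sketched as pure Python functions over lists of dicts. This is an illustration of the idea, not the paper's formal definitions:

```python
from itertools import groupby

def unnest(rows, attr):
    """μ: flatten the nested attribute, duplicating the remaining fields."""
    return [{**{k: v for k, v in r.items() if k != attr}, **item}
            for r in rows for item in r[attr]]

def nest(rows, attr, keys):
    """ν: group rows on `keys`, collecting the other fields under `attr`."""
    keyfn = lambda r: tuple(r[k] for k in keys)
    nested = []
    for kv, grp in groupby(sorted(rows, key=keyfn), key=keyfn):
        members = [{k: v for k, v in r.items() if k not in keys} for r in grp]
        nested.append({**dict(zip(keys, kv)), attr: members})
    return nested

emps = [{"ename": "Smith", "Children": [{"name": "Sam"}, {"name": "Sue"}]}]
flat = unnest(emps, "Children")                    # one flat row per child
assert nest(flat, "Children", ["ename"]) == emps   # ν inverts μ here (PNF holds)
```

Note that the round trip works here only because ename acts as a key; that caveat is exactly the PNF discussion later in the post.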

MongoDB's aggregation pipeline has direct equivalents.

The paper's unnest (μ) operator is $unwind:

db.employees.aggregate([
  { $unwind: "$Skills" }
])

Each skill becomes its own document, duplicating employee info.

Deep unnest (μ* in the paper) is possible with multiple $unwind stages:

db.employees.aggregate([
  { $unwind: "$Children" },
  { $unwind: "$Skills" }
])

This produces the full 1NF Cartesian expansion, flattening both Children and Skills.

When starting with a flat collection, the paper's nest operator (ν) is $group:

db.flat_employees.aggregate([
  {
    $group: {
      _id: "$employee",
      Children: { $addToSet: { child: "$child" } },
      Skills:   { $addToSet: { skill: "$skill" } }
    }
  }
])

The paper proves that the order of unnesting doesn't matter (Thomas & Fischer's result). In MongoDB, these produce the same fully-flat result:

db.employees.aggregate([
  { $unwind: "$Children" },
  { $unwind: "$Skills" }
])

db.employees.aggregate([
  { $unwind: "$Skills" },
  { $unwind: "$Children" }
])

The paper also notes that nest is not always an inverse for unnest in general, but IS an inverse for Partitioned Normal Form (PNF) relations.

Partitioned Normal Form (PNF)

PNF requires that the atomic attributes of each relation (and each nested relation) form a key. The paper shows a pathological relation (Figure 1-3) that violates PNF: Smith, the same employee, appears with two different skill sets, and Jones has a duplicate across sets.

In MongoDB, PNF corresponds to the fundamental design principle that _id determines the document. The pathological case above would mean having two documents for "Smith" with different Skills arrays — which is exactly what MongoDB's _id uniqueness constraint prevents at the collection level.

Without using _id as the employee identifier, we could have two employees with the same name:

db.bad_design.insertMany([
  { employee: "Smith", Skills: ["typing", "filing"] },
  { employee: "Smith", Skills: ["sorting", "mailing"] }
])

The correct design if we don't want to use _id is enforcing PNF with a unique index:

db.good_design.createIndex({ employee: 1 }, { unique: true })

db.good_design.insertOne({
  employee: "Smith",
  Skills: ["typing", "filing", "sorting", "mailing"]
})

As in many papers, the employee name is used as an identifier (two "Smith" tuples refer to the same person), but in real life there would obviously be a generated identifier for the physical person.

In Theorem 5-1, the paper proves that PNF is closed under unnesting. In MongoDB terms: if your documents are well-designed (one document per logical entity), then $unwind won't create ambiguous or inconsistent flat results.

For nested relations, PNF also applies recursively.

MongoDB schema validation can enforce the PNF constraint by comparing the size of an array with the size when duplicates are ignored (using $setUnion with an empty array):

db.createCollection("employees", {
  validator: {
    $and: [
      // JSON Schema for structure and types
      {
        $jsonSchema: {
          bsonType: "object",
          required: ["ename"],
          properties: {
            ename: { bsonType: "string" },
            Children: { bsonType: "array", items: { bsonType: "object", required: ["name"], properties: { name: { bsonType: "string" }, dob: { bsonType: "string" } } } },
            Skills: { bsonType: "array", items: { bsonType: "object", required: ["type"], properties: { type: { bsonType: "string" },
             Exams: { bsonType: "array", items: { bsonType: "object", required: ["year"], properties: { year: { bsonType: "int" }, city: { bsonType: "string" } } } } } } }
          }
        }
      },
      // PNF uniqueness constraints using $expr
      {
        $expr: {
          $and: [
            // Level 1: Children.name must be unique within the document
            {
              $eq: [
                { $size: { $ifNull: ["$Children.name", []] } },
                { $size: { $setUnion: [{ $ifNull: ["$Children.name", []] }] } }
              ]
            },
            // Level 1: Skills.type must be unique within the document
            {
              $eq: [
                { $size: { $ifNull: ["$Skills.type", []] } },
                { $size: { $setUnion: [{ $ifNull: ["$Skills.type", []] }] } }
              ]
            },
            // Level 2: Within EACH Skills element, Exams.year must be unique
            {
              $allElementsTrue: {
                $map: {
                  input: { $ifNull: ["$Skills", []] },
                  as: "skill",
                  in: { $eq: [ { $size: { $ifNull: ["$$skill.Exams.year", []] } }, { $size: { $setUnion: [{ $ifNull: ["$$skill.Exams.year", []] }] } } ] }
                }
              }
            }
          ]
        }
      }
    ]
  },
  validationLevel: "strict",
  validationAction: "error"
});

The fact that nest is not always the inverse of unnest is a well-known MongoDB pitfall:

db.employees.drop()  
db.employees.insertMany([  
  {  
    _id: 1,  
    employee: "Smith",  
    Children: [{ child: "Sam" }, { child: "Sue" }],  
    Skills: [{ skill: "typing" }]  
  },  
  {  
    _id: 2,  
    employee: "Smith",  
    Children: [{ child: "Tom" }],  
    Skills: [{ skill: "filing" }, { skill: "sorting" }]  
  },  
  {  
    _id: 3,  
    employee: "Jones",  
    Children: [{ child: "Joe" }],  
    Skills: [{ skill: "typing" }]  
  }  
])  

db.employees.aggregate([  
  // unnest (μ) 
  { $unwind: "$Children" },  
  // nest (ν), grouping by employee, which is NOT a key  
  {  
    $group: {  
      _id: "$employee",  
      Children: { $addToSet: "$Children" },  
      Skills: { $first: "$Skills" }  
    }  
  },  
  // Clean up  
  { $project: { _id: 0, employee: "$_id", Children: 1, Skills: 1 } },  
  { $sort: { employee: 1 } }  
])  

This results in two Smiths collapsed into one:

[
  {
    Children: [ { child: 'Joe' } ],
    Skills: [ { skill: 'typing' } ],
    employee: 'Jones'
  },
  {
    Children: [ { child: 'Tom' }, { child: 'Sam' }, { child: 'Sue' } ],
    Skills: [ { skill: 'typing' } ],
    employee: 'Smith'
  }
]

The correct way is grouping by the proper key:

db.employees.aggregate([  
  // unnest (μ)
  { $unwind: "$Children" },  
  // nest (ν), grouping by _id, which IS a key (PNF)  
  {  
    $group: {  
      _id: "$_id",  
      employee: { $first: "$employee" },  
      Children: { $push: "$Children" },  
      Skills: { $first: "$Skills" }  
    }  
  },  
  // Clean up  
  { $project: { _id: 1, employee: 1, Children: 1, Skills: 1 } },  
  { $sort: { _id: 1 } }  
])  

This is the paper's Theorem 5-2 demonstrated in MongoDB: nest inverts unnest if and only if the grouping key satisfies PNF.

Extended Algebra Operators

In a previous post, From Relational Algebra to Document Semantics, I explained how MongoDB extends the semantics of relational Selection (σ) to non-1NF schemas, and mentioned that other relational operations are also available in MongoDB. They are covered by the ¬1NF paper.

The paper defines extended union (∪ᵉ) to merge nested relations for tuples that agree on atomic attributes, rather than treating them as separate tuples. In MongoDB, this is achieved with $merge or with aggregation. Suppose we want to merge new course data into existing student records:

db.students.insertMany([
  {
    sname: "Jones",
    Courses: [
      { cname: "Math", grade: "A" },
      { cname: "Science", grade: "B" }
    ]
  },
  {
    sname: "Smith",
    Courses: [
      { cname: "Math", grade: "A" },
      { cname: "Physics", grade: "C" },
      { cname: "Science", grade: "A" }
    ]
  }
])

db.students.updateOne(
  { sname: "Jones" },
  { $addToSet: { Courses: { cname: "Physics", grade: "B" } } }
)

db.students.updateOne(
  { sname: "Smith" },
  {
    $addToSet: {
      Courses: {
        $each: [
          { cname: "Chemistry", grade: "A" },
          { cname: "English", grade: "B" }
        ]
      }
    }
  }
)

This added "Physics:B" to Jones, and "Chemistry:A" and "English:B" to Smith, without adding new tuples with different course sets as a standard union would:

db.students.find()

[
  {
    _id: ObjectId('69c3a72344fc089068d4b0c2'),
    sname: 'Jones',
    Courses: [
      { cname: 'Math', grade: 'A' },
      { cname: 'Science', grade: 'B' },
      { cname: 'Physics', grade: 'B' }
    ]
  },
  {
    _id: ObjectId('69c3a72344fc089068d4b0c3'),
    sname: 'Smith',
    Courses: [
      { cname: 'Math', grade: 'A' },
      { cname: 'Physics', grade: 'C' },
      { cname: 'Science', grade: 'A' },
      { cname: 'Chemistry', grade: 'A' },
      { cname: 'English', grade: 'B' }
    ]
  }
]

The paper emphasizes that standard set union treats entire tuples as atomic — if two tuples differ at all, both appear in the result. Extended union is ... (truncated)
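A minimal plain-JavaScript sketch of these semantics, using hypothetical data: tuples that agree on the atomic attribute (sname) are merged and their nested Courses are unioned, rather than kept as separate tuples.

```javascript
// Sketch of extended union (∪ᵉ) over nested relations. Hypothetical data;
// a standard set union of r1 and r2 would keep three separate tuples.
const r1 = [{ sname: "Jones", Courses: [{ cname: "Math", grade: "A" }] }];
const r2 = [
  { sname: "Jones", Courses: [{ cname: "Physics", grade: "B" }] },
  { sname: "Smith", Courses: [{ cname: "Math", grade: "A" }] },
];

const extendedUnion = (a, b) => {
  const out = new Map();
  for (const t of [...a, ...b]) {
    // tuples agreeing on the atomic attribute are merged, not duplicated
    if (!out.has(t.sname)) out.set(t.sname, { sname: t.sname, Courses: [] });
    const row = out.get(t.sname);
    for (const c of t.Courses) {
      // set semantics on the nested relation: skip courses already present
      if (!row.Courses.some((x) => x.cname === c.cname && x.grade === c.grade)) {
        row.Courses.push(c);
      }
    }
  }
  return [...out.values()];
};

const merged = extendedUnion(r1, r2);
```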

March 24, 2026

MongoDB Transaction Performance

Many believe MongoDB transactions are slow, but this misconception often comes from misunderstanding optimistic locking — where transactions retry on conflict rather than blocking, making perceived slowness a function of contention, not inherent overhead (see this previous post).

In MongoDB, all data manipulation operations are transactional at the storage engine level. Single-document operations (insert, update, delete) use internal storage transactions via the WriteUnitOfWork and RecoveryUnit interfaces, ensuring ACID properties for each document change and its index entries.

In MongoDB, the term transaction refers specifically to multi-document transactions, which provide atomicity across multiple documents and collections via sessions. Their performance differs from single-document operations, but they are not inherently faster or slower:

  • Atomicity: Multi-document transactions use extra memory to track uncommitted changes and maintain transaction state across operations.

  • Consistency: Both single-document and multi-document operations enforce index updates and schema validation at write time.

  • Isolation: Multi-document transactions offer snapshot isolation using optimistic concurrency control, which can lead to write conflicts. When conflicts occur, MongoDB labels the error as TransientTransactionError — clients should handle this with retry logic and exponential backoff, as described in the previous post.

  • Durability: Multi-document transactions batch changes into fewer oplog entries, so they can reduce latency compared to multiple single-document transactions.
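The retry handling mentioned under Isolation can be sketched as a small driver-agnostic loop. This is an illustration, not the official driver API; runTxn stands in for a transaction body that throws errors carrying the TransientTransactionError label:

```javascript
// Minimal sketch of retry-with-exponential-backoff for transient
// transaction errors. `runTxn` is a hypothetical async callback that
// performs the transaction and throws on write conflict.
async function withTxnRetry(runTxn, maxRetries = 5) {
  let delayMs = 10;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await runTxn();
    } catch (err) {
      const transient =
        err.errorLabels && err.errorLabels.includes("TransientTransactionError");
      if (!transient || attempt === maxRetries) throw err;
      // back off before retrying so contending writers spread out
      await new Promise((res) => setTimeout(res, delayMs));
      delayMs *= 2;
    }
  }
}
```

With optimistic concurrency control, perceived slowness is a function of how often this loop has to go around, not of per-transaction overhead.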

Example: Inserting One Million Documents

Here is an example on a MongoDB Atlas free cluster where I insert one million documents with a single insertMany() call:

db.test.drop();
const start = new Date();
// Insert a million documents
const res = db.test.insertMany(
  Array.from({ length: 1e6 }, (_, i) => ({
    name: `user_${i}`,
    value: Math.random(),
  }))
);
// Show timing
const elapsed = new Date() - start;
print(`Elapsed: ${elapsed} ms`);

On a free MongoDB Atlas cluster, this operation takes Elapsed: 53025 ms. Although it's a single call, ACID properties apply per document — each document is its own unit of consistency, much like an aggregate in domain-driven design. This means another session can see some inserted documents before the operation completes.

If we want the inserted documents to be visible only when they are all there, we can run the same operation in a transaction:

db.test.drop();
// Start a transaction in a session
const se = db.getMongo().startSession();
const sessionDb = se.getDatabase(db.getName());
const start = new Date();
se.startTransaction();
// Insert a million documents
const res = sessionDb.test.insertMany(
  Array.from({ length: 1e6 }, (_, i) => ({
    name: `user_${i}`,
    value: Math.random(),
  }))
);
// Commit and show timing
se.commitTransaction();
const elapsed = new Date() - start;
print(`Elapsed: ${elapsed} ms`);
se.endSession();

On a MongoDB Atlas free cluster, this takes Elapsed: 49047 ms, which is slightly faster. Note that this approaches the default transactionLifetimeLimitSeconds of 60 seconds — larger operations or slower clusters should adjust this parameter to avoid hitting the limit. MongoDB is optimized for OLTP with short transactions — it’s better to fail than to wait an unpredictable amount of time.

You might expect a larger speedup from batched durability, but single-document transactions are already highly optimized. MongoDB can piggyback multiple inserts into a single applyOps oplog entry, and WiredTiger batches log flushes so multiple transactions can share a single disk sync. For j: true and w: majority (the defaults), MongoDB often triggers a journal flush without waiting, piggybacking on replication acknowledgment to ensure durability.

When you hear that transactions are slow, remember this is usually a misunderstanding. Use transactions as your application requires, based on its atomicity, consistency, isolation, and durability boundaries. Performance depends on whether documents are on the same shard or the transaction spans multiple shards. Transactions aren’t inherently slow—the document model is tuned to the domain model’s consistency boundaries, favoring single-document transactions.

Sysbench vs MySQL on a small server: no new regressions, many old ones

This post has performance results for InnoDB from MySQL 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0 on a small server with sysbench microbenchmarks. The workload here is cached by InnoDB and my focus is on regressions from new CPU overheads.

In many cases, MySQL 5.6.51 gets about 1.5X more QPS than modern MySQL (8.0.x thru 9.7). The root cause is new CPU overhead, possibly from code bloat.

tl;dr

  • There are too many performance regressions in MySQL 8.0.X
  • There are few performance regressions in MySQL 8.4 through 9.7.0
  • In many cases MySQL 5.6.51 gets ~1.5X more QPS than 9.7.0 because 9.7.0 uses more CPU
  • Large regressions arrived in MySQL 8.0.30 and 8.0.32, especially for full-table scans

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0. For MySQL 8.0.X I used 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf files are here for 5.6, 5.7, 8.4 and 9.7.

The my.cnf files are here for 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide tables below with relative QPS. When the relative QPS is > 1, some version is faster than the base version. When it is < 1, there might be a regression. The relative QPS is computed as:
(QPS for some version) / (QPS for base version)
where the base version is either MySQL 5.6.51 or 8.0.28. Values from iostat and vmstat divided by QPS are here for 5.6.51 as the base version and then here for 8.0.28 as the base version. These can help explain why something is faster or slower because they show how much hardware is used per request.
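As a trivial sketch of the ratio (the QPS numbers here are made up for illustration, not taken from the benchmark results):

```javascript
// Relative QPS: (QPS for some version) / (QPS for the base version).
// Invented numbers for illustration only.
const qps = { "5.6.51": 1500, "8.0.28": 1100, "9.7.0": 1000 };

const relativeQps = (version, base) => qps[version] / qps[base];

// A value < 1 suggests a possible regression relative to the base
const vsOld = relativeQps("9.7.0", "5.6.51"); // 1000 / 1500
```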

Results: point queries

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the regressions in the random-points tests are larger for range=10 than range=1000 (larger when the range is smaller). So the regressions are more likely to be in places other than InnoDB. The problem is new CPU overhead (see cpu/o here) which is 1.55X larger in 9.7.0 vs 5.6.51 for random-points_range=10 but only 1.19X larger in 9.7.0 for random-points_range=1000.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.87    0.65    0.65    0.64    hot-points
0.87    0.69    0.67    0.63    point-query
0.87    0.72    0.72    0.71    points-covered-pk
0.90    0.78    0.78    0.76    points-covered-si
0.89    0.73    0.72    0.71    points-notcovered-pk
0.89    0.77    0.76    0.75    points-notcovered-si
1.00    0.84    0.83    0.83    random-points_range=1000
0.89    0.72    0.72    0.72    random-points_range=100
0.87    0.69    0.68    0.66    random-points_range=10

Summary:
  • The large regressions in 8.0.x for point queries (see above) occur prior to 8.0.28
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.92    1.14    1.14    1.12    1.17    1.16    1.16    1.16    hot-points
0.97    0.97    0.95    0.96    0.95    0.95    0.95    0.95    point-query
0.94    1.09    1.09    1.08    1.12    1.12    1.11    1.15    points-covered-pk
0.90    1.08    1.07    1.07    1.12    1.13    1.12    1.16    points-covered-si
0.91    1.04    1.04    1.03    1.07    1.07    1.06    1.11    points-notcovered-pk
0.88    0.96    0.96    0.95    1.00    1.01    1.00    1.06    points-notcovered-si
0.79    2.35    2.42    2.37    2.45    2.45    2.47    2.56    random-points_range=1000
0.94    1.07    1.06    1.06    1.09    1.08    1.10    1.12    random-points_range=100
0.93    0.94    0.93    0.93    0.94    0.94    0.93    0.95    random-points_range=10

Results: range queries without aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the problem is new CPU overhead and for the scan test the CPU overhead per query is about 1.5X larger in modern MySQL (8.0 thru 9.7) relative to MySQL 5.6.51 (see cpu/o here)
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.83    0.68    0.66    0.65    range-covered-pk
0.83    0.70    0.69    0.67    range-covered-si
0.84    0.66    0.65    0.64    range-notcovered-pk
0.88    0.74    0.73    0.73    range-notcovered-si
0.84    0.67    0.66    0.67    scan

Summary:
  • There is a large regression in 8.0.30 and a larger one in 8.0.32
  • The scan test is the worst case for the regression.
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.92    0.92    0.92    0.93    0.93    0.96    range-covered-pk
0.96    0.96    0.94    0.93    0.93    0.94    0.93    0.95    range-covered-si
0.94    0.94    0.93    0.93    0.94    0.94    0.93    0.93    range-notcovered-pk
0.89    0.87    0.87    0.86    0.89    0.91    0.89    0.95    range-notcovered-si
0.93    0.92    0.79    0.82    0.83    0.77    0.82    0.80    scan

Results: range queries with aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.86    0.70    0.69    0.68    read-only-count
1.42    1.27    1.24    1.23    read-only-distinct
0.91    0.75    0.74    0.73    read-only-order
1.23    1.01    1.01    1.01    read-only_range=10000
0.93    0.77    0.76    0.74    read-only_range=100
0.86    0.69    0.68    0.66    read-only_range=10
0.83    0.68    0.68    0.66    read-only-simple
0.83    0.67    0.67    0.66    read-only-sum

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.87    0.87    0.88    0.87    0.89    0.91    read-only-count
0.97    0.96    0.94    0.95    0.96    0.95    0.95    0.96    read-only-distinct
0.97    0.96    0.93    0.95    0.95    0.94    0.95    0.95    read-only-order
0.96    0.95    0.93    0.94    0.95    0.95    0.96    0.98    read-only_range=10000
0.96    0.96    0.94    0.95    0.95    0.94    0.95    0.94    read-only_range=100
0.96    0.97    0.95    0.95    0.95    0.94    0.95    0.94    read-only_range=10
0.94    0.94    0.92    0.93    0.93    0.93    0.94    0.94    read-only-simple
0.94    0.94    0.89    0.91    0.92    0.90    0.93    0.91    read-only-sum

Results: writes

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the insert test is the worst case and a big part of that is new CPU overhead, see cpu/o here, where it is 2.13X larger in 9.7.0 than 5.6.51. But for update-one the problem is writing more to storage per commit (see wkbpi here) rather than new CPU overhead.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.85    0.60    0.59    0.55    delete
0.81    0.55    0.54    0.52    insert
0.93    0.75    0.74    0.71    read-write_range=100
0.87    0.70    0.68    0.66    read-write_range=10
1.20    0.88    0.89    0.91    update-index
1.04    0.74    0.73    0.71    update-inlist
0.87    0.62    0.61    0.57    update-nonindex
0.87    0.62    0.60    0.57    update-one
0.87    0.63    0.61    0.58    update-zipf
0.93    0.69    0.68    0.66    write-only

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.96    0.95    0.92    0.92    0.92    0.91    0.91    0.91    delete
0.94    0.93    0.91    0.91    0.91    0.90    0.90    0.90    insert
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=100
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=10
0.91    0.91    0.84    0.84    0.86    0.85    0.86    0.79    update-index
0.94    0.95    0.92    0.91    0.92    0.91    0.91    0.91    update-inlist
0.95    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-nonindex
0.96    0.96    0.93    0.92    0.92    0.91    0.92    0.91    update-one
0.96    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-zipf
0.94    0.94    0.91    0.91    0.91    0.91    0.91    0.89    write-only

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

This paper presents SysMoBench, a benchmark designed to evaluate generative AI's ability to formally model complex concurrent and distributed systems. Although the paper was published in January 2026, the AI landscape moves so fast that the models evaluated (like Claude-Sonnet-4 and GPT-5) already feel dated after the release of heavy hitters like Claude 3.5 Opus or OpenAI's Codex.

The paper draws a distinction between algorithms/protocols and system modeling. As the paper (somewhat circularly) defines it, "system models enable verification of system code via comprehensive testing and model checking". The paper says: Modeling these systems requires the AI to deeply understand the system design, reason about safety and liveness under unexpected faults, and abstract system behavior into an executable program.


This paper is a great follow-up to my post yesterday on TLA+ Mental Models. It is instructive to see how the paper corroborates my claims in that post. My post identified deciding what to abstract as the hardest skill in TLA+. Rather than letting the AI decide what to ignore, the benchmark tasks in Section 3.1 give the AI detailed instructions about what core actions to model and what implementation details to explicitly exclude.

The paper also validates my observation about invariants being the most signal-heavy part of a spec and, hence, being the most difficult for AI and people to write. The authors abandoned having the AI write invariants from scratch; instead they provide invariant templates that contain both a natural language description and a formal TLA+ example. The AI is only asked to concretize these templates by mapping them to its own variable names.

I'll be criticizing the paper in several places in my writeup below, so just to make it clear upfront, I really like the paper. This paper is a strong accept from me for any conference.

SysMoBench uses the following automated quality metrics for evaluating TLA+ models:

  • Syntax correctness: Statically checked using the SANY Syntactic Analyzer.
  • Runtime correctness: Checked by running the TLC model checker as a proxy for logical self-consistency.
  • Conformance: Measured via trace validation to see if the model conforms to the actual system implementation.
  • Invariant correctness: Model-checked against system-specific safety and liveness properties.

Here "conformance" looks contentious to me because of the uncertainty around how refinement mappings are handled and how step-size impedance mismatches are resolved. The paper circumvents this by instrumenting the system code to generate trace logs, and then uses an LLM to automatically map the TLA+ variables and actions to the code traces. If AI-generated models inherently lean towards line-by-line code translation, are we risking losing the cognitive and design benefits of abstraction? Does the Conformance metric inherently bias the benchmark against good abstraction? How could a benchmark be designed to automatically evaluate and reward an AI for successfully cutting away irrelevant mechanics rather than matching them?


Looking at Table 3, the LLMs fail miserably on complex systems, with only Claude-Sonnet-4 managing to piece together anything functional (and even then, scoring very low on conformance for Etcd Raft). The paper's Analysis on LLMs highlights fundamental weaknesses. LLMs constantly introduce syntax errors (as shown in Figure 4a). For instance, DeepSeek-R1 frequently hallucinates mathematical symbols (like ∩ and ∀) instead of standard ASCII TLA+ operators. GPT-5 and Gemini-2.5-Pro frequently mix TLA+ syntax with Python. Regarding runtime errors (Figure 4b), LLMs frequently generate inconsistent TLC configurations or fundamentally misunderstand TLA+ data structures (e.g., trying to apply set operations to records). I think that things have improved significantly with newer models, so an update from the team would be really great!

The paper notes that LLMs violated 41.9% of liveness properties but only 8.3% of safety properties (as seen in Figure 4c). This indicates a severe limitation in temporal reasoning. The violations are largely due to missing or incorrectly specified fairness assumptions, though logical structural errors often block the model from making progress before the fairness issues even manifest.

The Trace Learning Agent, which attempts to infer the TLA+ model purely from system execution traces rather than source code, also failed miserably. And I think this is another indication of the temporal reasoning problem LLMs have for modeling. The trace learning agent underperformed compared to the basic modeling and code translation agents, failing to pass runtime checks entirely. Even Claude-Sonnet-4 achieved extremely low syntax scores when using the Trace Learning Agent.

The paper is a great start for an LLM system modeling benchmark. But it currently lacks diversity as the dataset leans heavily on consensus protocols. Out of the 7 distributed systems artifacts, a huge portion (Etcd Raft, Redis Raft, Xline CURP, PGo raftkvs, ZooKeeper FLE) are essentially variations of distributed consensus/leader election. The authors mention that adding new systems to SysMoBench requires significant effort to instrument the system to collect execution logs for trace validation. I think they need a "better sell". Why should an engineer go through the effort to instrument their system for SysMoBench if the AI is going to fail at modeling it? A clearer value proposition is needed to drive community contributions here.

Restoring a 2018 iPad Pro

This was surprisingly hard to find—hat tip to Reddit’s Nakkokaro and xBl4ck. Apple’s instructions for restoring an iPad Pro (3rd generation, 2018) seem to be wrong; both an Apple Store technician and I found that Finder, at least in Tahoe, won’t show the iPad once it reboots in recovery mode. The trick seems to be that you need to unplug the cable, start the reset process, and plug the cable back in during the reset:

  1. Unplug the USB cable from the iPad.
  2. Tap volume-up.
  3. Tap volume-down.
  4. Begin holding the power button.
  5. After roughly two seconds of holding the power button, plug in the USB cable.
  6. Continue holding until the iPad reboots in recovery mode.

Hopefully this helps someone else!

March 23, 2026

TLA+ mental models

In the age of LLMs, syntax is no longer the bottleneck for writing, reading, or learning TLA+. People are even getting value by generating TLA+ models and counterexamples directly from Google Docs descriptions of the algorithms. The accidental complexity of TLA+ (its syntax and tooling) is going away.

But the intrinsic complexity remains: knowing where to start a model, what to ignore, and how to choose the right abstractions. This is modeling judgment, and it is the hardest skill to teach. Engineers are trained to think in code, control flow, and local state. TLA+ forces you into a different mode: mathematical, declarative, and global. You specify what must hold, not how to achieve it. Once you get comfortable with this shift, it changes how you think about systems, even away from the keyboard.

In a companion post, I described TLA+ as a design accelerator based on lessons from 8 industry projects. Here I want to go deeper and articulate the mental models behind effective TLA+ use. These are the thinking patterns that TLA+ experts apply implicitly; this is the kind of knowledge acquired by osmosis over the years. I will try to make them explicit and actionable.

The mental models are:

  1. Abstraction, abstraction, abstraction
  2. Embrace the global shared memory model
  3. Refine to local guards and effects (slow is fast)
  4. Derive good invariants
  5. Explore alternatives through stepwise refinement
  6. Aggressively refine atomicity
  7. Share your mental models


1. Abstraction, abstraction, abstraction

Abstraction is a powerful tool for avoiding distraction. The etymology of the word abstract comes from Latin for "cut and draw away". With abstraction, you slice out the protocol from a complex system, omit unnecessary details, and simplify it into a useful model. You don't have to model everything. Most of the value comes from knowing what to ignore. Omission is the default: add a component only when leaving it out breaks your reasoning goal.

Abstraction is the art of knowing what to discard. You can cross-cut to focus on the protocol you want to investigate. If you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication when that is an unnecessary distraction. If you are interested in a replication protocol, you can abstract away the client interaction. Consider two examples from our industry experience.

  • CosmosDB consistency modeling: We needed to specify CosmosDB's client-facing consistency semantics. The anti-pattern would have been modeling the distributed database engine, which would have caused an immediate state-space explosion and an unreadable spec. Instead, we modeled just the "history as a log" abstraction for client-facing behavior. The inner details of the database were "environment"; irrelevant to what we were trying to reason about. We used sort-merge to capture the internals of replication, and a read index to model consistency. This way five consistency levels became clear predicates over operation histories.
  • Secondary Index at Aurora DSQL: Here, we focused on the secondary index and abstracted away the rest of the database as a log of operations. Log is a common abstraction for State Machine Replication (SMR), which is itself a common abstraction for databases and data stores. By cutting the problem down to its essential slice, we made a complex system tractable.

Leslie Lamport gave a beautiful demonstration of this when he described how he derived Paxos. He started with the most abstract specification: defining agreement based on a vote array. In his words: "I don't remember the thought process that led me to Paxos. But I knew that execution on computers sending messages to one another was an irrelevant detail. I was thinking only about a set of processes and what they needed to know about one another. How they could get that information from messages was the easy part that came later. What I had first was what, many years later, I described as the Voting algorithm."

TLA+ is useful for teaching engineers the art of abstraction and the right way to think and reason about distributed systems. The modeling process itself trains you to ask: what is the behavioral slice I care about, and what can I safely ignore? And because TLA+ gives you a rapid prototyping tool with a tight feedback loop (write a model, check it, revise) you accumulate design experience much faster than you would by building and debugging real systems.


2. Embrace the Global Shared Memory Model

TLA+ gives you a deliberate fiction: a global shared memory that all processes can read and write. This fiction is the foundation of its computational model, and understanding it is essential to thinking in TLA+.

In TLA+, a program consists of two things: (1) a set of variables that define a global state space, and (2) a finite set of actions that transition from one state to the next. This is state-centric reasoning: everything is a predicate (a function mapping to a boolean). This approach promotes invariant-based thinking (see mental model 4).

Each action follows Sir Tony Hoare's triples: {state} action {state'}. The execution of an action is conditional on a predicate (the "guard") being true. For example: x > 0  →  x := 0; y := y + 1. Or in TLA+ notation:

/\ x > 0
/\ x' = 0
/\ y' = y + 1

The guard is a predicate on the state space. If the guard is true in some state, the action is said to be "enabled" in that state. A program begins in any state satisfying the initial predicate. Then an enabled action is nondeterministically selected and executed atomically. This process repeats infinitely. If a selected action's guard is false, it is simply a skip and it does not change the state.
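The execution model described above (guards, enabledness, nondeterministic selection, stuttering) can be illustrated with a toy simulator. This is plain JavaScript, not TLA+, and the action shown is the x/y example from the text:

```javascript
// Toy simulator for the guarded-command execution model:
// a global state, actions with a guard and an effect, and
// nondeterministic selection among enabled actions.
let state = { x: 3, y: 0 };

const actions = [
  {
    name: "resetX",
    guard: (s) => s.x > 0,              // enabled only when x > 0
    effect: (s) => ({ x: 0, y: s.y + 1 }), // x' = 0, y' = y + 1
  },
];

function step(s) {
  const enabled = actions.filter((a) => a.guard(s));
  if (enabled.length === 0) return s;   // nothing enabled: stutter (skip)
  // nondeterministically pick one enabled action, execute it atomically
  const a = enabled[Math.floor(Math.random() * enabled.length)];
  return a.effect(s);
}

state = step(state); // from {x:3, y:0}, resetX fires: {x:0, y:1}
```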

The global shared memory (GSM) model fits naturally with this style of reasoning. You read the current state and install a new state; no channels, no message serialization, no process boundaries to manage. It is the most minimal way to write models that fit the guarded-command model. Be frugal in defining variables, though: each one exponentially explodes the state space. The payoff is that safety and liveness become compact predicates over global state. Your program defines an invariant set (i.e., the good states) and must never transition out of it.

You don't have to model channels, message serialization, or network topology unless those are the specific things you're reasoning about. It is possible to map GSM to message passing if you keep to "localish" guards and definitely local variable assignments. What do we mean by "local variable" in a global shared state space? A common way is to use indices per node, so vote[i] refers to node i's vote. The global variable is the vote array, and the local version is vote[i]. It's all math, and math needs abstraction. TLA+'s computational model, shaped around the global shared memory fiction, enables you to reason at the right level of abstraction.


3. Refine to local guards and effects (slow is fast)

The global shared memory fiction of TLA+ is powerful for reasoning, but it creates a trap: it is easy to write guards that read global state no real process could observe atomically. This is one of the most common modeling errors. A guard that checks what three different nodes have done simultaneously is "illegal knowledge" as no single node in a real distributed system can know all of that at once. A dedicated review pass should ask, for every action: what information could a real node actually know when it decides to act?

So how do you transform your guards to be local? There is a folk theorem that reading your immediate neighbor is fine because it can be mapped to message passing (think hygienic dining philosophers). This essentially boils down to adding another level of indirection and proxy variables, which means locking. Locking is not good for performance, and in mental model 6 I will advocate that you should refine your atomicity and allow as much concurrency as possible for the implementation to enjoy freedom responsibly. So what do we do?

The key insight is to try to figure out stable, monotonic, or locally stable predicates to use for your guards. In my "Hints for Distributed Systems Design", when I tell you to "embrace monotonicity", I am thinking of exactly this: "For doing coordination efficiently and safely in a distributed system, the insight is almost always to transform the problem into a monotonic one."

This is where things get subtle, and it touches on esoteric knowledge that is rarely made explicit. This is something you get to intuit after years of working on distributed systems close to the theoretical side. Your advisor intuited it, their advisor intuited it, they had developed vocabulary shortcuts to refer to this but never made it fully explicit. UC Berkeley's efforts on CALM (Consistency As Logical Monotonicity) made this more concrete by basing it on a language like Dedalus, but the basic gist lives at the level of abstract global state reasoning.

In our "Slow is Fast" paper, we formalized this intuition by partitioning actions into "slow" and "fast". A slow action's guard remains true even if the node's information is slightly stale. This is because either the guard is a stable predicate (once true, stays true), depends only on local variables, or is a locally stable predicate (only the node's own actions can falsify it). A fast action, by contrast, requires fresh global state to evaluate its guard. The key result is this: if you can make your guards locally stable, the protocol requires less coordination and tolerates communication delays gracefully. Hence, slow is fast.

This is not easy to do. Ernie Cohen had internalized this skill, and he could quickly zero in on the monotonic or locally stable predicates to exploit in his protocol explorations. Some people live in this way of thinking so completely that they may not even know how to articulate it, like fish in water. (Case in point: the phrase "global shared memory fiction" was pointed out to me by Marc Brooker. Sometimes it takes a slight outsider to name what insiders take for granted.) The practical takeaway for TLA+ modeling is this: when you write your guards, ask yourself whether the information the guard relies on could become stale, and if so, whether the guard is still safe to act on. If you can make your guards depend on monotonic or locally stable predicates, your protocol will be more robust, more concurrent, and closer to a correct distributed implementation.
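As a toy illustration of a stable guard: a quorum predicate over a grow-only set of acknowledgments. The names here (acks, QUORUM) are made up; the point is that once the guard becomes true, later additions can never falsify it, so acting on slightly stale knowledge of the set stays safe:

```javascript
// A monotonic (stable) guard: quorum membership over a grow-only set.
// Nodes only ever add acks; nothing is removed, so the guard,
// once true, stays true forever.
const acks = new Set(); // grow-only set of acknowledging nodes
const QUORUM = 2;       // hypothetical quorum size

const quorumReached = () => acks.size >= QUORUM; // stable predicate

acks.add("n1");
const before = quorumReached(); // false: only one ack so far
acks.add("n2");
const after = quorumReached();  // true once the quorum is reached
acks.add("n3");
const later = quorumReached();  // still true: additions cannot falsify it
```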

Lamport’s derivation of Paxos illustrates this beautifully. He begins with the simplest specification of consensus: chosen starts as the empty set and transitions to a singleton {v}. That is the entire next-state formula. He then refines to a voting algorithm where acceptors vote and a value is chosen if a majority votes for it, and refines further to Paxos to handle the problems that arise (what if N acceptors vote for v1, N for v2, and the remaining acceptor fails?). At each refinement step, the guards become more local. In Paxos, the guard for whether an acceptor should cast a vote depends on local knowledge: what ballots this acceptor has participated in. The monotonic structure of ballot numbers ensures that this local knowledge does not become invalid: once an acceptor knows something about the progress of voting, that fact is permanent. This is what makes Paxos work despite asynchrony and failures.


4. Derive good invariants

You did the modeling for a purpose, not for sport. You want to arrive at reasoning insights about your protocol, and invariants are the distilled version of those insights. Invariant-based reasoning is non-operational: instead of tracing execution paths with happy-path thinking, you ask "what needs to go right?" You specify the boundary conditions, and the model checker explores all possible interleavings to verify them.

Spend time distilling good invariants because they serve you in many ways. They guide you as you explore protocol variants (see mental model 5), telling you what you can and cannot change. They act as contracts when you compose components, defining what each part guarantees to the others. They translate directly into runtime assertions and test oracles for your implementation. And they are essential for adding fault-tolerance: once you know the invariant, you know where you need to recover to after a fault, and you can design recovery actions to reestablish it.

Invariant writing is still mostly manual. LLMs struggle with it because invariants are the most signal-heavy part of a spec: they distill your understanding of what this protocol must guarantee. Getting them right means you have closure on the problem.

People new to TLA+ and formal methods repeatedly fail at this step. I occasionally struggle with it too, especially when entering an unfamiliar domain, where I have to first pay my dues and think harder to gain understanding. The most common failure mode is writing "trivial invariants" that are always true regardless of what the protocol does; you've written the spec for naught. Another is confusing the "end state" with an invariant: an invariant must hold at every reachable state, not just the final one. We are not expecting inductive invariants (those are harder still, and more valuable, since a formal proof follows easily from one). What you should aim for is a reasonably tight invariant that demonstrates understanding and scaffolds further exploration.

Don't stop at safety properties (what the system is allowed to do). Write liveness properties too (what the system must eventually do). It is important to check properties like <> Termination and Init ~> Solution. Do requests complete? Do leaders emerge? Many "correct" models quietly do nothing forever. A model that never violates safety but makes no progress is useless. Checking liveness catches paths that stall, specs that are overly constrained, and actions that never get enabled.
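As a sketch of what checking liveness looks like (the Submitted and Answered predicates are hypothetical placeholders): fairness conditions must be conjoined to the spec, or TLC will report behaviors that simply stutter forever as counterexamples to every liveness property:

```tla
\* Weak fairness rules out behaviors that stutter forever while
\* Next remains continuously enabled.
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)

\* Eventually some value is chosen.
Termination == <>(chosen # {})

\* Leads-to: every submitted request is eventually answered.
ResponseLiveness == \A r \in Request : Submitted(r) ~> Answered(r)
```

A spec that passes safety checks but fails these properties is often revealing an action whose guard can never become enabled.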

Returning to our running example: in Lamport's Paxos derivation, the invariants at each refinement level are instructive. At the Consensus level, the safety invariant is simply that at most one value is chosen (Cardinality(chosen) <= 1). At the Voting level, chosen is defined as the set of values for which there exists a ballot where a quorum of acceptors all voted for that value (chosen == {v \in Value : \E b \in Ballot : ChosenAt(b, v)}). The safety of the Voting algorithm rests on two rules: (1) different acceptors cannot vote for different values in the same ballot, and (2) an acceptor may only vote for value v in ballot b if v is safe at b. These invariants are tight, non-trivial, and they scaffold the entire refinement chain down to Paxos. They show genuine understanding of why consensus works, and that is what I mean by "closure on the problem".
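In TLA+, those two Voting-level rules read roughly as follows. This sketch follows the definitions in Lamport's Voting module, where SafeAt is the key auxiliary definition; see his spec for the full details:

```tla
ChosenAt(b, v) == \E Q \in Quorum : \A a \in Q : VotedFor(a, b, v)
chosen == {v \in Value : \E b \in Ballot : ChosenAt(b, v)}

\* (1) No two acceptors vote for different values in the same ballot.
OneValuePerBallot ==
    \A a1, a2 \in Acceptor, b \in Ballot, v1, v2 \in Value :
        VotedFor(a1, b, v1) /\ VotedFor(a2, b, v2) => (v1 = v2)

\* (2) An acceptor votes for v in ballot b only if v is safe at b.
VotesSafe ==
    \A a \in Acceptor, b \in Ballot, v \in Value :
        VotedFor(a, b, v) => SafeAt(b, v)
```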


5. Explore alternatives through stepwise refinement

One of TLA+'s greatest strengths is that it supports fast exploration of protocol variants. The key technique is stepwise refinement: start with the most abstract specification of your problem, then progressively add implementation detail, verifying at each step that the refinement preserves the invariants from the level above.

The canonical example is again Lamport's derivation of Paxos. The abstract Consensus spec defines a single variable chosen that transitions from the empty set to a singleton. Then an intermediate Voting model is defined and shown to be a refinement of Consensus. It describes what properties voting must satisfy, but not how acceptors implement them. Finally, the Paxos model refines Voting by adding message passing, ballot leaders, and the two-phase protocol. At each level, the refinement adds detail (ballots, quorums, messages) while preserving the safety property from above.

Refinement is at the heart of abstraction and a cornerstone of TLA+. In TLA+, refinement is simply implication: the concrete system's behaviors must be a subset of the abstract system's allowed behaviors. You check this by declaring an instance of the abstract spec in the concrete one and verifying via TLC that every behavior of the concrete system is an accepted behavior of the abstract system. Even invariant checking is refinement in disguise: does the system model implement this invariant formula?

A big advantage of stepwise refinement is that when you want to explore a protocol variant, you don't start from scratch. You go back up to the appropriate level of abstraction, change one refinement step, and get a different protocol that still satisfies the same high-level specification. This is systematic design space exploration. With LeaseGuard, for example, we started modeling a lease protocol for Raft early. While refining the abstract spec, we discovered two optimizations we hadn't anticipated, including inherited lease reads, which we probably wouldn't have found without thinking at multiple levels of abstraction.

A common failure pattern here is getting stuck at a level of detail, patching corner cases one by one. This is the implementation mindset leaking into modeling. When this happens, go back up. I saw this with the Secondary Index project at Aurora DSQL: an engineer's design was growing by accretion, each corner-case patch creating new corner cases. TLA+ forced a different approach: specify what the secondary index must guarantee abstractly, then search the solution space through refinement. Over a weekend, with no prior TLA+ experience, the engineer had written several variations. The lesson: specify behavior, not implementation, then explore different "how" choices through refinement.


6. Aggressively refine atomicity

Overly large atomic actions hide races. If your TLA+ action does ten things atomically in a single step, you're sweeping concurrency under the rug. The model will look correct, but it won't represent the interleavings your real system will face. Actions should be as fine-grained as correctness allows. Smaller steps expose the interleavings the protocol must tolerate and make invariants more meaningful.

A practical approach is to start coarse-grained to establish correctness, then systematically split actions into smaller steps and verify safety still holds. This is stepwise refinement (mental model 5) applied to action granularity. Each split increases the interleaving space, which is precisely where TLC earns its keep: the interference surface area may explode, but TLC will exhaustively check that your invariants hold.

The goal here is to give the implementation maximum freedom to reorder and parallelize, while verifying that the protocol remains correct under that concurrency. Fine-grained atomicity in the spec means the implementation can schedule operations in any order that respects the guard conditions.

Our StableEmptySet project is a good example. Ernie Cohen pushed atomic actions to the finest possible granularity to avoid distributed locking, and the resulting protocol was more concurrent than a lock-based design. The interleaving surface area increased, but TLC handled it without breaking a sweat. This is where the "slow is fast" insight (mental model 3) connects back: by refining guards to depend on locally stable predicates, you can split actions finely without introducing illegal knowledge: the guards remain valid even as the global state changes between steps.


7. Share your mental models

TLA+ models are also meant to be communication tools. A well-written spec serves as precise, executable documentation of a protocol, capturing design intent in a way that prose specifications and code comments cannot match.

At AWS, when I wrote the first TLA+ model of Aurora DSQL's distributed transaction protocol, the model's value quickly went beyond correctness confidence. It served as a communication anchor for a large team. When we sought further formal methods support, the TLA+ models sped up onboarding for new team members and kept everyone aligned on the protocol's design. Instead of arguing over ambiguous prose in a design document, the team could point to specific actions and invariants in the spec.

Our MongoDB distributed transactions work reinforces this. We wrote the first modular TLA+ specification of the protocol many years after it was in production. The model now serves as an authoritative description of what the protocol actually guarantees, bringing clarity to a system that had become difficult to reason about as a whole.

So, you should write your specs for humans, not just for TLC. Use clear action names, write TypeOK invariants early (they serve as executable documentation of your data model), and make every important property an explicit invariant. A well-structured spec communicates protocol intent more clearly than thousands of lines of implementation code, because it shows the essential logic without the noise of error handling, serialization, and configuration.

Tools like Spectacle can visualize TLA+ state spaces and execution traces, bridging the gap between mathematical precision and the operational intuition engineers thrive on. TLA+ is one of the purest intellectual pleasures: reducing a complex distributed system to its essential logic. Please share that pleasure with others by writing about your models, presenting them, and using them to teach.

As LLMs write more of our code, the value of TLA+ for design and reasoning will only grow. TLA+ has the potential to become a cornerstone in an AI+formal methods stack for building systems. The mental models I've described here are the foundation for that future. By mastering abstraction, embracing the global shared memory model, refining to local guards, deriving good invariants, exploring alternatives through refinement, aggressively refining atomicity, and sharing our mental models, we can unlock the full power of TLA+ to design better distributed systems in the age of AI. 

Introducing Database Traffic Control

Enforce real-time limits on your Postgres query traffic to protect your database from runaway queries and unexpected load spikes.