TLA+ mental models
In the age of LLMs, syntax is no longer the bottleneck for writing, reading, or learning TLA+. People are even getting value by generating TLA+ models and counterexamples directly from Google Docs descriptions of their algorithms. The accidental complexity of TLA+ (its syntax and tooling) is going away.
But the intrinsic complexity remains: knowing where to start a model, what to ignore, and how to choose the right abstractions. This is modeling judgment, and it is the hardest skill to teach. Engineers are trained to think in code, control flow, and local state. TLA+ forces you into a different mode: mathematical, declarative, and global. You specify what must hold, not how to achieve it. Once you get comfortable with this shift, it changes how you think about systems, even away from the keyboard.
In a companion post, I described TLA+ as a design accelerator based on lessons from 8 industry projects. Here I want to go deeper and articulate the mental models behind effective TLA+ use. These are the thinking patterns that TLA+ experts apply implicitly, the kind of knowledge acquired by osmosis over years. I will try to make them explicit and actionable.
The mental models are:
- Abstraction, abstraction, abstraction
- Embrace the global shared memory model
- Refine to local guards and effects (slow is fast)
- Derive good invariants
- Explore alternatives through stepwise refinement
- Aggressively refine atomicity
- Share your mental models
1. Abstraction, abstraction, abstraction
Abstraction is a powerful tool for avoiding distraction. The word abstract comes from the Latin abstrahere, "to draw away". With abstraction, you slice the protocol out of a complex system, omit unnecessary details, and reduce the complexity to a useful model. You don't have to model everything. Most of the value comes from knowing what to ignore. Omission is the default: add a component only when leaving it out breaks your reasoning goal.
Abstraction is the art of knowing what to discard. You can cross-cut to focus on the protocol you want to investigate. If you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication when that is an unnecessary distraction. If you are interested in a replication protocol, you can abstract away the client interaction. Consider two examples from our industry experience.
- CosmosDB consistency modeling: We needed to specify CosmosDB's client-facing consistency semantics. The anti-pattern would have been modeling the distributed database engine, which would have caused an immediate state-space explosion and an unreadable spec. Instead, we modeled just the "history as a log" abstraction for client-facing behavior. The inner details of the database were "environment": irrelevant to what we were trying to reason about. We used sort-merge to capture the internals of replication and a read index to model consistency. This way, five consistency levels became clear predicates over operation histories.
- Secondary Index at Aurora DSQL: Here, we focused on the secondary index and abstracted away the rest of the database as a log of operations. A log is a common abstraction for State Machine Replication (SMR), which is itself a common abstraction for databases and data stores. By cutting the problem down to its essential slice, we made a complex system tractable.
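As a loose illustration of the "history as a log" style, the sketch below reduces consistency levels to predicates over a log and a read index. All names (log, readIdx, K) are my own placeholders, not the actual CosmosDB spec; the Init/Next machinery is omitted.

```tla
---------------------------- MODULE HistoryAsLog ----------------------------
EXTENDS Naturals, Sequences

CONSTANT K                \* hypothetical staleness bound
VARIABLES log, readIdx    \* log of committed writes; index a client read observes

\* Strong consistency: a read observes the full history.
Strong == readIdx = Len(log)

\* Bounded staleness: a read may lag the tail by at most K writes.
BoundedStaleness == readIdx + K >= Len(log)

\* Session levels would add monotonicity conditions on readIdx
\* across a client's successive reads.
=============================================================================
```

The point is that once the database internals are "environment", each consistency level is a one-line predicate rather than a subsystem.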
Leslie Lamport gave a beautiful demonstration of this when he described how he derived Paxos. He started with the most abstract specification: defining agreement based on a vote array. In his words: "I don't remember the thought process that led me to Paxos. But I knew that execution on computers sending messages to one another was an irrelevant detail. I was thinking only about a set of processes and what they needed to know about one another. How they could get that information from messages was the easy part that came later. What I had first was what, many years later, I described as the Voting algorithm."
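The most abstract level is tiny. The entire next-state relation of Lamport's Consensus spec, the starting point of that derivation, fits in a few lines (lightly paraphrased):

```tla
----------------------------- MODULE Consensus -----------------------------
CONSTANT Value
VARIABLE chosen           \* the set of values that have been chosen

Init == chosen = {}

\* The only action: if nothing has been chosen yet, choose some value.
Next == /\ chosen = {}
        /\ \E v \in Value : chosen' = {v}

Spec == Init /\ [][Next]_chosen
============================================================================
```

Everything about processes, messages, and failures is deliberately absent; those are the "easy part that came later".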
TLA+ is useful for teaching engineers the art of abstraction and the right way to think and reason about distributed systems. The modeling process itself trains you to ask: what is the behavioral slice I care about, and what can I safely ignore? And because TLA+ gives you a rapid prototyping tool with a tight feedback loop (write a model, check it, revise) you accumulate design experience much faster than you would by building and debugging real systems.
2. Embrace the Global Shared Memory Model
TLA+ gives you a deliberate fiction: a global shared memory that all processes can read and write. This fiction is the foundation of its computational model, and understanding it is essential to thinking in TLA+.
In TLA+, a program consists of two things: (1) a set of variables that define a global state space, and (2) a finite set of actions that transition from one state to the next. This is state-centric reasoning: everything is a predicate (a function mapping to a boolean). This approach promotes invariant-based thinking (see mental model 4).
Each action follows the shape of Tony Hoare's triples: {precondition} action {postcondition}. The execution of an action is conditional on a predicate (the "guard") being true. For example: x > 0 → x := 0; y := y + 1. Or in TLA+ notation:
/\ x > 0
/\ x' = 0
/\ y' = y + 1
The guard is a predicate on the state space. If the guard is true in some state, the action is said to be "enabled" in that state. A program begins in any state satisfying the initial predicate. Then an enabled action is nondeterministically selected and executed atomically. This process repeats infinitely. If a selected action's guard is false, it is simply a skip and it does not change the state.
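Wrapped into a complete little spec (the variable names and the second action are illustrative additions), the guarded action sits alongside an initial predicate and a nondeterministic Next:

```tla
-------------------------- MODULE GuardedCommands --------------------------
EXTENDS Naturals
VARIABLES x, y

\* A program begins in any state satisfying the initial predicate.
Init == x \in 0..3 /\ y = 0

Reset == /\ x > 0                 \* guard: enabled only when x is positive
         /\ x' = 0
         /\ y' = y + 1

Bump ==  /\ x < 3                 \* a second action, so there is a real choice
         /\ x' = x + 1
         /\ y' = y

\* At each step, any enabled disjunct may be selected and executed atomically.
Next == Reset \/ Bump

Spec == Init /\ [][Next]_<<x, y>>
============================================================================
```

TLC explores every nondeterministic choice, so all interleavings of Reset and Bump from all initial states get checked.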
The global shared memory (GSM) model fits naturally with this style of reasoning. You read the current state and install a new state; no channels, no message serialization, no process boundaries to manage. It is the most minimal way to write models that fit the guarded-command style. Be frugal in defining variables, though: each additional variable multiplies the state space, so growth is exponential in the number of variables. The payoff is that safety and liveness become compact predicates over global state. Your program defines an invariant set (i.e., the good states) and must never transition out of it.
You don't have to model channels, message serialization, or network topology unless those are the specific things you're reasoning about. It is possible to map GSM to message passing if you keep your guards local-ish and your variable assignments strictly local. What do we mean by "local variable" in a global shared state space? A common way is to use indices per node, so vote[i] refers to node i's vote. The global variable is the vote array, and the local version is vote[i]. It's all math, and math needs abstraction. TLA+'s computational model, shaped around the global shared memory fiction, enables you to reason at the right level of abstraction.
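A minimal sketch of this indexing convention (Node, Value, and None are placeholder constants):

```tla
------------------------------ MODULE LocalVote -----------------------------
CONSTANTS Node, Value, None
VARIABLE vote                 \* globally a function; vote[i] is node i's slot

Init == vote = [i \in Node |-> None]

\* A "local" action: the guard and the update touch only node i's slot,
\* so it maps cleanly to a real process acting on its own state.
CastVote(i, v) == /\ vote[i] = None
                  /\ vote' = [vote EXCEPT ![i] = v]

Next == \E i \in Node, v \in Value : CastVote(i, v)

Spec == Init /\ [][Next]_vote
=============================================================================
```

Globally, vote is one shared array; but because CastVote(i, v) reads and writes only vote[i], it behaves like per-node local state.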
3. Refine to local guards and effects (slow is fast)
The global shared memory fiction of TLA+ is powerful for reasoning, but it creates a trap: it is easy to write guards that read global state no real process could observe atomically. This is one of the most common modeling errors. A guard that checks what three different nodes have done simultaneously is "illegal knowledge", since no single node in a real distributed system can know all of that at once. A dedicated review pass should ask, for every action: what information could a real node actually know when it decides to act?
So how do you transform your guards to be local? There is a folk theorem that reading your immediate neighbor is fine because it can be mapped to message passing (think hygienic dining philosophers). This essentially boils down to adding another level of indirection and proxy variables, which amounts to locking. Locking is bad for performance, and in mental model 6 I will advocate that you refine your atomicity and allow as much concurrency as possible, so the implementation can enjoy freedom responsibly. So what do we do?
The key insight is to try to figure out stable, monotonic, or locally stable predicates to use for your guards. In my "Hints for Distributed Systems Design", when I tell you to "embrace monotonicity", I am thinking of exactly this: "For doing coordination efficiently and safely in a distributed system, the insight is almost always to transform the problem into a monotonic one."
This is where things get subtle, and it touches on esoteric knowledge that is rarely made explicit. It is something you come to intuit after years of working on distributed systems close to the theoretical side. Your advisor intuited it, their advisor intuited it; they developed vocabulary shortcuts to refer to it but never made it fully explicit. UC Berkeley's work on CALM (Consistency As Logical Monotonicity) made this more concrete by grounding it in a language like Dedalus, but the basic gist lives at the level of abstract global state reasoning.
In our "Slow is Fast" paper, we formalized this intuition by partitioning actions into "slow" and "fast". A slow action's guard remains true even if the node's information is slightly stale: the guard is either a stable predicate (once true, stays true), a predicate over only local variables, or a locally stable predicate (one that only the node's own actions can falsify). A fast action, by contrast, requires fresh global state to evaluate its guard. The key result is this: if you can make your guards locally stable, the protocol requires less coordination and tolerates communication delays gracefully. Hence, slow is fast.
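As an illustration (assuming a write-once vote array over Node, a Quorum constant, and a None placeholder), compare a stable guard with one that demands fresh global state:

```tla
\* Assume vote[i] is write-once: it moves from None to some v and never back.
Voted(i)       == vote[i] # None
QuorumVoted(v) == \E Q \in Quorum : \A i \in Q : vote[i] = v

\* QuorumVoted(v) is stable: once true, it stays true forever, so a node may
\* safely act on a stale observation of it. An action guarded by it is "slow".

\* NoVotes is not locally stable: any other node's CastVote can falsify it,
\* so an action guarded by it is "fast" and needs fresh global state.
NoVotes == \A i \in Node : vote[i] = None
```

Spotting which of your candidate guards fall in the first category, and reformulating the rest until they do, is the craft this mental model names.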
This is not easy to do. Ernie Cohen had internalized this skill, and he could quickly zero in on the monotonic or locally stable predicates to exploit in his protocol explorations. Some people live in this way of thinking so completely that they may not even know how to articulate it, like fish in water. (Case in point: the phrase "global shared memory fiction" was pointed out to me by Marc Brooker. Sometimes it takes a slight outsider to name what insiders take for granted.) The practical takeaway for TLA+ modeling is this: when you write your guards, ask yourself whether the information the guard relies on could become stale, and if so, whether the guard is still safe to act on. If you can make your guards depend on monotonic or locally stable predicates, your protocol will be more robust, more concurrent, and closer to a correct distributed implementation.
Lamport’s derivation of Paxos illustrates this beautifully. He begins with the simplest specification of consensus: chosen starts as the empty set and transitions to a singleton {v}. That is the entire next-state formula. He then refines to a voting algorithm where acceptors vote and a value is chosen if a majority votes for it, and refines further to Paxos to handle the problems that arise (what if N acceptors vote for v1, N for v2, and the remaining acceptor fails?). At each refinement step, the guards become more local. In Paxos, the guard for whether an acceptor should cast a vote depends on local knowledge: what ballots this acceptor has participated in. The monotonic structure of ballot numbers ensures that this local knowledge does not become invalid: once an acceptor knows something about the progress of voting, that fact is permanent. This is what makes Paxos work despite asynchrony and failures.
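Concretely, the Phase 2b guard in the standard Paxos spec reads only an incoming message and the acceptor's own promise state (lightly paraphrased from Lamport's Paxos.tla; Send appends a record to the msgs set):

```tla
Phase2b(a) == \E m \in msgs :
                /\ m.type = "2a"
                /\ m.bal >= maxBal[a]    \* local knowledge: a's highest promise
                /\ maxBal'  = [maxBal  EXCEPT ![a] = m.bal]
                /\ maxVBal' = [maxVBal EXCEPT ![a] = m.bal]
                /\ maxVal'  = [maxVal  EXCEPT ![a] = m.val]
                /\ Send([type |-> "2b", acc |-> a,
                         bal |-> m.bal, val |-> m.val])
```

Nothing in the guard mentions any other acceptor's state; ballot monotonicity is what makes this local check sufficient.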
4. Derive good invariants
You did the modeling for a purpose, not for sport. You want to arrive at reasoning insights about your protocol, and invariants are the distilled version of those insights. Invariant-based reasoning is non-operational: instead of tracing execution paths and happy-path thinking, you ask "what needs to go right?" You specify the boundary conditions, and the model checker explores all possible interleavings to verify them.
Spend time distilling good invariants because they serve you in many ways. They guide you as you explore protocol variants (see mental model 5), telling you what you can and cannot change. They act as contracts when you compose components, defining what each part guarantees to the others. They translate directly into runtime assertions and test oracles for your implementation. And they are essential for adding fault-tolerance: once you know the invariant, you know where you need to recover to after a fault, and you can design recovery actions to reestablish it.
Invariant writing is still mostly manual. LLMs struggle with it because invariants are the most signal-heavy part of a spec: they distill your understanding of what this protocol must guarantee. Getting them right means you have closure on the problem.
People new to TLA+ and formal methods repeatedly fail at this step. I occasionally struggle with it too, especially when entering an unfamiliar domain, where I first have to pay my dues and think harder to gain understanding. The most common failure mode is writing "trivial invariants" that are always true regardless of what the protocol does; you've written the spec for naught. Another is confusing the "end state" with an invariant: an invariant must hold at every reachable state, not just the final one. We are not expecting inductive invariants (those are harder still, and more valuable, since a formal proof follows easily from one). But a reasonably tight invariant that demonstrates understanding and scaffolds further exploration is what you should aim for.
Don't stop at safety properties (what the system is allowed to do). Write liveness properties too (what the system must eventually do). It is important to check properties like <> Termination and Init ~> Solution. Do requests complete? Do leaders emerge? Many "correct" models quietly do nothing forever. A model that never violates safety but makes no progress is useless. Checking liveness catches paths that stall, specs that are overly constrained, and actions that never get enabled.
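To make these failure modes concrete, here is a sketch using a consensus-style chosen variable (a subset of Value; Cardinality comes from the FiniteSets module):

```tla
Trivial  == chosen \subseteq Value      \* always true by typing; says nothing

EndState == chosen # {}                 \* a goal state, not an invariant:
                                        \* it is false in the initial state

Safety   == Cardinality(chosen) <= 1    \* tight, and can actually be violated

\* Liveness: something is eventually chosen (holds only under fairness).
Termination == <>(chosen # {})
```

TLC would happily confirm Trivial on any spec, which is exactly why it carries no signal; Safety and Termination are the ones that can fail and therefore teach you something.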
Returning to our running example: in Lamport's Paxos derivation, the invariants at each refinement level are instructive. At the Consensus level, the safety invariant is simply that at most one value is chosen (Cardinality(chosen) <= 1). At the Voting level, chosen is defined as the set of values for which there exists a ballot where a quorum of acceptors all voted for that value (chosen == {v \in Value : \E b \in Ballot : ChosenAt(b, v)}). The safety of the Voting algorithm rests on two rules: (1) different acceptors cannot vote for different values in the same ballot, and (2) an acceptor may only vote for value v in ballot b if v is safe at b. These invariants are tight, non-trivial, and they scaffold the entire refinement chain down to Paxos. They show genuine understanding of why consensus works, and that is what I mean by "closure on the problem".
5. Explore alternatives through stepwise refinement
One of TLA+'s greatest strengths is that it supports fast exploration of protocol variants. The key technique is stepwise refinement: start with the most abstract specification of your problem, then progressively add implementation detail, verifying at each step that the refinement preserves the invariants from the level above.
The canonical example is again Lamport's derivation of Paxos. The abstract Consensus spec defines a single variable chosen that transitions from the empty set to a singleton. Then an intermediate Voting model is defined and shown to be a refinement of Consensus. It describes what properties voting must satisfy, but not how acceptors implement them. Finally, the Paxos model refines Voting by adding message passing, ballot leaders, and the two-phase protocol. At each level, the refinement adds detail (ballots, quorums, messages) while preserving the safety property from above.
Refinement is at the heart of abstraction and a cornerstone of TLA+. In TLA+, refinement is simply implication: the concrete system's behaviors must be a subset of the abstract system's allowed behaviors. You check this by declaring an instance of the abstract spec in the concrete one and verifying via TLC that every behavior of the concrete system is an accepted behavior of the abstract system. Even invariant checking is refinement in disguise: does the system model implement this invariant formula?
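In Lamport's Voting spec this takes only a few lines (paraphrased): define the abstract variable as an expression over concrete state, instantiate the abstract module, and assert implication.

```tla
\* Define the abstract variable in terms of the concrete state ...
chosen == {v \in Value : \E b \in Ballot : ChosenAt(b, v)}

\* ... instantiate the abstract spec; its chosen maps to the definition above ...
C == INSTANCE Consensus

\* ... and refinement is implication, checkable by TLC:
THEOREM Spec => C!Spec
```

TLC checks the theorem by verifying C!Spec as a property of the Voting model, so a refinement violation shows up as an ordinary counterexample trace.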
A big advantage of stepwise refinement is that when you want to explore a protocol variant, you don't start from scratch. You go back up to the appropriate level of abstraction, change one refinement step, and get a different protocol that still satisfies the same high-level specification. This is systematic design space exploration. With LeaseGuard, for example, we started modeling a lease protocol for Raft early. While refining the abstract spec, we discovered two optimizations we hadn't anticipated, including inherited lease reads, which we probably wouldn't have found without thinking at multiple levels of abstraction.
A common failure pattern here is getting stuck at a level of detail, patching corner cases one by one. This is the implementation mindset leaking into modeling. When this happens, go back up. I saw this with the Secondary Index project at Aurora DSQL: an engineer's design was growing by accretion, each corner-case patch creating new corner cases. TLA+ forced a different approach: specify what the secondary index must guarantee abstractly, then search the solution space through refinement. Over a weekend, with no prior TLA+ experience, the engineer had written several variations. The lesson: specify behavior, not implementation, then explore different "how" choices through refinement.
6. Aggressively refine atomicity
Overly large atomic actions hide races. If your TLA+ action does ten things atomically in a single step, you're sweeping concurrency under the rug. The model will look correct, but it won't represent the interleavings your real system will face. Actions should be as fine-grained as correctness allows. Smaller steps expose the interleavings the protocol must tolerate and make invariants more meaningful.
A practical approach is to start coarse-grained to establish correctness, then systematically split actions into smaller steps and verify safety still holds. This is stepwise refinement (mental model 5) applied to action granularity. Each split increases the interleaving space, which is precisely where TLC earns its keep: the interference surface area may explode, but TLC will exhaustively check that your invariants hold.
The goal here is to give the implementation maximum freedom to reorder and parallelize, while verifying that the protocol remains correct under that concurrency. Fine-grained atomicity in the spec means the implementation can schedule operations in any order that respects the guard conditions.
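A schematic example of such a split (pc, obs, out, GlobalView, and Decide are all hypothetical names for this sketch):

```tla
\* Coarse-grained: observe global state and act on it in one atomic step.
Coarse(i) == /\ pc[i] = "idle"
             /\ out' = [out EXCEPT ![i] = Decide(GlobalView)]
             /\ pc'  = [pc  EXCEPT ![i] = "done"]
             /\ UNCHANGED obs

\* Refined: observation and action are separate steps. Other nodes can now
\* interleave between them, so Decide must stay safe on a possibly stale obs[i].
Read(i) == /\ pc[i] = "idle"
           /\ obs' = [obs EXCEPT ![i] = GlobalView]
           /\ pc'  = [pc  EXCEPT ![i] = "read"]
           /\ UNCHANGED out

Act(i)  == /\ pc[i] = "read"
           /\ out' = [out EXCEPT ![i] = Decide(obs[i])]
           /\ pc'  = [pc  EXCEPT ![i] = "done"]
           /\ UNCHANGED obs
```

If the invariants still hold after replacing Coarse with Read and Act, the implementation has earned the right to perform those two steps without a lock between them.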
Our StableEmptySet project is a good example. Ernie Cohen pushed atomic actions to the finest possible granularity to avoid distributed locking, and the resulting protocol was more concurrent than a lock-based design. The interleaving surface area increased, but TLC handled it without breaking a sweat. This is where the "slow is fast" insight (mental model 3) connects back: by refining guards to depend on locally stable predicates, you can split actions finely without introducing illegal knowledge, because the guards remain valid even as the global state changes between steps.
7. Share your mental models
TLA+ models are also meant to be communication tools. A well-written spec serves as precise, executable documentation of a protocol, capturing design intent in a way that prose specifications and code comments cannot match.
At AWS, when I wrote the first TLA+ model of Aurora DSQL's distributed transaction protocol, the model's value quickly went beyond correctness confidence. It served as a communication anchor for a large team. When we sought further formal methods support, the TLA+ models sped up onboarding for new team members and kept everyone aligned on the protocol's design. Instead of arguing over ambiguous prose in a design document, the team could point to specific actions and invariants in the spec.
Our MongoDB distributed transactions work reinforces this. We wrote the first modular TLA+ specification of the protocol many years after it was in production. The model now serves as an authoritative description of what the protocol actually guarantees, bringing clarity to a system that had become difficult to reason about as a whole.
So, you should write your specs for humans, not just for TLC. Use clear action names, write TypeOK invariants early (they serve as executable documentation of your data model), and make every important property an explicit invariant. A well-structured spec communicates protocol intent more clearly than thousands of lines of implementation code, because it shows the essential logic without the noise of error handling, serialization, and configuration.
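A TypeOK invariant pins the data model down in one readable, checkable place. A sketch with illustrative names (Node, Value, None, Entry are placeholders):

```tla
TypeOK == /\ vote    \in [Node -> Value \cup {None}]
          /\ log     \in Seq(Entry)
          /\ readIdx \in 0..Len(log)
          /\ pc      \in [Node -> {"idle", "read", "done"}]
```

A reader can learn the shape of every variable from these four lines, and TLC checks on every state that the spec never strays from that shape.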
Tools like Spectacle can visualize TLA+ state spaces and execution traces, bridging the gap between mathematical precision and the operational intuition engineers thrive on. TLA+ is one of the purest intellectual pleasures: reducing a complex distributed system to its essential logic. Please share that pleasure with others by writing about your models, presenting them, and using them to teach.
As LLMs write more of our code, the value of TLA+ for design and reasoning will only grow. TLA+ has the potential to become a cornerstone in an AI+formal methods stack for building systems. The mental models I've described here are the foundation for that future. By mastering abstraction, embracing the global shared memory model, refining to local guards, deriving good invariants, exploring alternatives through refinement, aggressively refining atomicity, and sharing our mental models, we can unlock the full power of TLA+ to design better distributed systems in the age of AI.