a curated list of database news from authoritative sources

March 10, 2026

TLA+ as a Design Accelerator: Lessons from the Industry

After 15+ years of using TLA+, I now think of it as a design accelerator. One of the purest intellectual pleasures is finding a way to simplify and cut out complexity. TLA+ is a thinking tool that lets you do that.

TLA+ forces us out of implementation-shaped and operational reasoning into mathematical declarative reasoning about system behavior. Its global state-transition model and its deliberate fiction of shared memory make complex distributed behavior manageable. Safety and liveness become clear and compact predicates over global state. This makes TLA+ powerful for design discovery. It supports fast exploration of protocol variants and convergence on sound designs before code exists.

TLA+ especially shines for complex distributed/concurrent systems. In such systems, complexity exceeds human intuition very quickly. (I often point to very simple interleaving/nondeterministic execution puzzles to show how much we suck at reasoning about such systems.) Testing is inadequate for catching subtle design errors in complex distributed/concurrent systems. Code may faithfully implement a design, but the design itself can fail in rare concurrent scenarios. TLA+ provides an exhaustively testable design, catches design errors before code exists, and enables rapid "what if" design exploration and aggressive protocol optimization, safely.
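To see how quickly interleavings defeat intuition, here is a toy sketch (my own illustration, not TLA+) that brute-forces every schedule of two non-atomic increments, each split into a read step and a write step, the same way a model checker would:

```python
from itertools import permutations

# Each thread performs a non-atomic increment: read x, then write x+1.
# We enumerate all interleavings of the two threads' (read, write) steps
# and collect the possible final values of x.

def run(schedule):
    x = 0
    local = {}  # per-thread register holding the value it read
    for tid, op in schedule:
        if op == "read":
            local[tid] = x
        else:  # write
            x = local[tid] + 1
    return x

steps = [(0, "read"), (0, "write"), (1, "read"), (1, "write")]
finals = set()
for perm in permutations(steps):
    # keep only schedules where each thread reads before it writes
    if perm.index((0, "read")) < perm.index((0, "write")) and \
       perm.index((1, "read")) < perm.index((1, "write")):
        finals.add(run(perm))

print(sorted(finals))  # [1, 2] -- the "lost update" makes 1 possible
```

Even this two-thread, four-step puzzle has a surprising outcome; real protocols have astronomically more interleavings, which is exactly what TLC explores exhaustively.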

This is why there are so many cases of TLA+ modeling used in industry, including Amazon/AWS, Microsoft Azure, MongoDB, Oracle Cloud, Google, LinkedIn, Datadog, Nike, Intel.

In this post, I will talk about the TLA+ modeling projects I worked at. Mostly from the industry. (Ok, I counted it, 8 projects. I talk to you about 8 projects.)


1. WPaxos (2016)

This is the only non-industry experience I will mention. It is from my work on WPaxos at SUNY Buffalo, with my PhD students.

WPaxos adapts Paxos for geo-distributed systems by combining flexible quorums with many concurrent leaders. It uses two quorum types: a phase-1 quorum (Q1) used to establish or steal leadership for an object, and a phase-2 quorum (Q2) used to commit updates. Flexible quorum rules require only that every Q1 intersect every Q2, not that all quorums intersect each other. WPaxos exploits this by placing Q2 quorums largely within a single region, so the common case of committing updates happens with low local latency, while the rarer Q1 leader changes span zones. Nodes can steal leadership of objects via Q1 when they observe demand elsewhere, in order to migrate the object's ownership toward the region issuing the most writes. Safety is assured because any attempted commit must pass through a Q2 quorum that intersects prior Q1 decisions, preventing conflicting updates despite failures, network delays, and concurrent leaders.
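The flexible-quorum condition above is mechanically checkable. Here is a tiny sketch (my own toy configuration, not the paper's exact quorum system) with three zones of three nodes: Q2 quorums are any two nodes within one zone, Q1 quorums take two nodes from every zone, and the intersection requirement holds by pigeonhole:

```python
from itertools import combinations

# Toy flexible-quorum check in the spirit of WPaxos (illustrative only).
# Three zones, three nodes each.
zones = {
    "east":    {"e1", "e2", "e3"},
    "west":    {"w1", "w2", "w3"},
    "central": {"c1", "c2", "c3"},
}

# Q2 (commit) quorums: any 2 nodes within a single zone -- fast, local.
q2_quorums = [set(q) for z in zones.values() for q in combinations(sorted(z), 2)]

# Q1 (leadership) quorums: 2 nodes from each zone -- slow, but rare.
q1_quorums = []
for qe in combinations(sorted(zones["east"]), 2):
    for qw in combinations(sorted(zones["west"]), 2):
        for qc in combinations(sorted(zones["central"]), 2):
            q1_quorums.append(set(qe) | set(qw) | set(qc))

# Flexible quorum requirement: every Q1 must intersect every Q2.
# (Any two 2-subsets of a 3-node zone must overlap: 2 + 2 > 3.)
assert all(q1 & q2 for q1 in q1_quorums for q2 in q2_quorums)
print("all Q1/Q2 pairs intersect")
```

Getting exactly this kind of cross-zone intersection definition right is what the TLA+ modeling helped with.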

I explained the basic WPaxos protocol here, if you'd like to read more. (Sadly there was never a part 2 for that post. I don't know if anyone is using WPaxos in production, but it is a really good idea and I hope to hear about deployments in the wild.) As for our use of TLA+ for the protocol, it came early on. After we had the intuitive idea of the protocol, we knew we needed strong modeling support in order to get this complex thing completely right. The modeling also helped us sharpen our definitions. It is not straightforward to define quorums across zones while getting the intersections right. The TLA+ modeling was so useful, in fact, that we used TLA+/PlusCal snippets in our paper to explain concepts (a model-checking-validated spec, rather than the hail-Mary pseudocode everyone else uses). The definitions in the paper also came from our TLA+ formal definitions.

The lesson learned: Model early! As we predicted, we got a lot of mileage by modeling in TLA+ early on in this project.


2. CosmosDB (2018)

During my sabbatical at Microsoft Azure CosmosDB, I helped specify the database's client-facing consistency semantics. The nice thing about these specs was that they didn't need to model internal implementation. The goal was to capture the consistency semantics for clients in a precise manner, rather than providing ambiguous English explanations. The model aimed to answer the question: What kind of behavior should a client be able to witness while interacting with the service?

The anti-pattern here would be to try to model the distributed database engine. Trying to show how the replication/coordination works would have led to immediate state-space explosion and an unreadable/unusable specification. Instead, we modeled at a high level and homed in on the "history as a log" abstraction to represent/capture the user-facing concurrency.

The model is available here.  The "history" variable records all client operations (reads and writes) in order, and each consistency level validates different properties against it:

  • Strong Consistency enforces Linearizability: reads must always see the latest write globally across all regions
  • Bounded Staleness uses ReadYourWrite: clients see their own writes, plus bounded staleness by K operations
  • Session uses MonotonicReadPerClient: a client's reads are monotonic (never go backward)
  • Consistent Prefix uses MonotonicWritePerRegion: writes in a region appear in order
  • Eventual uses Eventual Convergence: reads eventually see writes that exist in the database.
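The actual checks are TLA+ properties over the history variable, but the flavor is easy to convey. Here is a small Python sketch (my own illustration, not the CosmosDB spec) of the MonotonicReadPerClient idea: scan the history log and flag any client whose reads go backward:

```python
# A tiny history-as-a-log sketch (my illustration, not the actual
# CosmosDB TLA+ spec). Each event is (client, op, value); written
# values are monotonically increasing versions.

history = [
    ("c1", "write", 1),
    ("c2", "read",  1),
    ("c1", "write", 2),
    ("c2", "read",  2),
    ("c2", "read",  1),   # c2's read goes backward: violates session
]

def monotonic_reads_per_client(history):
    """Session consistency check: no client's reads may go backward."""
    last_read = {}
    for client, op, value in history:
        if op == "read":
            if value < last_read.get(client, value):
                return False
            last_read[client] = value
    return True

print(monotonic_reads_per_client(history))  # False: the last read regressed
```

Each consistency level is just a different predicate evaluated against the same log, which is what makes the abstraction so economical.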

The replication macro was particularly high-level and clever. When region d replicates from region s, it merges both write histories, sorts them, and deduplicates to get a consistent, monotonically increasing sequence of writes. After replication, Data[d] is set to the last value in the merged database, ensuring regions eventually converge to the same state. 
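The merge-sort-dedupe step can be paraphrased in a few lines (my Python rendering of the idea; the actual macro is TLA+/PlusCal, and the write format here is my assumption):

```python
# A sketch of the merge-sort-dedupe replication step described above.
# Writes are (version, value) pairs; versions give a total order.

def replicate(dest_log, src_log):
    """Merge two regions' write histories into one consistent sequence."""
    merged = sorted(set(dest_log) | set(src_log))  # merge, sort, dedupe
    data = merged[-1][1] if merged else None       # converge on last write
    return merged, data

east = [(1, "a"), (2, "b")]
west = [(2, "b"), (3, "c")]

log, data = replicate(east, west)
print(log)   # [(1, 'a'), (2, 'b'), (3, 'c')]
print(data)  # 'c'
```

Because the merge is deterministic and idempotent, any two regions that exchange histories in either order end up with the same log, which is why convergence falls out so cleanly.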

The lesson here is to model minimalistically. A model does not have to capture everything to be highly valuable; it just needs to capture the part/behavior that matters.

This minimalistic model served as precise documentation for outside-facing behavior, replacing ambiguous English explanations, and became foundational enough that a 2023 academic paper built/improved on it. I talked about this improved model here. The history/log was again the main abstraction in that model. The 2023 paper accompanying the model has this great opening paragraph, which echoes the experience of everyone who has painstakingly specified a distributed/concurrent system's behavior:

"Consistency guarantees for distributed databases are notoriously hard to understand. Not only can distributed systems inherently behave in unexpected and counter-intuitive ways due to internal concurrency and failures, but they can also lull their users into a false sense of functional correctness: most of the time, users of a distributed database will witness a much simpler and more consistent set of behaviors than what is actually possible. Only timeouts, fail-overs, or other rare events will expose the true set of behaviors a user might witness. Testing for these scenarios is difficult at best: reproducing them reliably requires controlling complex concurrency factors, latency variations, and network behaviors. Even just producing usable documentation for developers is fundamentally challenging and explaining these subtle consistency issues via documentation comes as an additional burden to distributed system developers and technical writers alike."


3. AWS DSQL (2022)

I worked on AWS DSQL in 2022 and 2023. Aurora DSQL is a WAN-distributed SQL database. It decomposes the database into independent services: stateless SQL compute nodes, durable storage, a replicated journal, and transaction adjudicators. Transactions execute optimistically and mostly locally. Reads use MVCC snapshots, and writes are buffered without coordination. Only at commit does the system perform global conflict validation and ordering, using adjudicators and the journal to finalize the transaction. This design pushes almost all distributed coordination to commit time, allowing statements inside a transaction to run with low latency while still providing strong transactional guarantees.

I did a first version of the TLA+ modeling of this system. This was great for building confidence in the protocol. After writing the model, I had a better understanding of the invariants. The model also served as a communication tool, keeping people on the same page. When we were trying to get more formal methods support, the TLA+ models sped up the process and anchored the communications. A surprising thing I learned, working as part of a big team, is what a big challenge it is to keep everyone aligned. Brooker had banged out a 100+ page book on the design, which did really help. He had also written PLang models of the system. As far as I know, both models eventually gave way to closer-to-implementation Rust models/testing. I am not able to share the TLA+ models for DSQL, but I hope that when a DSQL publication eventually comes out, it will have some TLA+ models accompanying it.

The lesson here is that TLA+ also works well as a communication tool and as scaffolding for further formal methods and testing support.


4. StableEmptySet (2022)

I worked on this problem for a short time, but I find it still worth mentioning. We needed to implement a distributed set that, once empty, remains empty permanently. This is a crucial property for safely garbage-collecting a set: it ensures we don't add symbolic links to a record that has already been deleted, and later lose durability when garbage collection kicks in.

Ok, but why don't we just check IsEmpty during an add operation, and disallow the addition if it targets an empty set? You don't get such simple luxuries in a distributed system. This is a set implemented in a distributed manner, so we do not have an atomic check for IsEmpty. Think of a LIST scan across many machines; that is inherently a non-atomic check...
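Here is a toy interleaving (my own illustration, unrelated to Ernie's actual protocol) showing how a shard-by-shard scan can report empty even though the set never was: an element migrates from a not-yet-scanned shard to an already-scanned shard mid-scan.

```python
# A toy interleaving showing why a distributed IsEmpty (a LIST scan
# across shards) is not atomic. The set always contains item "x", yet
# the scan reports empty because "x" moves between shards mid-scan.

shards = [set(), {"x"}]          # shard 0 empty, shard 1 holds "x"

snapshot = []
snapshot.append(len(shards[0]))  # scanner reads shard 0: sees 0 items

# Concurrently, "x" is moved from shard 1 to shard 0 (e.g., rebalancing).
shards[1].discard("x")
shards[0].add("x")

snapshot.append(len(shards[1]))  # scanner reads shard 1: sees 0 items

is_empty = sum(snapshot) == 0
print(is_empty)                  # True -- but the set was never empty!
```

This is the classic non-atomic-snapshot trap, and it is exactly the kind of interleaving a model checker surfaces immediately.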

In a distributed system, concurrency is our nemesis, and it makes implementing this StableEmpty protocol very tricky due to many corner cases. Ernie Cohen designed a brilliant/elegant protocol to solve this. Ernie is a genius who lives in an abstract astral plane many miles above us. My role was simply to translate his protocol into TLA+ to bring it down to a concrete plane where mere mortals like me could understand it. Again, sorry that I cannot disclose the model.

The reason I am mentioning this problem is that it radically expanded my horizons on how far we can/should push the fine granularity of atomic actions. Of course the IsEmpty check is non-atomic, but Ernie also pushed the update/communication steps of the algorithm to be as fine-grained as possible, so that there would be no need for distributed locking and the implementation could scale. The problem is that when you develop an algorithm with extremely fine-grained actions, the surface area for operation interleaving and interference explodes. That is why modeling the protocol in TLA+ and model-checking it for correctness becomes non-negotiable.

The anti-pattern here would be to attempt to implement pseudo-atomic checks via distributed locking, or to handle concurrent additions and deletions as ad-hoc operational edge cases, both of which would be doomed to fail at large scale.

The Lesson: Choose atomicity carefully and push for the finest possible granularity, using TLA+ modeling as an exploration tool. TLA+ forces you to define exactly what operations are atomic and helps you model-check whether the interleavings of these operations are safe.


5. PowerSet Paxos (2022)

I helped briefly with this model when my colleague (Chuck Carman) hit an exponential state-space explosion with his distributed transaction protocol, PowerSet Paxos. For a change, this time we have a testimonial from Chuck in his own words.

"The first time I made a distributed transaction protocol, I did it by sitting in a coffee shop with an Excel sheet, trying to come up with sets of rules to evolve the rows over time. This took weeks after work. I hit on the core idea: metadata encoding partial causal orders writing overlapping sets of keys. I didn't trust my algorithm at all. It took another three months to refine the algorithm using a tool to make sure it was correct (TLA+). It took a week to translate the TLA+ algorithm to code. It took way more time to write the test code. Maybe 75-80% of the code is testing all the invariants the TLA+ spec had in it. The long pole there was creating a DSL in Java land to effectively test all of the invariants TLA+ checked. That took a month or two.

For PowerSet Paxos... We haven't had a transaction corrupted yet, and the team is learning Pluscal to apply it to the rest of the system where there are errors around state machine transitions. Much regret has been expressed around not modeling those parts in TLA+ so that the main implementations would be mostly error free."

The Lesson: Code is cheap, testing broken algorithms is expensive.

I hope Chuck can share this model someday, alongside a publication on this protocol. 


6. Secondary Index (2023)

When starting development on a secondary index for DSQL, the engineer who drafted the initial protocol wrote a 6-pager document, as is Amazon's tradition. But he kept finding corner cases with the indexing. This problem, at first, did not look very complex, and therefore did not seem like a good fit for TLA+ modeling. After all, the indexing happens centrally at a node; the concurrency comes from ongoing operations on the database, while the indexing tries to catch up to existing work. This sounded more data-centric than operation/protocol-centric, so it didn't sound like an ideal fit. Here is the description, simplified from the patent that the final protocol ended up being awarded:

"We initiate the creation of a unique secondary index on a live database without interrupting ongoing operations. To achieve this, the system backfills historical data up to a specific point in time while simultaneously applying all new, incoming updates directly to the index. Once the backfill finishes, we perform a final evaluation across the entire index to ensure no duplicate entries slipped through during this highly concurrent phase. If any unique constraint violations are detected, the system immediately flags the error and reports the exact cause."
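The three phases in that description can be sketched crudely (my own simplification in Python, not the patented protocol; the row format and helper names are my inventions):

```python
# A crude sketch of the backfill-plus-live-apply scheme described above.
# We index primary keys by a unique secondary attribute `email` and run
# a final uniqueness evaluation at the end.

table = {1: "a@x", 2: "b@x"}          # primary key -> indexed attribute
index = {}                             # attribute -> set of primary keys

def index_put(key, attr):
    index.setdefault(attr, set()).add(key)

# Phase 1: backfill historical rows up to a cutoff point in time.
for key, attr in list(table.items()):
    index_put(key, attr)

# Phase 2: new writes during backfill go to both the table and the index.
table[3] = "a@x"                       # concurrent insert, duplicate email!
index_put(3, "a@x")

# Phase 3: final evaluation -- flag any unique-constraint violations.
violations = {attr: keys for attr, keys in index.items() if len(keys) > 1}
print(violations)  # {'a@x': {1, 3}}
```

The subtlety the real protocol wrestles with is everything this sketch glosses over: the races between the backfill cursor, concurrent deletes, and in-flight transactions around the cutoff point.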

I was visiting the Seattle offices, and I told him that we should try modeling this in TLA+, given the large number of corner cases popping up. I then wrote a very crude initial model, and apparently that was enough to get him started. I was surprised to find that over the weekend, he had written variations of the model and made improvements to it, without prior TLA+ experience.

The anti-pattern here would be to design/grow the protocol by thinking in control flow and patching corner cases one by one as they arise. Using TLA+ forced a more declarative mathematical approach. It acted as a design accelerator, because it is faster to fix a conceptual model than to whack-a-mole corner cases in code.

The lesson: break the implementation mindset and search the protocol solution space.


7. LeaseGuard: Raft Leases Done Right (2024)

I joined MongoDB Research in 2024. MongoDB has a 10+ year history of TLA+ success, including the logless dynamic reconfiguration, pull-based consensus replication, and extreme modeling projects.

Leader lease design was my first project at MongoDB. We designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash. We wrote about it here. Please go read it, it is a really good read.

Since we are TLA+ fans, we knew the importance of getting started early on modeling the algorithm in TLA+. And this paid off big time: we discovered our two optimizations while writing the TLA+ spec for our initial crude concept of the algorithm. The inherited lease reads optimization was especially surprising to us; we probably wouldn't have realized it was possible if TLA+ wasn't helping us think. We also used TLA+ to check that LeaseGuard guaranteed Read Your Writes and other correctness properties.

The Lesson: Design discovery over verification. TLA+ is useful not just for checking the correctness of a completed design, but for revealing new insights while the design is in progress. Modeling in TLA+ actively informed our protocol and we discovered surprising optimizations by exploring the protocol in TLA+.


8. MongoDB Distributed Transactions Modeling (2025)

This was my second project at MongoDB. We also wrote a blog post on this here, so I am just going to cut to the chase.

In this project, we developed the first modular TLA+ specification of MongoDB's distributed transaction protocol. The model separates the sharded transaction logic from the underlying storage and replication behavior, which lets us reason about the protocol at a higher level while still capturing key cross-layer interactions. Using the model, we validated isolation guarantees, explored anomalies under different ReadConcern and WriteConcern settings, and clarified subtle issues such as interactions with prepared transactions. Our spec is available publicly on GitHub. 

This effort brought much-needed clarity to a big, complex distributed transactions protocol. I believe this is the most detailed TLA+ model of a distributed transactions protocol in existence. While database/systems papers occasionally feature a TLA+ transaction model, those typically focus on a very narrow slice. I am not aware of any other model that captures distributed transactions at this level of granularity. A big value of our model is that it serves as a reference for answering questions about a protocol that spans many teams, many years of development, and several software/service layers.

Furthermore, we used the TLA+ model traces from our spec to generate thousands of unit tests that exercise the actual WiredTiger implementation. This successfully bridged the gap between formal mathematical specification and concrete code.

The Lesson: Models can add value even retroactively, and can have a life beyond the initial design phase.



When I started writing this post this morning, I was originally planning to talk also about how to go about starting with your TLA+ modeling, and how things are/might-be changing in the post LLM world. But this post already got very long, and I will leave that for next time.

March 06, 2026

The first rule of database fight club: admit nothing

I am fascinated by tech marketing but would be lousy at it.

A common practice is to admit nothing -- my product, project, company, idea is perfect. And I get it because admitting something isn't perfect just provides fodder for marketing done by the other side, and that marketing is often done in bad faith.

But it is harder to fix things when you don't acknowledge the problems.

In the MySQL community we did a good job of acknowledging problems -- sometimes too good. For a long time as an external contributor I filed many bug reports, fixed some bugs myself and then spent much time marketing open bugs that I hoped would be fixed by upstream. Upstream wasn't always happy about my marketing, sometimes there was much snark, but snark was required because there was a large wall between upstream and the community. I amplified the message to be heard.

My take is that the MySQL community was more willing than the Postgres community to acknowledge problems. I have theories about why, and I think several factors help explain this:

  • Not all criticism is valid
    • While I spend much time with Postgres on benchmarks, I don't use it in production. I try to be fair and limit my feedback to things where I have sweat equity, but my perspective is skewed. This doesn't mean my feedback is wrong, but my context is different. And sometimes my feedback is wrong.
  • Bad faith
    • Some criticism is done in bad faith. By bad faith I mean that truth takes a back seat to scoring points. A frequent motive for Postgres criticism is promoting another DBMS. Recently I have seen much anti-Postgres marketing from MongoDB. I assume they encounter Postgres as competition more than they used to.
  • Good faith gone bad
    • Sometimes criticism given in good faith will be repackaged by others and used in bad faith. This happens with some of the content from my blog posts. I try to make this less likely by burying the lead in the details but it still happens.
  • MySQL was more popular than Postgres until recently. 
    • Perhaps people didn't like that MySQL was getting most of the attention and admitting flaws might not help with adoption. But today the attention has shifted to Postgres so this justification should end. I still remember my amusement at a Postgres conference long ago when the speaker claimed that MySQL doesn't do web-scale. Also amusing was being told that Postgres didn't need per-page checksums because you should just use ZFS to get similar protection.
  • Single-vendor vs community
    • MySQL is a single-vendor project currently owned by Oracle. At times that enables an us-vs-them mentality (community vs corporation). The corporation develops the product, and it is often difficult for the community to contribute. So it was easy to complain about problems, because the corporation was responsible for fixing them.
    • Postgres is developed by the community. There is no us vs them here, and the community is more reluctant to criticize the product (Postgres). This is human nature, and I see variants of it at work -- my colleagues are far more willing to be critical of the open-source projects we use at work than of the many internally developed projects.

Colorado SB26-051 Age Attestation

Colorado is presently considering a bill, SB26-051, patterned off of California’s AB1043, which establishes civil penalties for software developers who do not request age information for their users. The bills use a broad sense of “Application Store” which would seem to encompass essentially any package manager or web site one uses to download software—GitHub, Debian’s apt repos, Maven, etc. As far as I can tell, if someone under 18 were to run, say, a Jepsen test in California or Colorado, or use basically any Linux program, that could result in a $2500 fine. This (understandably!) has a lot of software engineers freaked out.

I reached out to the very nice folks at Colorado Representative Amy Paschal’s office, who understand exactly how bonkers this is. As they explained it, the Colorado Senate tried to adapt California’s bill closely in the hopes of building a consistent regulatory environment, but there weren’t really people with software expertise involved. Representative Paschal is a former software engineer herself, and was brought in to try and lend an expert opinion. She’s trying to amend the bill so it doesn’t, you know, outlaw most software. Her office has two recommendations for the Colorado bill:

  1. Reach out to Colorado Senator Matt Ball, one of the Senate sponsors.

  2. Please be polite. Folks are understandably angry, but I get the sense that their staffers are taking a bit of a beating, and that’s probably not helping.

I’m not sure what to do with California’s AB 1043 just yet. I called some of the co-sponsors in the California Assembly, and they suggested emailing Samantha Huynh. I wrote her explaining the situation and asking if they had any guidance for how to comply with the bill, but I haven’t heard back yet.

Valkey and Redis Sorted Sets: Leaderboards and Beyond

  This blog post covers the details about sorted set use cases as discussed in this video. Sorted sets are one of the most powerful data structures in Valkey and Redis. While most developers immediately think about “gaming leaderboards” when they hear about sorted sets, this versatile data type can solve many problems, from task […]

March 05, 2026

Log Drains: Now available on Pro

Supabase Pro users can now send their Supabase logs to their own logging backend, enabling them to debug in the same place as the rest of their stack.

Building a Database on S3

Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed.


SQS as WAL and S3 as Pagestore

The 2008 S3 was painfully slow, and 100 ms reads weren't unusual. To hide that latency, the database separates "commit" from "apply". Clients write small, idempotent redo logs to Amazon Simple Queue Service (SQS) instead of touching S3 directly. An asynchronous checkpoint by a client applies those logs to B-tree pages on S3 later.

This design shows strong parallels to modern disaggregated architectures. SQS becomes the write-ahead log (WAL) and logstore. S3 becomes the pagestore. Modern Aurora follows a similar logic: the log is replicated, and storage materializes pages independently. Ok, in Aurora the primary write acknowledgment is synchronous after storage quorum replication, and of course Aurora does not rely on clients to pull logs manually and apply like this 2008 system, but what I am trying to say is the philosophy is identical.


Surviving SQS and building B-link Trees on S3

As mentioned above, to bypass the severe latency of writing full data pages directly to S3, clients commit transactions by shipping small redo log records to SQS queues. Subsequently, clients act as checkpointers, asynchronously pulling these queued logs and applying the updates to their local copies before writing the newly materialized B-tree pages back to S3. This asynchronous log-shipping model means B-tree pages on S3 can be arbitrarily out-of-date compared to the real-time logs in SQS. Working on such stale state seems impossible, but the authors bound the staleness: writers (and probabilistically readers) run asynchronous checkpoints that pull batches of logs from SQS and apply them to S3, keeping the database consistent despite delays.

SQS, however, throws a wrench in the works. I was initially very surprised by the paper’s description of SQS (the 2008 version). It said that a queue might hold 200 messages, but a client requesting 100 could randomly receive only 20. This is because, to provide low latency, SQS does a best-effort poll of a subset of its distributed servers and immediately returns whatever it finds. But don’t worry, the other messages aren’t lost. They sit on servers not checked in that round. But the price of this low latency is that FIFO ordering isn’t guaranteed. The database handles this mess by making log records idempotent, ensuring that out-of-order or duplicate processing never corrupts data.
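The idempotence trick is worth making concrete. Here is a sketch (my own illustration; the record format is an assumption, not the paper's wire format) where each record carries a per-key version, so replays and reordering converge to the same page state:

```python
# A sketch of idempotent log records that tolerate duplicate and
# out-of-order delivery. apply() keeps only the highest version per
# key, so replays and reordering cannot corrupt the page.

def apply(page, record):
    key, version, value = record
    cur_version, _ = page.get(key, (-1, None))
    if version > cur_version:          # stale or duplicate? skip it
        page[key] = (version, value)

records = [("k", 1, "a"), ("k", 2, "b")]

in_order, scrambled = {}, {}
for r in records:
    apply(in_order, r)
for r in [records[1], records[0], records[1]]:   # reordered + duplicated
    apply(scrambled, r)

print(in_order == scrambled)  # True: same final page either way
```

Because apply() is a monotone merge, any delivery order and any number of duplicates yield the same result, which is exactly the property that lets the design survive SQS's quirks.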

The commit protocol in the paper actually starts simple: clients send log records straight to Pending Update (PU) queues. But the problem with this naive direct-write approach is that if the client crashes mid-commit, only some records might make it to the queue, and this breaks atomicity. To fix this issue, the paper proposes an Atomicity protocol: clients first dump all logs plus a final “commit” token into a private ATOMIC queue, then push everything to the public PU queues. This guarantees all-or-nothing transactions, but it’s pricey, since every extra SQS message adds up. At $2.90 per 1,000 transactions, it’s almost twenty times the $0.15 of the naive direct-write approach. So here, consistency comes at a literal monetary cost!
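The all-or-nothing trick of the atomicity protocol is that the commit token goes into the private ATOMIC queue last, so a crashed client can never leak a partial batch. A minimal sketch (my paraphrase; queue names follow the paper, everything else is simplified):

```python
# A sketch of the paper's atomicity protocol: logs plus a final commit
# token go to a private ATOMIC queue first; only batches that reached
# the commit token are forwarded to the public PU queues.

def commit(atomic_queue, log_records, crash_after=None):
    """Stage records then the commit token; crash_after simulates failure."""
    for i, rec in enumerate(log_records):
        if crash_after is not None and i >= crash_after:
            return  # client crashed mid-commit
        atomic_queue.append(rec)
    atomic_queue.append("COMMIT")

def recover(atomic_queue, pu_queue):
    """Forward to the PU queues only if the commit token made it."""
    if atomic_queue and atomic_queue[-1] == "COMMIT":
        pu_queue.extend(atomic_queue[:-1])
    atomic_queue.clear()  # incomplete batch: discarded, nothing leaked

pu = []
ok, crashed = [], []
commit(ok, ["r1", "r2"])
commit(crashed, ["r1", "r2"], crash_after=1)
recover(ok, pu)       # full batch: both records reach the PU queue
recover(crashed, pu)  # partial batch: dropped, atomicity preserved
print(pu)             # ['r1', 'r2']
```

The extra staging hop is where the 20x cost multiplier comes from: every record travels through two queues instead of one, plus the token message.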

The big picture here is about how brutally complex it is to build a real database on dumb cloud primitives. They had to implement Record Managers, Page Managers, and buffer pools entirely on the client side, in order to cluster tiny records into pages. For distributed coordination, they hack SQS into a locking system with dedicated LOCK queues and carefully timed tokens. On top of that, they have to handle SQS quirks, with idempotent log records as we discussed above. The engineering effort is massive.

Finally, to address the slow and weakly consistent S3 reads, the database leans on lock-free B-link trees. That lets readers keep moving while background checkpoints/updates by clients split or reorganize index pages. In B-link trees, each node points to its right sibling. If a checkpoint splits a page, readers just follow the pointer without blocking. Since update corruption is still a risk, a LOCK queue token ensures only one thread checkpoints a specific PU queue at a time. (I told you this is complicated.) The paper admits this is a serious bottleneck: hot-spot objects updated thousands of times per second simply can’t scale under this design.
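The B-link side-step is simple enough to sketch directly (my toy rendering with only the high-key and right-sibling machinery; real B-link nodes of course carry child pointers too):

```python
# A minimal sketch of the B-link trick: every node has a high key and a
# right-sibling pointer, so a reader that lands on a just-split node can
# side-step to the sibling instead of blocking.

class Node:
    def __init__(self, keys, high_key, right=None):
        self.keys, self.high_key, self.right = keys, high_key, right

def search(node, key):
    # If the key is beyond this node's high key, a split moved it right.
    while key > node.high_key and node.right is not None:
        node = node.right
    return key in node.keys

# A split in progress: [1..4] became [1,2] -> [3,4].
right = Node(keys={3, 4}, high_key=4)
left  = Node(keys={1, 2}, high_key=2, right=right)

print(search(left, 3))  # True: the reader followed the sibling pointer
```

The reader never takes a lock; it just chases one extra pointer when the node it reached has been split out from under it.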


Isolation guarantees

In order to prioritize extreme availability, the system throws traditional isolation guarantees out the window. The paper says ANSI SQL-style isolation and strict consistency cannot survive at scale in this architecture. The atomicity protocol prevents dirty reads by ensuring only fully committed logs leave a client’s private queue, but commit-time read-write and write-write conflicts are ignored entirely! If two clients hit the same record, the last-writer wins. So lost updates are common. To make this usable, the authors push consistency up to the client. For ensuring monotonic reads, each client tracks the highest commit timestamp it has seen, and if it sees any older version from S3 it rejects it and rereads. For monotonic writes, the client stamps version counters on log records and page headers. Checkpoints sort logs and defer any out-of-order SQS messages so each client’s writes stay in order.
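The client-side monotonic-reads rule is a one-field state machine; here is a sketch (my own illustration of the rule as described, not the paper's code):

```python
# A sketch of the client-side monotonic-reads rule: the client remembers
# the highest commit timestamp it has seen and rejects any older version
# served by a stale S3 replica, rereading instead.

class Client:
    def __init__(self):
        self.high_ts = 0

    def accept(self, version_ts):
        """Return True iff this read does not go backward in time."""
        if version_ts < self.high_ts:
            return False        # stale replica: reject and reread
        self.high_ts = version_ts
        return True

c = Client()
print(c.accept(5))   # True: first read, remember ts=5
print(c.accept(3))   # False: older version from a lagging replica
print(c.accept(7))   # True: newer version, advance to ts=7
```

Note that this buys each client monotonicity over its own reads only; it does nothing about the write-write conflicts, which remain last-writer-wins.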

I was also surprised by the discussion of stronger isolation in the paper. The paper claims snapshot isolation hasn’t been implemented in distributed systems yet, because it strictly requires a centralized global counter to serialize transactions. This is flagged as a fatal bottleneck and single point of failure.

Looking back, we find this claim outdated. Global counters aren’t a bottleneck for Snapshot Isolation: Amazon Aurora stamps transactions with a Global Log Sequence Number (GLSN) via a primary writer, but still scales (vertically) cleanly without slowing disaggregated storage. More importantly, modern distributed database systems implement snapshot isolation, using loosely synchronized physical clocks (and hybrid logical clocks) to give global ordering with no centralized counter at all. Thank God for synchronized clocks!
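A hybrid logical clock is compact enough to sketch here (a minimal HLC in the standard formulation; physical time is passed in explicitly so the example is deterministic):

```python
# A minimal hybrid logical clock (HLC) sketch: the kind of mechanism
# that gives a global ordering without any centralized counter.

class HLC:
    def __init__(self):
        self.l, self.c = 0, 0          # logical wall time, counter

    def now(self, pt):
        """Timestamp a local or send event at physical time pt."""
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, msg, pt):
        """Merge a received timestamp msg=(l, c) at physical time pt."""
        ml, mc = msg
        l = max(self.l, ml, pt)
        if l == self.l == ml:
            c = max(self.c, mc) + 1
        elif l == self.l:
            c = self.c + 1
        elif l == ml:
            c = mc + 1
        else:
            c = 0
        self.l, self.c = l, c
        return (self.l, self.c)

a, b = HLC(), HLC()
t1 = a.now(pt=10)        # (10, 0)
t2 = b.recv(t1, pt=7)    # b's physical clock lags, yet t2 > t1
print(t1 < t2)           # True: causality preserved despite clock skew
```

The timestamp stays close to physical time but the counter breaks ties, so causally related events always compare in the right order even under clock skew.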


Conclusion

While the paper had to work around the messy 2008 cloud environment, it remains impressive for showing how to build a serverless database architecture on dumb object storage. In recent years, S3 has become faster, and in 2020 it gained strong read-after-write consistency for all PUTs and DELETEs. This made it much easier to build databases (especially for analytical workloads) over S3 directly, and this led to the modern data lake and lakehouse paradigms. We can say this paper laid some groundwork for systems like Databricks (Delta Lake), Apache Iceberg, and Snowflake.

March 03, 2026

800th blog post: Write that Blog!


I had given an email interview to the "Write That Blog!" newsletter. That came out today, which coincided with my 800th blog post. I am including my answers also here. 


Why did you start blogging – and why do you continue?

In 2010, when I was a professor, one of my colleagues in the department was teaching a cloud computing seminar. I wanted to enter that field, coming from the theory of distributed systems and, later, wireless sensor networks. So I attended the seminar. As I read the papers, I started blogging about them. That is how I learn and retain concepts better: by writing about them. Writing things down helps crystallize ideas for me. It lets me understand papers more deeply and build on that understanding. The post on MapReduce, the first paper discussed in the seminar, seems to have opened the floodgates of my blogging streak, which has been going strong for 15 years.

I think a big influence on me has been the EWD documents. I remember the day I came across these as a PhD student. It felt like finding a treasure, a direct gateway into Dijkstra’s brain through his writings. As I wrote in this post, Dijkstra was the original hipster blogger. He was blogging before blogging was cool. “For over four decades, he mailed copies of his consecutively numbered technical notes, trip reports, insightful observations, and pungent commentaries, known collectively as EWDs, to several dozen recipients in academia and industry.” The EWDs go up to EWD 1318. I have a total of 799 posts in my blog as of today. I still have about nine more years to catch up with his sheer post count.

Why did I continue blogging? I continued because I like writing. While writing, I become smarter and more creative. One thought leads to another, one question opens a new investigation, and that leads to an insight. I even sound smart (if I may say so) when I read some of my old posts.

Writing is now a deeply ingrained habit. If I go without blogging for a week, I feel bad. It feels like my creativity got clogged. I am not going to compare myself to artists, but I do feel uneasy if I have not exercised my creativity for a while. My blog became a good vessel for creating and sharing ideas, so I can get them out of my system and make room for new ideas.


What has been the most surprising impact of blogging for you?

Around 2016-2017, I started running into people at conferences who followed my blog, and that surprised me. I had very few page views until then, and I never cared about page views. At first, I thought this was a fluke, but it kept happening more frequently. I was not expecting many people to read my blog because I write for myself first, and I had no expectations that others would read it. (I think this is the difference between intrinsic motivation versus extrinsic motivation. I get the reward while writing the post itself and learning through it. Of course, I am very grateful when people read these posts, and very happy when these turned out to be helpful.)

Related to this, I was surprised by how much developers enjoy reading and following my research paper reviews and summaries. Research papers are often not written to be accessible. They are written to satisfy three reviewers. And consequently, in some parts, the papers get overly defensive. In others, they become pompous and oversell to impress. Knowing how the sausage is made, I think I was able to interpret what was going on and cut to the main ideas and contributions better and translate them more clearly. There is still a large gap in translation and exposition, and I hope more people step in to fill it by blogging.

Another big surprise was how often I refer back to my own blog. Referring back to my posts lets me quickly cache the concepts again. Because I strained my brain while writing them in the first place, reading them later refires the same neurons and helps me reconstruct that state of understanding quickly. In that sense, my blog started acting as an external memory.

I also started pointing my PhD students, and later other people, to my blog. It is an easy and fairly reliable way to transfer knowledge because I package these advice posts neatly for consumption. Here is a snapshot from 2020, and I have written many more advice posts since then.

I actually just realized that I already wrote about why I blog, and I can refer people to that post for more insight into my thought process.


What blog post are you most proud of and why?

It is hard to choose. I think I am proud of all of them, simply because I like having written them and put them out there for others to benefit.

I am particularly fond of my advice posts about writing and research. They may sound a bit cheesy as I write them, and I sometimes feel a bit pompous giving advice. Still, they do help people. I occasionally get feedback about how a post meant a lot to someone, and that means a lot to me as well.

Some posts I like come out very easily, often within half an hour, and end up under the misc label. I wrote a short reminiscence about my life after realizing I am getting old and failing to follow new trends. That post became very popular and reached 100K reads. I also wrote a quick post about my time at MIT, which got 40K reads. I wrote a short post about what my cat taught me about communication. That one didn’t get many views, but I still think the world would be a better place if more people practiced what Pasha intrinsically knows.

On the technical side, it is again difficult to choose. It is futile to estimate the impact of a topic in advance, so I write about what I find interesting or what I am working on. Again, one quick post turned out to be especially impactful. This post about anatomical similarities between Paxos and Bitcoin/Nakamoto consensus protocols became the seed for one of our research papers. This was an example of generative writing. I began writing, and the connections became clearer as I went. The blog is a place where I am free to explore and play with wild ideas like this.

I am also proud of my TLA+ posts. I think the examples I modeled have helped many people get started with TLA+ modeling.


Your advice for people just getting started with blogging?

I wrote about how I write here. My approach is to mess up and tidy up later. I clearly separate drafting from editing.

In this other post, I draw inspiration from a legend about a horse and an outlaw. I must be nuts! But the idea is not that crazy. Keep a file where you dump half-ideas and half-written text. Accumulate as much writing as possible as a braindump, and then edit them by organizing/wrangling the text around. When you have something you are happy with, put it out there and forget about it. Let go of expectations about the fruits of your labor.

Finally, follow your curiosity. No niche is too small. Write for yourself and trust that over time, people will find it. Entertain and serve yourself first. Writing itself should feel good. Try to get your dopamine hit from finishing a post and hitting publish. The more you write, the more you can write. Be intrinsically motivated. I am repeating myself, but it is worth repeating: do not hold expectations about people reading your work.


Anything else you want to add?

I am https://x.com/muratdemirbas on twitter, and https://www.linkedin.com/in/murat-demirbas-distributolog-a2233b176/ on LinkedIn. I also have a newsletter now.


Basic Letters with LaTeX

Every so often I find myself cracking open LibreOffice to write a mildly formal letter—perhaps a thank-you note to an author, or a letter to members of Congress—and going “Gosh, I wish I had LaTeX here”. I used to have a good template for this but lost it years ago; I’ve recently spent some time recreating it using KOMA-Script’s scrlttr2 class. KOMA’s docs are excellent, but there’s a lot to configure, and I hope this example might save others some time.

Here is the TeX file. You should be able to build it with pdflatex example.tex; it’ll spit out a PDF file like this one.

% Skip footer to give us more space on the first page, US letter paper, no fold
% marks, wide body text
\documentclass[version=last,foldmarks=false,paper=letter,
pagesize,firstfoot=false,DIV=13]{scrlttr2}

% Letter paper
\LoadLetterOption{UScommercial9}
% For various widths, see page 163 of the koma-script manual, figure 4.1,
% "Schematic of the pseudo-lengths for a letter",
% https://ctan.math.illinois.edu/macros/latex/contrib/koma-script/doc/scrguide-en.pdf
% I think the default puts the header a little too close to the top
\setplength{firstheadvpos}{1in}
% Bring the addresses and date in a bit from the edges
\setplength{firstheadhpos}{1in}
\setplength{toaddrhpos}{1in}
% Shrink the bottom margins; we don't have any footer and it looks weird
% otherwise
\setlength{\textheight}{9in}
\setlength{\footskip}{0in}
\setlength{\footheight}{0in}
% A little more room to write signatures
\setplength{sigbeforevskip}{5em}
% No page numbers etc
\pagestyle{empty}

\setkomavar{fromname}{Bruciferous Brunchley}
\setkomavar{fromaddress}{4321 Sender Ave\\
New York, NY 10001}
\setkomavar{fromphone}{(415) 308-7203}

\setkomavar{date}{March 3\textsuperscript{rd}, 2026}

\begin{document}
\begin{letter}{Carmen Carbide\\
1234 Recipient Way\\
Portland, OR 97203}

\opening{Dear Mx. Carbide,}

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum...

\closing{Propituitously,}

\ps PS: What a delight to read of your recent adventures underwater. Perhaps
someday I'll join you on a dive.

\end{letter}
\end{document}

What Exactly Is the MySQL Ecosystem?

As we set out to help the MySQL ecosystem assert greater independence from Oracle by establishing a vendor-neutral industry association, we had to confront a deceptively simple question: What exactly is the MySQL ecosystem? There are many views on this question. Some argue it should revolve strictly around the MySQL brand—meaning MariaDB would be excluded. […]

Updating "denormalized" aggregates with "duplicates": MongoDB vs. PostgreSQL

TL;DR: In database-centric development, duplication is dangerous because, in shared, normalized relational databases, it leads to update anomalies. In contrast, in DDD, aggregates own their data, so duplication is intentional and updates are controlled by the application. MongoDB supports this approach with fine-grained updates on document arrays and fields, while PostgreSQL JSONB often requires rewriting entire documents, which can make normalization more appealing. As a result, duplication itself is not the core problem—what matters is who owns the data and how updates are optimized.

In Domain-Driven Design (DDD), aggregates define consistency boundaries, leading to an intentionally denormalized model. In a relational database, this can feel counterintuitive due to concerns over data duplication and update anomalies. In a shared SQL database with direct/unrestricted access (accepting ad-hoc SQL queries across the entire schema), normalization is often preferred to guarantee a single source of truth for each piece of data and to rely on joins/dynamic lookups at query time.

DDD addresses these concerns through the concept of ownership: each aggregate owns its data, and any duplication is deliberate and controlled. Updates occur only through the aggregate root (or the coordinating application service), whose behavior is tested and reviewed before deployment to production. The database is never modified directly from outside this boundary. As a result, changes are propagated safely — often via domain events and eventual consistency — ensuring the value is updated in all necessary places without violating aggregate boundaries. In a bounded context, a single update command can modify many occurrences in a collection of aggregates, but the performance depends on the capability of the database to update within the aggregate.

Example: player's scores

Consider 100,000 game records, each containing 1,000 players and their scores. Each score stores the player’s name directly instead of referencing a separate player document. When a player is renamed—a rare event—the name must be updated in all affected documents with a single updateMany() operation. This is more work than updating a single record, but it removes an extra lookup for every read, which happens far more often, and better matches the aggregate’s read and consistency requirements.
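The trade-off in the paragraph above can be sketched as back-of-the-envelope arithmetic. The helper below is hypothetical and purely illustrative: embedding the name pays a fan-out cost on each rare rename, while referencing it pays one extra lookup on every read.

```javascript
// Compare the two designs in "extra operations paid" over a workload.
// embedded: extra document updates when a rename fans out.
// referenced: extra lookups, one per read, to resolve the player's name.
function extraOps(reads, renames, affectedDocsPerRename) {
  return {
    embedded: renames * affectedDocsPerRename,
    referenced: reads,
  };
}
```

With roughly 10,000 affected documents per rename, embedding comes out ahead as soon as reads outnumber renames by more than 10,000 to 1, which is plausible for score lookups versus player renames.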

Example in MongoDB

I built the following dataset:


db.scores.drop()

db.scores.createIndex({ "players.name": 1 })
db.scores.createIndex({ "players.score": 1 })

async function init(docs, players, totalplayers = 10000, batchsize = 5000) {
  for (let c = 0; c < docs; c += batchsize) {
    await db.scores.insertMany(
      Array.from({ length: batchsize }, () => ({
        players: Array.from({ length: players }, () => ({
          name: `player${Math.round(totalplayers * Math.random())}`,
          score: Math.random()
        }))
      }))
    );
    print(`${db.scores.countDocuments()} documents (with ${players} in array)`);
  }
}

init(1e5, 1e3);

In a normalized relational model, this scenario would use a many-to-many schema: a games table with 100,000 rows, a players table with 10,000 rows, and an association table with 100,000,000 rows. That’s the trade-off for storing each player’s name only once.

Instead, with a single collection, I stored 100,000 documents, each embedding 1,000 scores together with the player’s names:


db.scores.countDocuments();

100000

I created two indexes on fields in the array to show the impact of updates. Each document has 1,000 keys, giving a total of 100,000,000 keys per index:


db.scores.validate({full: true}).keysPerIndex;

{
  _id_: 100000,
  'players.name_1': 95165332,
  'players.score_1': 100000000
}

There are fewer than 100,000,000 keys for "name" because they were generated randomly, and some are duplicated within a document (for example, one player may have multiple scores). These duplicates are removed at the index key level.

Below are the index sizes in MB; each index is a few GB:

db.scores.stats(1024 * 1024).indexSizes;

{
  _id_: 2.1640625,
  'players.name_1': 2216.81640625,
  'players.score_1': 3611.44140625
}

I created two indexes to demonstrate that, in MongoDB’s document storage, updating array elements by deleting and reinserting them with new values does not add overhead to other fields, even when those fields are indexed.

A single player, "player42," has a recorded score in 9.5% of the documents:

db.scores.countDocuments({"players.name":"player42"});

9548

I want to rename "player42" so the new name appears in all historical records, as if it had been normalized to a single value.

UpdateMany() in MongoDB

I can change the name "player42" to "Franck" using a single updateMany() operation on the collection:

db.scores.updateMany(  
  { "players.name": "player42" },  
  { $set: { "players.$[i].name": "Franck" } },  
  { arrayFilters: [{ "i.name": "player42" }] }  
);  

This means:

  1. Select all documents that contain at least one element in the players array with name equal to "player42".
  2. For each of these documents, consider every players element whose name is "player42" as i.
  3. For each such element i, update i.name to "Franck".
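In plain JavaScript terms, the three steps amount to the following. This is a sketch over in-memory objects, not what the server executes, and `renamePlayer` is a name I made up:

```javascript
// Sketch of the updateMany + arrayFilters semantics, applied to plain
// in-memory documents instead of a collection.
function renamePlayer(docs, oldName, newName) {
  let matched = 0, modified = 0;
  for (const doc of docs) {
    // 1. Select documents with at least one matching array element.
    if (!doc.players.some(p => p.name === oldName)) continue;
    matched++;
    // 2.-3. Rename every matching element within the document.
    for (const p of doc.players) {
      if (p.name === oldName) p.name = newName;
    }
    modified++;
  }
  return { matchedCount: matched, modifiedCount: modified };
}
```

A document with several matching elements counts once in modifiedCount, but all of its matching elements are renamed.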

In most cases, it’s fine if this update isn’t atomic. If atomicity is required, execute it within a transaction. It can even be faster within a transaction, as long as the number of updates is bounded, because the writes are synced to disk only once, at commit.

I executed it in a transaction and gathered statistics both before and after the transaction:


function updatePlayer42() { 

  // gather some stats
  const bstat_name = db.scores.stats({ indexDetails: true }).indexDetails["players.name_1"].cursor;
  const bstat_score = db.scores.stats({ indexDetails: true }).indexDetails["players.score_1"].cursor;
  const btime = new Date()
  print(`Start: ${btime}\n Index size:${JSON.stringify(db.scores.stats(1024 * 1024).indexSizes)} MB`);

  // ---- transaction start ----  
  const session = db.getMongo().startSession();  
  const scores = session.getDatabase(db.getName()).scores;  

  try {  

    session.startTransaction();  

    const res = scores.updateMany(  
      { "players.name": "player42" },  
      { $set: { "players.$[m].name": "Franck" } },  
      { arrayFilters: [{ "m.name": "player42" }] }  
    );  

    printjson(res);  

    session.commitTransaction();  

  } catch (e) {  
    print("Transaction aborted:", e);  
    session.abortTransaction();  
  } finally {  
    session.endSession();  
  }  
  // ---- transaction end ----  

  // gather and print stats
  const etime = new Date()
  const estat_name = db.scores.stats({ indexDetails: true }).indexDetails["players.name_1"].cursor;
  const estat_score = db.scores.stats({ indexDetails: true }).indexDetails["players.score_1"].cursor;

  // calculate and print stats
  print(`End: ${etime} - duration: ${ etime-btime} ms\n Index size:${JSON.stringify(db.scores.stats(1024 * 1024).indexSizes)} MB`);
  print(`Index on "players.name" cursor stats:`);
  Object.entries(estat_name).forEach(([k, v]) => { const delta = v - (bstat_name[k] ?? 0); if (delta !== 0) print(` ${k}: ${delta}`); });
  print(`Index on "players.score" cursor stats:`);
  Object.entries(estat_score).forEach(([k, v]) => { const delta = v - (bstat_score[k] ?? 0); if (delta !== 0) print(` ${k}: ${delta}`); });

}

updatePlayer42()

Here is the output:

Start: Mon Mar 02 2026 22:51:31 GMT+0000 (Coordinated Universal Time)
 Index size:{"_id_":2.7734375,"players.name_1":2216.81640625,"players.score_1":3611.44140625} MB
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 9548,
  modifiedCount: 9548,
  upsertedCount: 0
}
End: Mon Mar 02 2026 22:51:43 GMT+0000 (Coordinated Universal Time) - duration: 12472 ms
 Index size:{"_id_":2.7734375, "players.name_1":2216.81640625, "players.score_1":3611.44140625} MB
Index on "players.name" cursor stats:
 cache cursors reuse count: 1
 close calls that result in cache: 2
 create calls: 1
 cursor bounds cleared from reset: 9549
 cursor bounds comparisons performed: 9549
 cursor bounds next called on an unpositioned cursor: 9549
 cursor bounds next early exit: 1
 insert calls: 9548
 insert key and value bytes: 114477
 next calls: 9549
 remove calls: 9548
 remove key bytes removed: 133573
 reset calls: 28648
Index on "players.score" cursor stats:


Only the statistics that changed are shown, as deltas from the values captured before the transaction.

In 12 seconds, this operation atomically updated the name of "player42" in 9,548 documents, out of a total of 100,000 large documents, and the indexes were maintained with strong consistency. The overall index size did not increase significantly. The index on "score" was unchanged. The index on "name" received 9,548 new entries (12 bytes on average) and removed 9,548 entries (14 bytes on average).

Although this is slower than performing a single update in a reference table, it is nowhere near the kind of update nightmare that SQL practitioners typically associate with denormalization. The crucial point is that only the index entries related to the updated array element and field must be maintained. To understand where the fear comes from, let's do the same in a SQL database.

Comparison with PostgreSQL JSONB

SQL databases were designed for normalization, and even if they accept some JSON datatypes, they may not have the same optimizations; such an update is much more expensive.

I created a similar dataset in PostgreSQL 18 with JSONB:


CREATE TABLE scores (  
  id     bigint PRIMARY KEY,  
  doc    jsonb  
)
;

INSERT INTO scores (id, doc)  
SELECT  
  g,  
  jsonb_build_object(  
    'players',  
    jsonb_agg(  
      jsonb_build_object(  
        'name',  'player' || ((g * 1000 + s) % 10000),  
        'score', random()  
      )  
    )  
  )  
FROM generate_series(1, 100000) AS g  
CROSS JOIN generate_series(1, 1000) AS s  
GROUP BY g;  

SELECT count(*) FROM scores 
 WHERE doc @? '$.players[*] ? (@.name == "player42")'
;

CREATE INDEX ON scores 
 USING gin ((doc->'players') jsonb_path_ops)
;

VACUUM ANALYZE scores
;

A single GIN index can cover equality filters (or rather containment) on the two fields.
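Containment here means roughly the following. This is a JavaScript sketch of the semantics for arrays of flat objects, not Postgres's GIN machinery, and `arrayContains` is a name I made up (real jsonb containment also recurses into nested values):

```javascript
// Rough sketch of jsonb array containment (@>): the stored array "contains"
// the probe array if every probe object is matched by some stored element
// carrying at least the same key/value pairs.
function arrayContains(stored, probe) {
  return probe.every(q =>
    stored.some(p => Object.entries(q).every(([k, v]) => p[k] === v)));
}
```

For example, `arrayContains(players, [{ name: "player42" }])` mirrors the SQL predicate `doc->'players' @> '[{"name":"player42"}]'`.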

I executed the same update, using @? in a WHERE clause to retrieve the documents to update via the GIN index, and jsonb_set to modify the name. There is no in-place update of the JSONB content stored in the table: the statement must unnest the JSON array with jsonb_array_elements and rebuild it with jsonb_agg:

EXPLAIN (ANALYZE, BUFFERS, WAL, COSTS OFF)
UPDATE scores     
SET doc = (    
  SELECT jsonb_agg(    
    CASE     
      WHEN p->>'name' = 'player42'     
      THEN jsonb_set(p, '{name}', '"Franck"')    
      ELSE p     
    END    
  )    
  FROM jsonb_array_elements(doc->'players') p    
)    
WHERE (doc->'players') @? '$[*] ? (@.name == "player42")'
;

Here is the execution plan:

                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Update on scores (actual time=13126.454..13126.459 rows=0.00 loops=1)
   Buffers: shared hit=641088 read=39252 dirtied=57413 written=57455
   WAL: records=310285 fpi=31129 bytes=406743374 buffers full=33593
   ->  Bitmap Heap Scan on scores (actual time=6.333..7907.504 rows=10000.00 loops=1)
         Recheck Cond: ((doc -> 'players'::text) @? '$[*]?(@."name" == "player42")'::jsonpath)
         Heap Blocks: exact=736
         Buffers: shared hit=88347 read=32397 written=20050
         ->  Bitmap Index Scan on scores_expr_idx (actual time=3.491..3.491 rows=10000.00 loops=1)
               Index Cond: ((doc -> 'players'::text) @? '$[*]?(@."name" == "player42")'::jsonpath)
               Index Searches: 1
               Buffers: shared hit=1 read=7
         SubPlan 1
           ->  Aggregate (actual time=0.519..0.519 rows=1.00 loops=10000)
                 Buffers: shared hit=60000
                 ->  Function Scan on jsonb_array_elements p (actual time=0.060..0.091 rows=1000.00 loops=10000)
                       Buffers: shared hit=60000
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.647 ms
 Execution Time: 13126.742 ms
(20 rows)

To update 10,000 documents, PostgreSQL generated 310,285 WAL records (~400 MB of WAL, about 40 KB per document) and dirtied 57,413 blocks (~46 KB per document). This is less than rewriting the full raw JSON, thanks to the binary internal representation and TOAST compression, but still far larger than the logical change of a single field update. PostgreSQL JSONB makes denormalization workable but write-amplified when duplicated values must be updated, not because they are duplicated, but because they reside in large, immutable JSONB documents.

Conclusion

In traditional relational databases, avoiding duplication is essential because the database is the main authority for correctness: every application, report, and ad‑hoc SQL query must respect the same invariants. In that context, normalization and foreign keys are the safest way to prevent inconsistencies.

In a Domain‑Driven Design (DDD) architecture, that responsibility shifts. Aggregates define consistency boundaries, a single trusted service mediates database access, and the application enforces invariants. Duplication is therefore intentional and bounded, not accidental.

The experiments above show that this difference is both physical and conceptual. In MongoDB, updating a value embedded in an array changes only the affected elements and their index entries. Index maintenance scales with the number of modified elements, not with overall document size, so even highly denormalized aggregates can still be updated efficiently and atomically.

In PostgreSQL, JSONB supports denormalized structures but with very different update semantics. Any change requires rebuilding the entire JSON value and regenerating all related index entries, regardless of how small the logical update is. Indexes help find the rows, but cannot avoid the cost of rewriting the document.

As a result, the trade‑off is clear: PostgreSQL JSONB–based denormalization mainly optimizes reads while imposing a write cost proportional to document size, whereas MongoDB’s document model supports both read locality and fine‑grained, efficient updates within aggregates. The issue is not whether denormalization is “good” or “bad,” but whether the database’s storage and indexing model fits the aggregate’s read‑write patterns and ownership assumptions.

March 02, 2026

March 01, 2026

PostgreSQL global statistics on partitioned tables require a manual ANALYZE

PostgreSQL's auto-analyze collects statistics on tables that contain rows. For partitioned tables, it skips the parent, which holds no rows by itself. So how does the query planner estimate cardinality when a query spans multiple partitions?

Some statistics are easy to derive: if each partition's row count is known, the global count is their sum. Column statistics are trickier, especially the number of distinct values (NDV), a key factor in estimating cardinalities for predicates and aggregates. Even knowing the NDV of each partition, the planner still doesn't know how much those values overlap across partitions. The global distinct count therefore lies between the maximum per-partition distinct count and the sum of all per-partition counts.
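Those bounds can be sketched in a couple of lines. The helper below is illustrative only; `globalNdvBounds` is a made-up name:

```javascript
// The global number of distinct values lies between the largest per-partition
// NDV (all partitions share the same values) and the sum of per-partition
// NDVs (no value appears in two partitions).
function globalNdvBounds(perPartitionNdv) {
  const lower = Math.max(...perPartitionNdv);
  const upper = perPartitionNdv.reduce((a, b) => a + b, 0);
  return { lower, upper };
}
```

For instance, three partitions with 2 distinct values each could have anywhere from 2 to 6 distinct values globally; without sampling the parent, the planner cannot tell where in that range the truth lies.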

Here is an example:


create table history (
   year int,
   num serial,
   x   int,
   y   int,
   primary key (year, num)
) partition by list (year)
;

create table history_2024 partition of history for values in (2024);
create table history_2025 partition of history for values in (2025);
create table history_2026 partition of history for values in (2026);
create table history_2027 partition of history for values in (2027);

insert into history select 
 extract(year from ( date '2026-01-02' - interval '1 minute' * num ))::int as year
 ,num             -- NDV ≈ rows (unique key, density ≈ 1 / rows)
 ,(num % 2) as x  -- NDV = 2 per partition (very high density: ~50% per value)  
 ,(num / 2) as y  -- NDV ≈ rows / 2 per partition (moderate density: ~2 rows per distinct value) 
 from generate_series(1,1e6) num
;

Here is the real data distribution:

select count(*), year
  , count(distinct x) as "distinct x"
  , min(x) as "min x"
  , max(x) as "max x"
  , count(*)::float / nullif(count(distinct x), 0) as density_x
  , count(distinct y) as "distinct y"
  , min(y) as "min y"
  , max(y) as "max y"
  , count(*)::float / nullif(count(distinct y), 0) as density_y
 from history group by grouping sets ((),year)
;

  count  | year | distinct x | min x | max x | density_x | distinct y | min y  | max y  | density_y
---------+------+------------+-------+-------+-----------+------------+--------+--------+-----------
  472960 | 2024 |          2 |     0 |     1 |    236480 |     236481 | 263520 | 500000 | 1.999991
  525600 | 2025 |          2 |     0 |     1 |    262800 |     262801 |    720 | 263520 | 1.999992
    1440 | 2026 |          2 |     0 |     1 |       720 |        721 |      0 |    720 | 1.997226
 1000000 |      |          2 |     0 |     1 |    500000 |     500001 |      0 | 500000 | 1.999996

(4 rows)

Immediately after the insert, there are no statistics:

select relname, relpages, reltuples, relkind, relhassubclass, relispopulated, relispartition  
 from pg_class where relname like 'history%' order by relkind desc, relname
;

      relname      | relpages | reltuples | relkind | relhassubclass | relispopulated | relispartition
-------------------+----------+-----------+---------+----------------+----------------+----------------
 history_2024      |        0 |        -1 | r       | f              | t              | t
 history_2025      |        0 |        -1 | r       | f              | t              | t
 history_2026      |        0 |        -1 | r       | f              | t              | t
 history_2027      |        0 |        -1 | r       | f              | t              | t
 history           |        0 |        -1 | p       | t              | t              | f
 history_2024_pkey |        1 |         0 | i       | f              | t              | t
 history_2025_pkey |        1 |         0 | i       | f              | t              | t
 history_2026_pkey |        1 |         0 | i       | f              | t              | t
 history_2027_pkey |        1 |         0 | i       | f              | t              | t
 history_num_seq   |        1 |         1 | S       | f              | t              | f
 history_pkey      |        0 |         0 | I       | t              | t              | f
(11 rows)

After a while, autovacuum’s auto-analyze has gathered statistics for the partitions:

select relname, relpages, reltuples, relkind, relhassubclass, relispopulated, relispartition  
 from pg_class where relname like 'history%' order by relkind desc, relname
;

      relname      | relpages | reltuples | relkind | relhassubclass | relispopulated | relispartition
-------------------+----------+-----------+---------+----------------+----------------+----------------
 history_2024      |     2557 |    472960 | r       | f              | t              | t
 history_2025      |     2842 |    525600 | r       | f              | t              | t
 history_2026      |        8 |      1440 | r       | f              | t              | t
 history_2027      |        0 |        -1 | r       | f              | t              | t
 history           |        0 |        -1 | p       | t              | t              | f
 history_2024_pkey |     1300 |    472960 | i       | f              | t              | t
 history_2025_pkey |     1443 |    525600 | i       | f              | t              | t
 history_2026_pkey |        6 |      1440 | i       | f              | t              | t
 history_2027_pkey |        1 |         0 | i       | f              | t              | t
 history_num_seq   |        1 |         1 | S       | f              | t              | f
 history_pkey      |        0 |         0 | I       | t              | t              | f
(11 rows)

However, there are still no global statistics (the "history" table itself). The available column statistics only cover partitions and include the number of distinct values for the two columns:

select tablename, attname, n_distinct, null_frac
 from pg_stats where tablename like 'history%'
;
  tablename   | attname | n_distinct  | null_frac
--------------+---------+-------------+-----------
 history_2025 | year    |           1 |         0
 history_2025 | num     |          -1 |         0
 history_2025 | x       |           2 |         0
 history_2025 | y       | -0.49914953 |         0
 history_2024 | year    |           1 |         0
 history_2024 | num     |          -1 |         0
 history_2024 | x       |           2 |         0
 history_2024 | y       |  -0.4907878 |         0
 history_2026 | year    |           1 |         0
 history_2026 | num     |          -1 |         0
 history_2026 | x       |           2 |         0
 history_2026 | y       | -0.50069445 |         0
(12 rows)

For queries spanning multiple partitions, the optimizer gets good cardinality estimates per partition. However, SELECT DISTINCT over multiple partitions cannot be estimated from the statistics.

For column x, it estimates rows=200 instead of the actual rows=2:

explain (analyze, buffers off, summary off)
 select distinct x from history
;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22949.38..22951.38 rows=200 width=4) (actual time=263.739..263.742 rows=2.00 loops=1)
   Group Key: history.x
   Batches: 1  Memory Usage: 32kB
   ->  Append  (cost=0.00..20444.75 rows=1001850 width=4) (actual time=0.010..149.330 rows=1000000.00 loops=1)
         ->  Seq Scan on history_2024 history_1  (cost=0.00..7286.60 rows=472960 width=4) (actual time=0.009..37.586 rows=472960.00 loops=1)
         ->  Seq Scan on history_2025 history_2  (cost=0.00..8098.00 rows=525600 width=4) (actual time=0.016..41.565 rows=525600.00 loops=1)
         ->  Seq Scan on history_2026 history_3  (cost=0.00..22.40 rows=1440 width=4) (actual time=0.020..0.132 rows=1440.00 loops=1)
         ->  Seq Scan on history_2027 history_4  (cost=0.00..28.50 rows=1850 width=4) (actual time=0.002..0.002 rows=0.00 loops=1)
(8 rows)

For column y, the planner estimates rows=200 instead of the actual 500000:

postgres=# explain (analyze, buffers off, summary off)
 select distinct y from history
;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22949.38..22951.38 rows=200 width=4) (actual time=341.476..527.623 rows=500000.00 loops=1)
   Group Key: history.y
   Batches: 5  Memory Usage: 8265kB  Disk Usage: 16048kB
   ->  Append  (cost=0.00..20444.75 rows=1001850 width=4) (actual time=0.009..151.806 rows=1000000.00 loops=1)
         ->  Seq Scan on history_2024 history_1  (cost=0.00..7286.60 rows=472960 width=4) (actual time=0.009..38.840 rows=472960.00 loops=1)
         ->  Seq Scan on history_2025 history_2  (cost=0.00..8098.00 rows=525600 width=4) (actual time=0.011..43.265 rows=525600.00 loops=1)
         ->  Seq Scan on history_2026 history_3  (cost=0.00..22.40 rows=1440 width=4) (actual time=0.011..0.131 rows=1440.00 loops=1)
         ->  Seq Scan on history_2027 history_4  (cost=0.00..28.50 rows=1850 width=4) (actual time=0.003..0.003 rows=0.00 loops=1)
(8 rows)

Here, the misestimate isn’t an issue by itself, but it could become catastrophic if the query needed to join other tables. Even though the two columns have very different value distributions, PostgreSQL estimated both as rows=200. When no statistics are available for the parent, PostgreSQL falls back to a generic default of 200 distinct values, regardless of how many distinct values each partition actually holds.
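To see why such a misestimate can become catastrophic in a join, here is a back-of-the-envelope sketch (not PostgreSQL's actual planner code) of the textbook equijoin cardinality formula, comparing a fallback of 200 distinct values against the real distinct count of column y:

```python
def equijoin_estimate(rows_outer, rows_inner, nd_outer, nd_inner):
    """Textbook equijoin cardinality: |R| * |S| / max(ndv(R.a), ndv(S.a))."""
    return rows_outer * rows_inner / max(nd_outer, nd_inner)

rows = 1_000_000

# With the generic fallback of 200 distinct values on both sides:
default_est = equijoin_estimate(rows, rows, 200, 200)
# With the real distinct count of ~500000 on both sides:
real_est = equijoin_estimate(rows, rows, 500_000, 500_000)

print(f"{default_est:.0f}")  # 5000000000 (5 billion rows estimated)
print(f"{real_est:.0f}")     # 2000000 (2 million rows estimated)
```

A 2500x overestimate like this can push the planner toward hash joins and large work_mem allocations that the real data never needed.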

Autovacuum’s auto-analyze collects statistics for partitions but not for the partitioned parent. To gather global statistics on the parent, you must run analyze manually: ANALYZE ONLY analyzes just the parent (no recursion), while ANALYZE also recursively analyzes each partition:


ANALYZE history
;

select relname, relpages, reltuples, relkind, relhassubclass, relispopulated, relispartition  
 from pg_class where relname like 'history%' order by relkind desc, relname
;

      relname      | relpages | reltuples | relkind | relhassubclass | relispopulated | relispartition
-------------------+----------+-----------+---------+----------------+----------------+----------------
 history_2024      |     2557 |    472960 | r       | f              | t              | t
 history_2025      |     2842 |    525600 | r       | f              | t              | t
 history_2026      |        8 |      1440 | r       | f              | t              | t
 history_2027      |        0 |        -1 | r       | f              | t              | t
 history           |       -1 |     1e+06 | p       | t              | t              | f
 history_2024_pkey |     1300 |    472960 | i       | f              | t              | t
 history_2025_pkey |     1443 |    525600 | i       | f              | t              | t
 history_2026_pkey |        6 |      1440 | i       | f              | t              | t
 history_2027_pkey |        1 |         0 | i       | f              | t              | t
 history_num_seq   |        1 |         1 | S       | f              | t              | f
 history_pkey      |        0 |         0 | I       | t              | t              | f
(11 rows)

select tablename, attname, n_distinct, null_frac
 from pg_stats where tablename like 'history%'
;

  tablename   | attname | n_distinct  | null_frac
--------------+---------+-------------+-----------
 history      | year    |           3 |         0
 history      | num     |          -1 |         0
 history      | x       |           2 |         0
 history      | y       |        -0.5 |         0
 history_2025 | year    |           1 |         0
 history_2025 | num     |          -1 |         0
 history_2025 | x       |           2 |         0
 history_2025 | y       | -0.49914953 |         0
 history_2024 | year    |           1 |         0
 history_2024 | num     |          -1 |         0
 history_2024 | x       |           2 |         0
 history_2024 | y       |  -0.4907878 |         0
 history_2026 | year    |           1 |         0
 history_2026 | num     |          -1 |         0
 history_2026 | x       |           2 |         0
 history_2026 | y       | -0.50069445 |         0
(16 rows)

Now the query planner knows there are one million rows in total (reltuples = 1e+06), with 2 distinct values for x (n_distinct = 2) and 50% distinct values for y (n_distinct = -0.5: a negative value means the distinct count is expressed as a fraction of the row count, so it scales with the table size).
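The dual encoding of n_distinct can be sketched as a small helper, following the interpretation described in the PostgreSQL documentation (positive = absolute count, negative = negated fraction of rows):

```python
def distinct_estimate(n_distinct: float, reltuples: float) -> float:
    """Turn a pg_stats.n_distinct value into an estimated distinct count:
    a positive value is an absolute count; a negative value is the negated
    fraction of rows that are distinct, scaling with table size."""
    if n_distinct >= 0:
        return n_distinct
    return -n_distinct * reltuples

# Values from the pg_stats output above:
print(distinct_estimate(2, 1e6))     # x: 2 distinct values
print(distinct_estimate(-0.5, 1e6))  # y: 500000.0 distinct values
print(distinct_estimate(-1, 472960)) # num in history_2024: all rows distinct
```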

Now the execution plans on the same queries get better estimates:

postgres=# explain (analyze, buffers off, summary off) select distinct x from history;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22907.01..22907.03 rows=2 width=4) (actual time=265.727..265.729 rows=2.00 loops=1)
   Group Key: history.x
   Batches: 1  Memory Usage: 32kB
   ->  Append  (cost=0.00..20407.01 rows=1000001 width=4) (actual time=0.009..150.276 rows=1000000.00 loops=1)
         ->  Seq Scan on history_2024 history_1  (cost=0.00..7286.60 rows=472960 width=4) (actual time=0.008..37.620 rows=472960.00 loops=1)
         ->  Seq Scan on history_2025 history_2  (cost=0.00..8098.00 rows=525600 width<... (truncated)
                                    

February 27, 2026

From Relational Algebra to Document Semantics

The relational model was designed not to mirror application structures, but to optimize reasoning about data dependencies independently of access patterns. By restricting data to relations in First Normal Form (1NF) and enforcing closure, relational algebra makes query behavior mathematically derivable, so correctness, equivalence, and optimization follow from the model itself rather than implementation-specific rules.

Document databases take the opposite approach: they optimize representation for applications, network, and storage. Domain concepts such as ownership, lifecycle, and cardinality are embedded directly in the data’s shape. Without assuming value atomicity, algebra no longer applies, so semantics must be defined explicitly. This is what MongoDB did when defining query behavior in its document database.

Applications rarely start from relations

Most applications do not model data as relations. They model aggregates. It's obvious in Domain-Driven Design and Object-Oriented analysis, but it was true long before object‑oriented languages. Data structures in applications have always been hierarchical, with sub-structures and repeating groups:

  • Record-based models include nested tables, like in the COBOL data division where relationships are expressed through containment and repetition:
01 EMPLOYEE.
   05 NAME        PIC X(20).
   05 SKILLS.
      10 SKILL    PIC X(10) OCCURS 5 TIMES.
  • Object‑oriented models are more flexible but follow the same pattern with embedded objects, arrays, or lists:
@Entity
class Employee {
    @Id Long id;
    String name;
    @ElementCollection
    List<String> skills;
}

These models express semantics that do not exist in relational algebra:

  • ownership: the skills belong to the employee
  • lifecycle: the root aggregate is deleted with all its elements
  • cardinality: this model assumes a short, bounded list of skills per employee, but a large and growing number of employees
  • locality: employee's skills are queried with the employee, and stored together on disk

Those models are optimized for application behavior, not for algebraic reasoning. This is not new. It is how applications have always been written. Why do relational systems require data to be represented as flat relations at the logical level?

Why the relational model exists

When E. F. Codd published “Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks” in 1969, his objective was not to help developers write applications, but to establish a formal mathematical foundation for data management.

He based the relational model and relational algebra on mathematics:

  • Set theory for relations, with operations such as ∪ (union), ∩ (intersection), − (difference), and × (Cartesian product)
  • First‑order predicate logic for constraints and querying: selection (σ) corresponds to logical predicates, and joins correspond to conjunction with implicit existential quantification (∃) over the join attributes
  • A closed algebraic query language at the logical level, where every operation produces another relation

Within this framework, a relation is defined as:

  • a set of tuples (unordered, with no duplicates)
  • where all tuples share the same attributes
  • and every attribute value is drawn from a simple (atomic) domain

These properties are not modeling advice. They are the definition.

First normal form

First Normal Form (1NF) is often presented as a design guideline. In Codd’s original work, it is not: it is a mandatory constraint, required for first‑order predicate logic to apply.

Without atomic attribute values, relations cease to be relations as defined by Codd, and the standard relational algebra no longer applies. Comparisons become ambiguous, predicates are no longer boolean, and algebraic equivalence rules break down.

Relational algebra is a closed system where inputs and outputs are relations. Its operators—selection (σ), projection (π), join (⨝)—all assume that:

  • attributes can be compared using equality
  • predicates evaluate to true or false
  • each comparison involves exactly one value

This is what enables equivalence rules, join reordering, and provable optimizations, and the maths is defined for atomic values only.
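Before turning to SQL, the closure property can be sketched in Python (a toy model, not a real engine): a relation is a set of rows over atomic values, and every operator consumes relations and produces a relation, so operators compose freely.

```python
# Toy 1NF relational algebra: a relation is a frozenset of rows; each row is
# stored as a sorted tuple of (attribute, atomic value) pairs so it is hashable.

def row(**attrs):
    return tuple(sorted(attrs.items()))

def select(rel, pred):
    """σ: keep the rows whose dict form satisfies a boolean predicate."""
    return frozenset(r for r in rel if pred(dict(r)))

def project(rel, attrs):
    """π: keep some attributes; set semantics deduplicate automatically."""
    return frozenset(tuple(sorted((a, v) for a, v in r if a in attrs)) for r in rel)

employee_skill = frozenset({
    row(employee_name="Ann", skill="SQL"),
    row(employee_name="Ann", skill="Java"),
    row(employee_name="Bob", skill="Java"),
})

# σ(skill = 'Java') then π(employee_name): the output of each step is
# itself a relation, so the composition is well defined.
java = select(employee_skill, lambda t: t["skill"] == "Java")
names = project(java, {"employee_name"})
print(sorted(dict(r)["employee_name"] for r in names))  # ['Ann', 'Bob']
```

The predicate in `select` works only because each attribute holds exactly one comparable value; that is the atomicity assumption 1NF enforces.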

Let's see some examples with SQL, which is inspired by the relational model (but is not itself a pure relational language).

1NF to apply mathematical operations to relations

Here is an example with a table of employees' skills:

CREATE TABLE employee_skill (
    employee_name TEXT,
    skill TEXT
);

INSERT INTO employee_skill VALUES
   ('Ann', 'SQL'),
   ('Ann', 'Java'),
   ('Ann', 'Python'),
   ('Bob', 'Java')
;

A simple selection involves a predicate comparing a column to a value:

SELECT DISTINCT *
   FROM employee_skill
   WHERE skill = 'Java'
;

It returns another relation with only the facts that verify the predicate:


 employee_name | skill
---------------+-------
 Ann           | Java
 Bob           | Java

This works because skill is atomic, equality has one meaning, and the result is still a relation.

This collapses without 1NF

Here is another model, simple, but that violates 1NF:

CREATE TABLE employee_array (
    name TEXT,
    skills TEXT[]
);

INSERT INTO employee_array VALUES
   ('Ann', ARRAY['SQL', 'Java', 'Python']),
   ('Bob', ARRAY['Java'])
;

The same predicate no longer applies:

SELECT DISTINCT *
   FROM employee_array
   WHERE skills = 'Java'
;


ERROR:  malformed array literal: "Java"
LINE 3: WHERE skills = 'Java';
                       ^
DETAIL:  Array value must start with "{" or dimension information.

PostgreSQL requires new operators:

SELECT DISTINCT *
   FROM employee_array
   WHERE 'Java' = ANY(skills)
;

 name |      skills
------+-------------------
 Ann  | {SQL,Java,Python}
 Bob  | {Java}

(2 rows)

SELECT DISTINCT *
   FROM employee_array
   WHERE skills @> ARRAY['Java']
;

 name |      skills
------+-------------------
 Ann  | {SQL,Java,Python}
 Bob  | {Java}

(2 rows)

These operators encode membership and containment. They are not part of relational algebra, and their exact semantics and syntax are vendor‑specific in SQL systems.

SQL/JSON does not restore relational semantics

JSON and JSONB datatypes do not change this:

CREATE TABLE employee_json (
    doc JSONB
);

INSERT INTO employee_json VALUES
  ('{"name": "Ann", "skills": ["SQL", "Java", "Python"]}'),
  ('{"name": "Bob", "skills": ["Java"]}')
;

Selections rely on path navigation and containment, with special operators specific to the datatype:

SELECT DISTINCT *
   FROM employee_json
   WHERE doc->'skills' ? 'Java'
;

                         doc
------------------------------------------------------
 {"name": "Ann", "skills": ["SQL", "Java", "Python"]}
 {"name": "Bob", "skills": ["Java"]}

(2 rows)

SELECT DISTINCT *
   FROM employee_json
   WHERE doc->'skills' @> '["Java"]'
;

                         doc
------------------------------------------------------
 {"name": "Ann", "skills": ["SQL", "Java", "Python"]}
 {"name": "Bob", "skills": ["Java"]}

(2 rows)

This again involves different datatypes, operators, semantics, and also indexing: a different type of index (a GIN index in PostgreSQL) is required to serve those predicates, with different syntax and different capabilities.

JSON has been added to the SQL standard as SQL/JSON but this doesn't unify the semantics. For example, an SQL array starts at 1 and a JSON array starts at 0:

SELECT 
 skills[0] "0",
 skills[1] "1",
 skills[2] "2",
 skills[3] "3"
FROM employee_array
;

 0 |  1   |  2   |   3
---+------+------+--------
   | SQL  | Java | Python
   | Java |      |

(2 rows)

SELECT 
 doc->'skills'->0 "0",
 doc->'skills'->1 "1",
 doc->'skills'->2 "2",
 doc->'skills'->3 "3"
FROM employee_json
;

   0    |   1    |    2     | 3
--------+--------+----------+---
 "SQL"  | "Java" | "Python" |
 "Java" |        |          |

(2 rows)

JSON support in RDBMS extends SQL beyond relational algebra and introduces datatype‑specific semantics that are not algebraically closed. This is expected and was foreseen when enforcing the first normal form. Codd’s insight was that once attributes stop being atomic, mathematics no longer dictates behavior. Meaning must be defined explicitly.

MongoDB’s added semantics

MongoDB embraces the document model directly to match the data representation in the domain model and application structures:


db.employees.insertMany([
  { name: "Bob", skills: "Java" },
  { name: "Ann", skills: ["SQL", "Java", "Python"] }
]);

This is intentionally not 1NF, because multiple entities and values may belong to the same aggregate. The relational operations can no longer rely on the mathematical definitions alone.

Selection resembles the relational operation, but when applied to a non‑1NF collection, MongoDB defines an explicit, extended semantics:


db.employees.find({ skills: "Java" })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

The same predicate applies to scalars and arrays: the document matches if the value itself, or any array element, satisfies the filtering condition. Expressing this in SQL would require a UNION of a query using plain SQL selection and another using JSON containment, with the final result cast to a common datatype.
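The scalar-or-array equality rule can be sketched in Python (a simplification that ignores nested paths and BSON type ordering):

```python
# Simplified model of MongoDB's equality match on a field that may hold
# either a scalar or an array: the document matches if the field equals the
# value, or if any array element does.

def matches(doc, field, value):
    v = doc.get(field)
    if isinstance(v, list):
        return value in v   # any array element may satisfy the predicate
    return v == value       # plain scalar comparison

docs = [
    {"name": "Bob", "skills": "Java"},
    {"name": "Ann", "skills": ["SQL", "Java", "Python"]},
]
print([d["name"] for d in docs if matches(d, "skills", "Java")])  # ['Bob', 'Ann']
```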

With MongoDB, indexes, comparisons, and sorting follow the same rule, as confirmed by execution plans. I create an index on skills that can index scalar values as well as array items, with one index type, and a generic syntax:


db.employees.createIndex({ skills: 1 })
;

It is easy to verify that the index is used to find the documents on a multi-key path (which means including an array in the path):

db.employees.find(
 { skills: "Java" }
).explain().queryPlanner.winningPlan

{
  isCached: false,
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { skills: 1 },
    indexName: 'skills_1',
    isMultiKey: true,
    multiKeyPaths: { skills: [ 'skills' ] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { skills: [ '["Java", "Java"]' ] }
  }
}

The multi-key index has one index entry for each item when it is an array. When a comparison operator is applied to an array field, MongoDB compares the query value to each array element individually.
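The "one entry per array element" rule can be illustrated with a toy index (this mirrors the described behavior, not MongoDB's actual B-tree implementation):

```python
# Toy multikey index: a sorted list of (key, doc_id) entries, with one entry
# per array element for array values and a single entry for scalars.

def index_entries(doc_id, value):
    if isinstance(value, list):
        return [(elem, doc_id) for elem in value]  # one entry per element
    return [(value, doc_id)]

index = []
index += index_entries(1, "Java")                     # Bob: scalar value
index += index_entries(2, ["SQL", "Java", "Python"])  # Ann: array value
index.sort()

# An equality lookup on "Java" finds both documents through their entries:
print(sorted({doc_id for key, doc_id in index if key == "Java"}))  # [1, 2]
```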

I used an equality predicate, but the indexBounds in the execution plan show that ranges work the same way. The same index serves non-equality predicates:

db.employees.find({ skills: { $gt: "M"} })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

db.employees.find({ skills: { $lt: "M"} })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

Only Ann has a skill with a name after 'M' in the alphabet. Both Ann and Bob have a skill with a name before 'M' in the alphabet.

When sorting on an array field, MongoDB uses the minimum array element for ascending sort and the maximum array element for descending sort, according to BSON comparison order:

db.employees.find().sort({ skills: 1 })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

db.employees.find().sort({ skills: -1 })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  }
]

Here, 'Java' is the first in alphabetical order, so both employees are at the same rank in an ascending sort, but 'SQL' is the last in alphabetical order so 'Ann' appears first in a descending sort.
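The sort-key rule described above can be sketched as follows (assuming string comparison stands in for BSON comparison order):

```python
# Sketch of the described sort-key rule for a field that may hold an array:
# ascending sort keys on the minimum element, descending sort keys on the
# maximum; a scalar is its own key.

def sort_key(value, ascending=True):
    if isinstance(value, list):
        return min(value) if ascending else max(value)
    return value

docs = [
    {"name": "Bob", "skills": "Java"},
    {"name": "Ann", "skills": ["SQL", "Java", "Python"]},
]
asc = sorted(docs, key=lambda d: sort_key(d["skills"], True))
desc = sorted(docs, key=lambda d: sort_key(d["skills"], False), reverse=True)

print([d["name"] for d in asc])   # ['Bob', 'Ann']  (both key on 'Java'; stable order)
print([d["name"] for d in desc])  # ['Ann', 'Bob']  ('SQL' sorts after 'Java')
```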

Again, the index is used:

db.employees.find().sort({ skills: 1 }).explain().queryPlanner.winningPlan
;

{
  isCached: false,
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { skills: 1 },
    indexName: 'skills_1',
    isMultiKey: true,
    multiKeyPaths: { skills: [ 'skills' ] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { skills: [ '[MinKey, MaxKey]' ] }
  }
}

MongoDB is optimized for scalable OLTP, where index access is a must for equality and range predicates as well as for the sorts that drive pagination. In SQL databases, inverted indexes such as GIN are typically specialized for containment and equality predicates, and offer more limited support for range ordering and pagination than B‑tree indexes.

Not forcing First Normal Form allows storage and indexing to remain efficient:

  • compound index may include fields from multiple entities within one aggregate
  • storage involves a single disk I/O per aggregate

By deviating from 1NF, closure is no longer guaranteed, and that is by design. An explicit $unwind operation in an aggregation pipeline can normalize the result to a relation, in the mathematical sense, if needed.

With MongoDB, we can list the skills of employees who have the 'Java' skill, with all their skills, as a relational result:

db.employees.aggregate([  
  { $match: { skills: "Java" } },  
  { $unwind: "$skills" },
  { $project: { _id: 0, name: "$name", skill: "$skills" } }
])
;

[
  { name: 'Bob', skill: 'Java' },
  { name: 'Ann', skill: 'SQL' },
  { name: 'Ann', skill: 'Java' },
  { name: 'Ann', skill: 'Python' }
]

This simple query in MongoDB would be much more complex with the PostgreSQL examples above:

-- Correlated semi-join (EXISTS) over a normalized relation
SELECT DISTINCT es.employee_name AS name, es.skill  
  FROM employee_skill es  
  WHERE EXISTS (  
    SELECT 1  
    FROM employee_skill j  
    WHERE j.employee_name = es.employee_name  
      AND j.skill = 'Java'
  )
;  

-- Existential quantification (ANY) over a non-1NF attribute (ARRAY) with explicit normalization (UNNEST)
SELECT DISTINCT ea.name, s.skill  
  FROM employee_array ea  
  CROSS JOIN LATERAL unnest(ea.skills) AS s(skill)  
  WHERE 'Java' = ANY (ea.skills)
;  

-- JSON containment predicate (@>) with explicit normalization (jsonb_array_elements)
SELECT DISTINCT doc->>'name' AS name, skill  
  FROM employee_json  
  CROSS JOIN LATERAL jsonb_array_elements_text(doc->'skills') AS s(skill)  
  WHERE doc->'skills' @> '["Java"]'
;  

Those queries must return all of each employee's skills, not only the one used in the filter, because they are part of the same aggregate. With those SQL queries, the object-relational mapper (ORM) must then regroup the rows to rebuild the aggregate.
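That client-side regrouping step, which the flat row shape forces on the ORM, can be sketched with a few lines of Python over the rows the SQL queries would return:

```python
from itertools import groupby
from operator import itemgetter

# Flat (name, skill) rows, as returned by the SQL queries above.
rows = [
    ("Ann", "SQL"), ("Ann", "Java"), ("Ann", "Python"),
    ("Bob", "Java"),
]
rows.sort(key=itemgetter(0))  # groupby requires rows adjacent by key

# Rebuild one aggregate per employee from the flat rows.
aggregates = [
    {"name": name, "skills": [skill for _, skill in group]}
    for name, group in groupby(rows, key=itemgetter(0))
]
print(aggregates)
# [{'name': 'Ann', 'skills': ['SQL', 'Java', 'Python']},
#  {'name': 'Bob', 'skills': ['Java']}]
```

MongoDB skips this step entirely because the aggregate is never flattened in the first place.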

In practice, the MongoDB query will not even $unwind to mimic a relation, as it retrieves the aggregate directly:

db.employees.aggregate([  
  { $match: { skills: "Java" } },  
  { $project: { _id: 0, name: 1, skills: 1 } }
])
;

[
  { name: 'Bob', skills: 'Java' },
  { name: 'Ann', skills: [ 'SQL', 'Java', 'Python' ] }
]

With this query, MongoDB returns the binary BSON object directly to the application driver, instead of converting it into records or JSON like most SQL databases.

Conclusion

We examined the extended semantics for selection over a non-1NF collection as an example, but MongoDB does more than extend selection. All relational operations are given a document semantics:

  • Selection works over scalars and arrays
  • Projection reshapes documents
  • Sort semantics are defined over arrays
  • Indexes apply uniformly to scalars and array elements
  • Joins exist ($lookup), and their semantics are defined even if the key is an array.

Relational theory is independent of physical implementation, but most RDBMS map each relation to a single table, and an index can cover the columns from a single table. Relational databases stem from mathematics and use normalization to structure storage. In contrast, applications center on aggregates, and MongoDB preserves these aggregates down to the storage layer.

First Normal Form (1NF) is required by the relational model—and therefore by relational SQL databases—because relational algebra assumes atomic attributes. In contrast, document databases relax these algebraic constraints in favor of explicit, document-oriented semantics, enabling similar operations to be performed directly on documents. So, when you hear that “data is relational” or feel you must apply normal forms to your domain model, recognize that this perspective is tied to a specific relational database implementation, not the nature of your data. The same data can support the same application with a document model.

Both models are valid but optimize for different abstraction layers: relational algebra offers a logical model for reasoning about data independently of the domain, while document databases model data as applications already do.