A curated list of database news from authoritative sources

March 15, 2026

The Serial Safety Net: Efficient Concurrency Control on Modern Hardware

This paper proposes a way to get serializability without completely destroying your system's performance. I quite like the paper, as it flips the script on how we think about database isolation levels. 


The Idea

In modern hardware setups (where we have massive multi-core processors, huge main memory, and I/O is no longer the main bottleneck), strict concurrency control schemes like Two-Phase Locking (2PL) choke the system due to contention on centralized structures. To keep things fast, most systems default to weaker schemes like Snapshot Isolation (SI) or Read Committed (RC) at the cost of allowing dependency cycles and data anomalies. Specifically, RC leaves your application vulnerable to unrepeatable reads as data shifts mid-flight, while SI famously opens the door to write skew, where two concurrent transactions update different halves of the same logical constraint.

Can we have our cake and eat it too? The paper introduces the Serial Safety Net (SSN): a certifier that sits entirely on top of fast weak schemes like RC or SI, tracking the dependency graph and blessing a transaction only if it is serializable with respect to the others.

Figure 1 shows the core value proposition of SSN. By layering SSN onto high-concurrency but weak schemes like RC or SI, the system eliminates all dependency cycles to achieve serializability without the performance hits seen in 2PL or Serializable Snapshot Isolation (SSI).


SSN implementation

When a transaction T tries to commit, SSN calculates a low watermark $\pi(T)$ (the commit time of the oldest transaction in the future that depends on T) and a high watermark $\eta(T)$ (the commit time of the newest transaction in the past that T depends on). If $\pi(T) \le \eta(T)$, it means the past has collided with the future, and a dependency cycle may have closed. SSN aborts the transaction.

Because SSN throws out any transaction that forms a cycle, the final committed history is mathematically guaranteed to be cycle-free, and hence Serializable (SER).
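As a rough sketch of that pre-commit test (my own illustration with simplified bookkeeping; the real SSN maintains $\pi$ and $\eta$ incrementally on tuple versions rather than on explicit lists of stamps):

```python
# Illustrative sketch of SSN's exclusion-window test.
# pi(T):  earliest commit stamp among T's successors (transactions that
#         must serialize AFTER T, e.g., via read anti-dependencies).
# eta(T): latest commit stamp among T's committed predecessors
#         (transactions that must serialize BEFORE T).

INF = float("inf")

def ssn_precommit(successor_stamps, predecessor_stamps):
    """Return 'commit' if the exclusion window holds, else 'abort'."""
    pi = min(successor_stamps, default=INF)       # low watermark
    eta = max(predecessor_stamps, default=-INF)   # high watermark
    # If a predecessor committed at or after T's oldest successor,
    # the past overlaps the future: a dependency cycle may have closed.
    return "abort" if pi <= eta else "commit"

# A predecessor (stamp 7) committed after T's oldest successor (stamp 5):
print(ssn_precommit([5, 9], [3, 7]))   # abort
# All predecessors committed strictly before every successor:
print(ssn_precommit([8], [3, 6]))      # commit
```

Note how collapsing the graph into just two numbers makes the test cheap but conservative, which is exactly the false-positive trade-off discussed below.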

Figure 2 illustrates how SSN detects serialization cycles using a serial-temporal graph. The x-axis represents the dependency order, while the y-axis tracks the global commit order. Forward dependency edges point upward, and backward edges (representing read anti-dependencies) point downward. Subfigures (a) and (b) illustrate a transaction cycle closing and the local exclusion window violation that triggers an abort: transaction T2 detects that its predecessor T1 committed after T2's oldest successor, $\pi(T2)$. This overlap proves T1 could also act as a successor, forming a forbidden loop.

Subfigures (c) and (d) demonstrate SSN's safe conditions and its conservative trade-offs. In (c), the exclusion window is satisfied because the predecessor T3 committed before the low watermark $\pi(Tx)$, making it impossible for T3 to loop back as a successor. Subfigure (d), however, shows a false positive where transaction T3 is aborted because its exclusion window is violated, even though no actual cycle exists yet. This strictness is necessary, though: allowing T3 to commit would be dangerous, as a future transaction could silently close the cycle later without triggering any further warnings. Since SSN summarizes complex graphs into just two numbers ($\pi$ and $\eta$), it will sometimes abort a transaction simply because the exclusion window was violated, even if a true cycle hasn't formed yet.

SSN vs. Pure OCC

Now, you might be asking: Wait, this sounds a lot like Optimistic Concurrency Control (OCC), so why not just use standard OCC for Serializability?

Yes, SSN is a form of optimistic certification, but the mechanisms are different, and the evaluation section of the paper exposes exactly why SSN is a superior architecture for high-contention workloads.

Standard OCC does validation by checking exact read/write set intersections. If someone overwrote your data, you abort. The problem is the OCC Retry Bloodbath! When standard OCC aborts a transaction, retrying it often throws it right back into the exact same conflict because the overwriting transaction might still be active. In the paper's evaluation, when transaction retries were enabled, the standard OCC prototype collapsed badly, wasting over 60% of its CPU cycles just fighting over index insertions.
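For contrast, here is a toy version of classic backward OCC validation (my own simplification, not the paper's prototype): a transaction commits only if no concurrent committed transaction wrote anything it read.

```python
# Toy backward-validation OCC check: abort on any read/write overlap
# with a concurrently committed transaction.

def occ_validate(read_set, concurrent_write_sets):
    """True (commit) iff no concurrent writer touched anything we read."""
    return all(not (read_set & ws) for ws in concurrent_write_sets)

print(occ_validate({"a", "b"}, [{"c"}]))        # True: no overlap, commit
print(occ_validate({"a", "b"}, [{"b", "d"}]))   # False: "b" was overwritten, abort
```

The exact-intersection test is simple, but it gives no information about whether a retry will fare any better, which is the root of the retry problem described above.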

SSN, however, possesses the "Safe Retry" property. If SSN aborts your transaction T because a predecessor U violated the exclusion window, U must have already committed. When you immediately retry, the conflict is physically in the past; your new transaction simply reads $U$'s freshly committed data, bypassing the conflict entirely. SSN's throughput stays stable under pressure while OCC falls over.


Discussion

So what do we have here? SSN offers a nice way to get to SER, while keeping decent concurrency. It proves that with a little bit of clever timestamp math, you can turn a dirty high-speed concurrency scheme into a serializable one.

Of course, no system is perfect. If you are going to deploy SSN, you have to pay the piper. Here are some critical trade-offs.

To track these dependencies, SSN requires you to store extra timestamps on every single version of a tuple in your database. In a massive in-memory system, this metadata bloat is a significant cost compared to leaner OCC implementations.

SSN is also not a standalone silver bullet for full serializability. While it is great at tracking row-level dependencies on existing records, it does not natively track phantoms (range-query insertions). Because an acyclic dependency graph only guarantees serializability in the absence of phantoms, you cannot just drop SSN onto vanilla RC or SI; you must actively extend the underlying CC scheme with separate mechanisms like index versioning or key-range locking to prevent them.


To close out the SSN discussion, let's address one final architectural puzzle. If you've been following the logic so far, you might have noticed a glaring question. The paper demonstrates that layering SSN on top of Read Committed guarantees serializability (RC + SSN = SER). It also shows that doing the exact same thing with Snapshot Isolation gets you to the exact same destination (SI + SSN = SER). If both combinations mathematically yield a serializable database, why would we ever willingly pay the higher performance overhead of Snapshot Isolation? Why would we want SI+SSN when we have RC+SSN at home?

While layering SSN on top of Read Committed (RC) guarantees a serializable outcome, it exposes your application to in-flight problems. Under RC, reads simply return the newest committed version of a record and never block. This means the underlying data can change right under your application's feet while the transaction is running. Your code might read Account A, and milliseconds later read Account B after a concurrent transfer committed, seeing a logically impossible total: an inconsistent snapshot. Even though SSN will ultimately catch this dependency cycle and safely abort the transaction during the pre-commit phase, your application logic might crash before it ever reaches that protective exit door. Furthermore, even if your code survives the run, this late-abort mechanism hides a big performance penalty: your system might burn a lot of CPU and memory executing a complex doomed transaction, only for SSN to throw away all that work at the final commit check.
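A toy illustration of that in-flight inconsistency (hypothetical account values, not real database code): under RC semantics, each read sees the newest committed state at read time, so a reader interleaved with a transfer can observe an impossible total.

```python
# Two accounts with the invariant A + B == 200.
accounts = {"A": 100, "B": 100}

def transfer(amount):
    # A concurrent transaction that commits mid-flight.
    accounts["A"] -= amount
    accounts["B"] += amount

# Reader under Read Committed: each read returns the latest committed value.
a = accounts["A"]      # reads A == 100
transfer(50)           # concurrent transfer commits between the two reads
b = accounts["B"]      # reads B == 150
print(a + b)           # 250 -- a logically impossible total
```

Under SI, both reads would come from the same start-time snapshot, and the total would always be 200.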

This is why we gladly pay the extra concurrency control overhead for SI. Under SI, each transaction reads from a perfectly consistent snapshot of the database taken at its start time. From your application's perspective, time stops, completely shielding your code from ever seeing those transiently broken states mid-flight. However, as we mentioned in the beginning, SI still allows write skew, and pairing it with SSN covers that gap to guarantee serializability.


If you'd like to dive into this more, the authors later published a 20-page journal version here. I also found a recent follow-up by Japanese researchers here.

March 13, 2026

How to Customize PagerDuty Custom Details in Grafana: The Hidden Override Method

The Problem If you’ve integrated Grafana Alerting with PagerDuty, you’ve probably noticed something frustrating: the PagerDuty incident details are cluttered with every single label and annotation from your alerts. This wall of text makes it hard for your on-call engineers to quickly identify what’s wrong. And actually, this was […]

Scaling Postgres connections with PgBouncer

PgBouncer is the perfect pairing for Postgres's biggest weakness: connection management. Tuning it just right is important to make this work well, and here we cover everything you need to know.

March 12, 2026

A Failing Unit Test, a Mysterious TCMalloc Misconfiguration, and a 60% Performance Gain in Docker

We are pleased to share the news of a recent fix, tracked as PSMDB-1824/SMDB-1868, that has delivered significant, quantifiable performance enhancements for Percona Server for MongoDB instances, particularly when running in containerized environments like Docker. Percona Server for MongoDB version 8.0.16-5, featuring this improvement, was made available on December 2, 2025. Investigation The initial issue […]

March 11, 2026

Migrate Cloud SQL for MySQL to Amazon Aurora and Amazon RDS for MySQL Using AWS DMS

In this post, we demonstrate how to migrate from Cloud SQL for MySQL 8+ to Amazon RDS for MySQL 8+ or Amazon Aurora MySQL–Compatible using AWS DMS over an AWS Site-to-Site VPN. We cover preparing the source and target environments, establishing cross-cloud connectivity, and setting up DMS tasks.

Enzyme Detergents are Magic

This is one of those things I probably should have learned a long time ago, but enzyme detergents are magic. I had a pair of white sneakers that acquired some persistent yellow stains in the poly mesh upper—I think someone spilled a drink on them at the bar. I couldn’t get the stain out with Dawn, bleach, Woolite, OxiClean, or athletic shoe cleaner. After a week of failed attempts and hours of vigorous scrubbing I asked on Mastodon, and Vyr Cossont suggested an enzyme cleaner like Tergazyme.

I wasn’t able to find Tergazyme locally, but I did find another enzyme cleaner called Zout, and it worked like a charm. Sprayed, rubbed in, tossed in the washing machine per directions. Easy, and they came out looking almost new. Thanks Vyr!

Also the vinegar and baking soda thing that gets suggested over and over on the web is nonsense; don’t bother.

March 10, 2026

TLA+ as a Design Accelerator: Lessons from the Industry

After 15+ years of using TLA+, I now think of it as a design accelerator. One of the purest intellectual pleasures is finding a way to simplify and cut out complexity. TLA+ is a thinking tool that lets you do that.

TLA+ forces us out of implementation-shaped and operational reasoning into mathematical declarative reasoning about system behavior. Its global state-transition model and its deliberate fiction of shared memory make complex distributed behavior manageable. Safety and liveness become clear and compact predicates over global state. This makes TLA+ powerful for design discovery. It supports fast exploration of protocol variants and convergence on sound designs before code exists.

TLA+ especially shines for complex distributed/concurrent systems. In such systems, complexity exceeds human intuition very quickly. (I often point to very simple interleaving/nondeterministic execution puzzles to show how much we suck at reasoning about such systems.) Testing is inadequate for subtle design errors in complex distributed/concurrent systems. Code may faithfully implement a design, but the design itself can fail in rare concurrent scenarios. TLA+ provides an exhaustively testable design; it catches design errors before code, and enables rapid "what if" design explorations and aggressive protocol optimizations safely.

This is why there are so many cases of TLA+ modeling used in industry, including Amazon/AWS, Microsoft Azure, MongoDB, Oracle Cloud, Google, LinkedIn, Datadog, Nike, Intel.

In this post, I will talk about the TLA+ modeling projects I worked on, mostly in industry. (OK, I counted: 8 projects. I will tell you about all 8.)


1. WPaxos (2016)

This is the only non-industry experience I will mention. It is from my work on WPaxos at SUNY Buffalo, with my PhD students.

WPaxos adapts Paxos for geo-distributed systems by combining flexible quorums with many concurrent leaders. It uses two quorum types: a phase-1 quorum (Q1) used to establish or steal leadership for an object, and a phase-2 quorum (Q2) used to commit updates. Flexible quorum rules require only that every Q1 intersect every Q2, not that all quorums intersect each other. WPaxos exploits this by placing Q2 quorums largely within a single region, so the common case of committing updates happens with low local latency, while the rarer Q1 leader changes span zones. Nodes can steal leadership of objects via Q1 when they observe demand elsewhere, in order to migrate the object's ownership toward the region issuing the most writes. Safety is assured because any attempted commit must pass through a Q2 quorum that intersects prior Q1 decisions, preventing conflicting updates despite failures, network delays, and concurrent leaders.
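The flexible-quorum condition WPaxos relies on can be sanity-checked in a few lines. The zone layout below is hypothetical, purely to make the intersection requirement concrete:

```python
# Flexible-quorum check: every phase-1 quorum must share at least one
# node with every phase-2 quorum (Q1s need not intersect other Q1s).

def quorums_intersect(q1_quorums, q2_quorums):
    """True iff every Q1 intersects every Q2."""
    return all(q1 & q2 for q1 in q1_quorums for q2 in q2_quorums)

# 3 zones x 2 nodes per zone.
# Q2 quorums stay local: both nodes of a single zone (fast local commits).
q2s = [{"z0a", "z0b"}, {"z1a", "z1b"}, {"z2a", "z2b"}]
# Q1 quorums span zones: one node from every zone (rarer leader changes).
q1s = [{"z0a", "z1a", "z2a"}, {"z0b", "z1b", "z2b"}, {"z0a", "z1b", "z2a"}]
print(quorums_intersect(q1s, q2s))       # True

# A bad layout: a Q1 that skips zone 2 misses zone 2's local Q2.
bad_q1s = [{"z0a", "z1a"}]
print(quorums_intersect(bad_q1s, q2s))   # False
```

Getting these intersections right across zones is exactly the kind of definitional subtlety the TLA+ modeling helped sharpen.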

I explained the basic WPaxos protocol here, if you'd like to read more. (Sadly there was never a part 2 for that post. I don't know if anyone is using WPaxos in production, but it is a really good idea and I hope to hear about deployments in the wild.) As for our use of TLA+ for the protocol, it came early on. After we had the intuitive idea of the protocol, we knew we needed strong modeling support in order to get this complex thing completely right. The modeling also helped us sharpen our definitions. It is not straightforward to define quorums across zones while getting the intersections right. The TLA+ modeling was so useful, in fact, that we used TLA+/PlusCal snippets in our paper to explain concepts (a model-checking-validated spec, rather than the hail-Mary pseudocode everyone else uses). The definitions also came from our TLA+ formal definitions.

The lesson learned: Model early! Like we predicted, we got a lot of mileage by modeling in TLA+ early on in this project.


2. CosmosDB (2018)

During my sabbatical at Microsoft Azure CosmosDB, I helped specify the database's client-facing consistency semantics. The nice thing about these specs was that they didn't need to model internal implementation. The goal was to capture the consistency semantics for clients in a precise manner, rather than providing ambiguous English explanations. The model aimed to answer the question: What kind of behavior should a client be able to witness while interacting with the service?

The anti-pattern here would be to try to model the distributed database engine. Trying to show how the replication/coordination works would have led to immediate state-space explosion and an unreadable/unusable specification. Instead we modeled at a high level and homed in on the "history as a log" abstraction to represent/capture the user-facing concurrency.

The model is available here.  The "history" variable records all client operations (reads and writes) in order, and each consistency level validates different properties against it:

  • Strong Consistency enforces Linearizability: reads must always see the latest write globally across all regions
  • Bounded Staleness uses ReadYourWrite: clients see their own writes, plus bounded staleness by K operations
  • Session uses MonotonicReadPerClient: a client's reads are monotonic (never go backward)
  • Consistent Prefix uses MonotonicWritePerRegion: writes in a region appear in order
  • Eventual uses Eventual Convergence: reads eventually see writes that exist in the database.
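One of the per-client properties above can be sketched as a check over the history log. The encoding is my own (each operation as a (client, kind, seq) tuple, where seq is the log position of the write a read observed), not the spec's:

```python
# Check MonotonicReadPerClient against a "history as a log":
# a client's reads must never observe an older write than a previous read.

def monotonic_read_per_client(history):
    last_seen = {}   # client -> newest write position observed so far
    for client, kind, seq in history:
        if kind == "read":
            if seq < last_seen.get(client, -1):
                return False     # this client's view went backward
            last_seen[client] = seq
    return True

ok = [("c1", "read", 1), ("c2", "read", 3), ("c1", "read", 2)]
bad = [("c1", "read", 3), ("c1", "read", 1)]   # c1's read went backward
print(monotonic_read_per_client(ok))    # True
print(monotonic_read_per_client(bad))   # False
```

The other consistency levels are checked the same way: each one is just a different predicate over the same shared history variable.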

The replication macro was particularly high-level and clever. When region d replicates from region s, it merges both write histories, sorts them, and deduplicates to get a consistent, monotonically increasing sequence of writes. After replication, Data[d] is set to the last value in the merged database, ensuring regions eventually converge to the same state. 
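A rough sketch of that replication step, assuming each write is a (sequence number, value) pair (names and shapes are mine, not the spec's):

```python
# Region d replicates from region s: merge both write histories,
# sort, and deduplicate into one monotonically increasing sequence.

def replicate(history_s, history_d):
    """Return the merged, sorted, deduplicated write history."""
    return sorted(set(history_s) | set(history_d))

region_s = [(1, "x"), (2, "y"), (4, "w")]
region_d = [(1, "x"), (3, "z")]

new_d = replicate(region_s, region_d)
print(new_d)             # [(1, 'x'), (2, 'y'), (3, 'z'), (4, 'w')]
data_d = new_d[-1][1]    # Data[d] := last value in the merged history
print(data_d)            # 'w'
```

Because merge-sort-dedup is commutative across regions, repeated replication drives all regions to the same history, which is the eventual-convergence argument in miniature.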

The lesson here is to model minimalistically. A model does not have to capture everything to be highly valuable; it just needs to capture the part/behavior that matters.

This minimalistic model served as precise documentation for outside-facing behavior, replacing ambiguous English explanations, and became foundational enough that a 2023 academic paper built/improved on it. I talked about this improved model here. The history/log was again the main abstraction in that model. The 2023 paper accompanying the model has this great opening paragraph, which echoes the experience of everyone that has painstakingly specified a distributed/concurrent system behavior:

"Consistency guarantees for distributed databases are notoriously hard to understand. Not only can distributed systems inherently behave in unexpected and counter-intuitive ways due to internal concurrency and failures, but they can also lull their users into a false sense of functional correctness: most of the time, users of a distributed database will witness a much simpler and more consistent set of behaviors than what is actually possible. Only timeouts, fail-overs, or other rare events will expose the true set of behaviors a user might witness. Testing for these scenarios is difficult at best: reproducing them reliably requires controlling complex concurrency factors, latency variations, and network behaviors. Even just producing usable documentation for developers is fundamentally challenging and explaining these subtle consistency issues via documentation comes as an additional burden to distributed system developers and technical writers alike."


3. AWS DistSQL (2022)

I worked on AWS DistSQL in 2022 and 2023. Aurora DSQL is a WAN-distributed SQL database. It decomposes the database into independent services: stateless SQL compute nodes, durable storage, a replicated journal, and transaction adjudicators. Transactions execute optimistically and mostly locally. Reads use MVCC snapshots, and writes are buffered without coordination. Only at commit does the system perform global conflict validation and ordering, using the adjudicators and the journal to finalize the transaction. This design pushes almost all distributed coordination to commit time, allowing statements inside a transaction to run with low latency while still providing strong transactional guarantees.

I did a first version of the TLA+ modeling of this system. This was great for building confidence in the protocol. After writing the model, I had a better understanding of the invariants. The model also served as a communication tool, to keep people on the same page. When we were trying to get more formal methods support, the TLA+ models sped up the process and anchored the communications. This was a surprising thing I learned working as part of a big team: what a big challenge it is to keep everyone aligned. Brooker had banged out a 100+ page book on the design, which did really help. He had also written P models of the system. As far as I know, both models eventually gave way to closer-to-implementation Rust models/testing. I am not able to share the TLA+ models for DSQL, but I hope that when a DSQL publication eventually comes out, it will have some TLA+ models accompanying it.

The lesson here is that TLA+ also works well as a communication tool and as scaffolding for further formal methods and testing support.


4. StableEmptySet (2022)

I worked on this problem for a short time, but I find it still worth mentioning.  We needed to implement a distributed set that, once empty, remains empty permanently. This is a crucial property for safely garbage-collecting a set and ensuring we don't add symbolic links to a record that has already been deleted, and later lose durability when garbage collection kicks in.

Ok, but why don't we just check IsEmpty during an add operation, and disallow the addition if it targets an empty set? You don't get to have such simple luxuries in a distributed system. This is a set implemented in a distributed manner, so we do not have an atomic check for IsEmpty. Think of a LIST scan across many machines, which is inherently a non-atomic check...
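Here is a toy demonstration (my own construction, not the actual protocol) of why a shard-by-shard LIST scan is not an atomic IsEmpty check: an element that moves from a not-yet-scanned shard to an already-scanned shard is missed entirely.

```python
# Two shards; the set holds one element "x" on shard 1.
shards = {0: set(), 1: {"x"}}

def scan_is_empty(shards, interfere=None):
    """Scan shards in order; optionally run a concurrent step mid-scan."""
    for i in sorted(shards):
        if shards[i]:
            return False
        if interfere:
            interfere()
            interfere = None   # the concurrent step runs once, after shard 0
    return True

def move_x():
    # A concurrent operation relocates "x" behind the scanner's back.
    shards[1].discard("x")
    shards[0].add("x")

print(scan_is_empty(shards, interfere=move_x))   # True -- yet the set was never empty!
print(any(shards.values()))                      # True -- "x" still exists
```

The scan swears the set is empty even though "x" existed at every instant, which is exactly the kind of interference a StableEmpty protocol must rule out.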

In a distributed system, concurrency is our nemesis, and it makes implementing this StableEmpty protocol very tricky due to many cornercases. Ernie Cohen designed a brilliant/elegant protocol to solve this. Ernie is a genius, who lives in an abstract astral plane many miles above us. My role was simply to translate his protocol into TLA+ to bring it down to a concrete plane where mere mortals like me could understand it. Again sorry that I cannot disclose the model.

The reason I am mentioning this problem is because it radically expanded my horizons on how far we can/should push the fine granularity of atomic actions. Of course the IsEmpty check is non-atomic, but Ernie also pushed the update/communication steps of the algorithm to be as fine-grained as possible so that there won't be a need for distributed locking, and the implementation can scale. The problem is that when you develop an algorithm with extremely fine-grained actions, the surface area for operation interleaving and interference explodes. That is why modeling the protocol in TLA+ and model-checking it for correctness becomes non-negotiable.

The anti-pattern here would be to attempt to implement pseudo-atomic checks via distributed locking, or to handle concurrent additions and deletions as ad-hoc operational edge cases; both are doomed to fail at large scale.

The Lesson: Choose atomicity carefully and push for the finest possible granularity using TLA+ modeling as an exploration tool. TLA+ forces you to define exactly what operations are atomic and helps you to model-check if the interleavings of these operations are safe.


5. PowerSet Paxos (2022)

I helped briefly with this model when my colleague (Chuck Carman) hit an exponential state-space explosion with his distributed transaction protocol, PowerSet Paxos. For a change, this time we have a testimonial from Chuck, in his own words.

"The first time I made a distributed transaction protocol, I did it by sitting in a coffee shop with an Excel sheet, trying to come up with sets of rules to evolve the rows over time. This took weeks after work. I hit on the core idea: metadata encoding partial causal orders writing overlapping sets of keys. I didn't trust my algorithm at all. It took another three months to refine the algorithm using a tool to make sure it was correct (TLA+). It took a week to translate the TLA+ algorithm to code. It took way more time to write the test code. Maybe 75-80% of the code is testing all the invariants the TLA+ spec had in it. The long pole there was creating a DSL in Java land to effectively test all of the invariants TLA+ checked. That took a month or two.

For PowerSet Paxos... We haven't had a transaction corrupted yet, and the team is learning Pluscal to apply it to the rest of the system where there are errors around state machine transitions. Much regret has been expressed around not modeling those parts in TLA+ so that the main implementations would be mostly error free."

The Lesson: Code is cheap, testing broken algorithms is expensive.

I hope Chuck can share this model someday, alongside a publication on this protocol. 


6. Secondary Index (2023)

When starting development on a secondary index for DSQL, the engineer who drafted the initial protocol wrote a 6-pager document, as is Amazon's tradition. But he kept finding cornercases with the indexing. At first, this problem did not look like a good fit for TLA+ modeling. After all, the indexing happens centrally at a node; the concurrency came from ongoing operations on the database, while the indexing was trying to catch up to existing work. This sounds more data-centric than operation/protocol-centric, so it didn't seem like an ideal fit. Here is the description, simplified from the patent that ended up being awarded for the final protocol:

"We initiate the creation of a unique secondary index on a live database without interrupting ongoing operations. To achieve this, the system backfills historical data up to a specific point in time while simultaneously applying all new, incoming updates directly to the index. Once the backfill finishes, we perform a final evaluation across the entire index to ensure no duplicate entries slipped through during this highly concurrent phase. If any unique constraint violations are detected, the system immediately flags the error and reports the exact cause."
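A toy sketch of that backfill-plus-live-updates idea (hypothetical data structures; the real protocol is far more involved and handles the two phases truly concurrently):

```python
# Build a unique index from a backfill snapshot plus live updates,
# then run the final evaluation for unique-constraint violations.

def build_unique_index(snapshot, live_updates):
    """Return (index, duplicates): index maps value -> set of row keys."""
    index = {}
    # Union the backfilled rows and the concurrently applied updates.
    for pk, val in snapshot + live_updates:
        index.setdefault(val, set()).add(pk)
    # Final evaluation: flag any value that more than one row maps to.
    duplicates = [val for val, pks in index.items() if len(pks) > 1]
    return index, duplicates

snapshot = [("r1", "alice"), ("r2", "bob")]          # backfilled history
live = [("r3", "carol"), ("r4", "bob")]              # r4 duplicates "bob"
_, dups = build_unique_index(snapshot, live)
print(dups)   # ['bob']
```

The interesting cornercases all live in the interleaving of the two phases, which is precisely what the TLA+ model enumerated.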

I was visiting Seattle offices, and I told him that we should try TLA+ modeling this given the large number of cornercases popping up. I then wrote a very crude initial model, and apparently that was enough to get him started. I was surprised to find that over the weekend, he had written variations of the model and made improvements on the model, without prior TLA+ experience. 

The anti-pattern here would be to design/grow the protocol by thinking in control flow and patching corner cases one by one as they arise. Using TLA+ forced a more declarative mathematical approach. It acted as a design accelerator, because it is faster to fix a conceptual model than to whack-a-mole corner cases in code.

The lesson: break out of the implementation mindset and search the protocol solution space.


7. LeaseGuard: Raft Leases Done Right (2024)

I joined MongoDB Research in 2024. MongoDB has a 10+ year history of TLA+ success, including logless dynamic reconfiguration, pull-based consensus replication, and the extreme modeling projects.

Leader lease design was my first project at MongoDB. We designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash. We wrote about it here. Please go read it, it is a really good read.

Since we are TLA+ fans, we knew the importance of getting started early on modeling the algorithm in TLA+. And this paid off big time: we discovered our two optimizations while writing the TLA+ spec for our initial crude concept of the algorithm. The inherited lease reads optimization was especially surprising to us; we probably wouldn't have realized it was possible if TLA+ wasn't helping us think. We also used TLA+ to check that LeaseGuard guaranteed Read Your Writes and other correctness properties.

The Lesson: Design discovery over verification. TLA+ is useful not just for checking the correctness of a completed design, but for revealing new insights while the design is in progress. Modeling in TLA+ actively informed our protocol and we discovered surprising optimizations by exploring the protocol in TLA+.


8. MongoDB Distributed Transactions Modeling (2025)

This was my second project at MongoDB. We also wrote a blog post on this here, so I am just going to cut to the chase.

In this project, we developed the first modular TLA+ specification of MongoDB's distributed transaction protocol. The model separates the sharded transaction logic from the underlying storage and replication behavior, which lets us reason about the protocol at a higher level while still capturing key cross-layer interactions. Using the model, we validated isolation guarantees, explored anomalies under different ReadConcern and WriteConcern settings, and clarified subtle issues such as interactions with prepared transactions. Our spec is available publicly on GitHub. 

This effort brought much-needed clarity to a big complex distributed transactions protocol. I believe this is the most detailed TLA+ model of a distributed transactions protocol in existence. While database/systems papers occasionally feature a TLA+ transaction model, those typically focus on a very narrow slice. I am not aware of any other model that captures distributed transactions at this level of granularity. A big value of our model is that it serves as a reference to answer questions about a protocol which span many teams, many years of development, and several software/service layers.

Furthermore, we used the TLA+ model traces from our spec to generate thousands of unit tests that exercise the actual WiredTiger implementation. This successfully bridged the gap between formal mathematical specification and concrete code.

The Lesson: Models can add value even retroactively, and can have a life beyond the initial design phase.



When I started writing this post this morning, I was originally planning to talk also about how to go about starting with your TLA+ modeling, and how things are/might-be changing in the post LLM world. But this post already got very long, and I will leave that for next time.

March 06, 2026

The first rule of database fight club: admit nothing

I am fascinated by tech marketing but would be lousy at it.

A common practice is to admit nothing -- my product, project, company, idea is perfect. And I get it because admitting something isn't perfect just provides fodder for marketing done by the other side, and that marketing is often done in bad faith.

But it is harder to fix things when you don't acknowledge the problems.

In the MySQL community we did a good job of acknowledging problems -- sometimes too good. For a long time as an external contributor I filed many bug reports, fixed some bugs myself and then spent much time marketing open bugs that I hoped would be fixed by upstream. Upstream wasn't always happy about my marketing, sometimes there was much snark, but snark was required because there was a large wall between upstream and the community. I amplified the message to be heard.

My take is that the MySQL community was more willing than the Postgres community to acknowledge problems. I have theories about that and I think several help to explain this:

  • Not all criticism is valid
    • While I spend much time with Postgres on benchmarks, I don't use it in production. I try to be fair and limit my feedback to things where I have sweat equity, but my perspective is skewed. This doesn't mean my feedback is wrong, but my context is different. And sometimes my feedback is wrong.
  • Bad faith
    • Some criticism is done in bad faith. By bad faith I mean that truth takes a back seat to scoring points. A frequent source of Postgres criticism is done to promote another DBMS. Recently I have seen much anti-Postgres marketing from MongoDB. I assume they encounter Postgres as competition more than they used to.
  • Good faith gone bad
    • Sometimes criticism given in good faith will be repackaged by others and used in bad faith. This happens with some of the content from my blog posts. I try to make this less likely by burying the lede in the details, but it still happens.
  • MySQL was more popular than Postgres until recently. 
    • Perhaps people didn't like that MySQL was getting most of the attention and admitting flaws might not help with adoption. But today the attention has shifted to Postgres so this justification should end. I still remember my amusement at a Postgres conference long ago when the speaker claimed that MySQL doesn't do web-scale. Also amusing was being told that Postgres didn't need per-page checksums because you should just use ZFS to get similar protection.
  • Single-vendor vs community
    • MySQL is a single-vendor project currently owned by Oracle. At times that enables an us-vs-them mentality (community vs corporation). The corporation develops the product and it is often difficult for the community to contribute. So it was easy to complain about problems, because the corporation was responsible for fixing them.
    • Postgres is developed by the community. There is no us vs them here, and the community is more reluctant to criticize the product (Postgres). This is human nature and I see variants of it at work -- my work colleagues are far more willing to be critical of the open-source projects we use than of the many internally developed projects.

Colorado SB26-051 Age Attestation

Colorado is presently considering a bill, SB26-051, patterned off of California’s AB1043, which establishes civil penalties for software developers who do not request age information for their users. The bills use a broad sense of “Application Store” which would seem to encompass essentially any package manager or web site one uses to download software—GitHub, Debian’s apt repos, Maven, etc. As far as I can tell, if someone under 18 were to run, say, a Jepsen test in California or Colorado, or use basically any Linux program, that could result in a $2500 fine. This (understandably!) has a lot of software engineers freaked out.

I reached out to the very nice folks at Colorado Representative Amy Paschal’s office, who understand exactly how bonkers this is. As they explained it, the Colorado Senate tried to adapt California’s bill closely in the hopes of building a consistent regulatory environment, but there weren’t really people with software expertise involved. Representative Paschal is a former software engineer herself, and was brought in to lend an expert opinion. She’s trying to amend the bill so it doesn’t, you know, outlaw most software. Her office has two recommendations for people concerned about the Colorado bill:

  1. Reach out to Colorado Senator Matt Ball, one of the Senate sponsors.

  2. Please be polite. Folks are understandably angry, but I get the sense that their staffers are taking a bit of a beating, and that’s probably not helping.

I’m not sure what to do with California’s AB 1043 just yet. I called some of the co-sponsors in the California Assembly, and they suggested emailing Samantha Huynh. I wrote her explaining the situation and asking if they had any guidance for how to comply with the bill, but I haven’t heard back yet.

Valkey and Redis Sorted Sets: Leaderboards and Beyond

  This blog post covers the details about sorted set use cases as discussed in this video. Sorted sets are one of the most powerful data structures in Valkey and Redis. While most developers immediately think about “gaming leaderboards” when they hear about sorted sets, this versatile data type can solve many problems, from task […]

March 05, 2026

Log Drains: Now available on Pro

Supabase Pro users can now send their Supabase logs to their own logging backend, enabling them to debug in the same place as the rest of their stack.

Building a Database on S3

Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed.


SQS as WAL and S3 as Pagestore

The 2008 S3 was painfully slow, and 100 ms reads weren't unusual. To hide that latency, the database separates "commit" from "apply". Clients write small, idempotent redo logs to Amazon Simple Queue Service (SQS) instead of touching S3 directly. An asynchronous checkpoint by a client applies those logs to B-tree pages on S3 later.
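This commit/apply split can be illustrated with a toy sketch (all names here are mine, not the paper's): an in-memory deque stands in for an SQS Pending Update queue, and a dict stands in for S3's pagestore. A commit only enqueues small redo records; a later checkpoint materializes pages.

```python
import collections

# Toy stand-ins for the paper's components (assumed names):
# pu_queue plays the role of an SQS Pending Update queue,
# pagestore plays the role of S3 holding B-tree pages.
pu_queue = collections.deque()
pagestore = {}  # page id -> {key: value}

def commit(records):
    """Commit = enqueue small redo log records; no S3 write on the hot path."""
    for rec in records:
        pu_queue.append(rec)

def checkpoint():
    """Apply = asynchronously drain logs and materialize pages on 'S3'."""
    while pu_queue:
        rec = pu_queue.popleft()
        pagestore.setdefault(rec["page"], {})[rec["key"]] = rec["value"]

commit([{"page": "page-1", "key": "a", "value": 1}])
# The commit returned immediately; the page appears only at checkpoint time.
checkpoint()
print(pagestore["page-1"]["a"])  # -> 1
```

The point of the split is that the expensive page write is off the commit path; readers tolerate stale pages until the next checkpoint catches up.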

This design shows strong parallels to modern disaggregated architectures. SQS becomes the write-ahead log (WAL) and logstore. S3 becomes the pagestore. Modern Aurora follows a similar logic: the log is replicated, and storage materializes pages independently. Granted, in Aurora the primary's write acknowledgment is synchronous after storage quorum replication, and Aurora does not rely on clients to pull and apply logs the way this 2008 system does, but the philosophy is identical.


Surviving SQS and building B-link Trees on S3

As mentioned above, to bypass the severe latency of writing full data pages directly to S3, clients commit transactions by shipping small redo log records to SQS queues. Subsequently, clients act as checkpointers, asynchronously pulling these queued logs and applying the updates to their local copies before writing the newly materialized B-tree pages back to S3. This asynchronous log-shipping model means B-tree pages on S3 can be arbitrarily out-of-date compared to the real-time logs in SQS. Working on such stale state seems impossible, but the authors bound the staleness: writers (and probabilistically readers) run asynchronous checkpoints that pull batches of logs from SQS and apply them to S3, keeping the database consistent despite delays.

SQS, however, throws a wrench in the works. I was initially very surprised by the paper’s description of SQS (the 2008 version). It said that a queue might hold 200 messages, but a client requesting 100 could randomly receive only 20. This is because, to provide low latency, SQS does a best-effort poll of a subset of its distributed servers and immediately returns whatever it finds. But don’t worry, the other messages aren’t lost; they sit on servers not checked in that round. The price of this low latency is that FIFO ordering isn’t guaranteed. The database handles this mess by making log records idempotent, ensuring that out-of-order or duplicate processing never corrupts data.
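A minimal sketch of how idempotent, version-stamped log records make duplicate and out-of-order delivery harmless (the record format here is my assumption, not the paper's exact layout):

```python
# Sketch of idempotent log application: duplicates are filtered by record id,
# and stale versions are ignored, so any delivery order yields the same state.
applied_ids = set()   # ids of log records already applied
page = {}             # key -> (version, value)

def apply_record(rec):
    if rec["id"] in applied_ids:        # duplicate delivery: no-op
        return
    applied_ids.add(rec["id"])
    cur = page.get(rec["key"])
    # Out-of-order delivery: keep the highest version, ignore stale ones.
    if cur is None or rec["version"] > cur[0]:
        page[rec["key"]] = (rec["version"], rec["value"])

log = [
    {"id": 2, "key": "x", "version": 2, "value": "new"},
    {"id": 1, "key": "x", "version": 1, "value": "old"},   # arrives late
    {"id": 2, "key": "x", "version": 2, "value": "new"},   # duplicate
]
for rec in log:
    apply_record(rec)
print(page["x"])  # -> (2, 'new') regardless of delivery order
```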

The commit protocol in the paper actually starts simple: clients send log records straight to Pending Update (PU) queues. But the problem with this naive direct-write approach is that if the client crashes mid-commit, only some records might make it to the queue, and this breaks atomicity. To fix this issue, the paper proposes an Atomicity protocol: clients first dump all logs plus a final “commit” token into a private ATOMIC queue, then push everything to the public PU queues. This guarantees all-or-nothing transactions, but it’s pricey, since every extra SQS message adds up. At $2.90 per 1,000 transactions, it’s almost twenty times the $0.15 of the naive direct-write approach. So here, consistency comes at a literal monetary cost!
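The ATOMIC-queue idea can be sketched as follows (a toy in-memory model under my own naming; the real protocol runs over SQS and has more failure handling). The trick is that the commit token is written last, so a crash mid-commit leaves logs without a token, and the forwarding step simply discards them.

```python
import collections

# Two-phase commit-to-queues sketch: logs go to a private ATOMIC queue ending
# in a commit token; only complete groups are forwarded to the public PU queue.
atomic_queue = collections.deque()   # private, per-client
pu_queue = collections.deque()       # public Pending Update queue

def commit(txn_id, records):
    for rec in records:
        atomic_queue.append(("log", txn_id, rec))
    atomic_queue.append(("commit", txn_id, None))   # commit token goes last

def forward():
    """Push only transactions whose commit token made it into the queue."""
    committed = {t for kind, t, _ in atomic_queue if kind == "commit"}
    while atomic_queue:
        kind, txn_id, rec = atomic_queue.popleft()
        if kind == "log" and txn_id in committed:
            pu_queue.append(rec)

commit("t1", ["r1", "r2"])
atomic_queue.append(("log", "t2", "r3"))   # t2 crashed before its token
forward()
print(list(pu_queue))  # -> ['r1', 'r2']; t2's partial write is dropped
```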

The big picture here is about how brutally complex it is to build a real database on dumb cloud primitives. They had to implement Record Managers, Page Managers, and buffer pools entirely on the client side, in order to cluster tiny records into pages. For distributed coordination, they hack SQS into a locking system with dedicated LOCK queues and carefully timed tokens. On top of that, they have to handle SQS quirks, with idempotent log records as we discussed above. The engineering effort is massive.

Finally, to address the slow and weakly consistent S3 reads, the database leans on lock-free B-link trees. That lets readers keep moving while background checkpoints/updates by clients split or reorganize index pages. In B-link trees, each node points to its right sibling. If a checkpoint splits a page, readers just follow the pointer without blocking. Since update corruption is still a risk, a LOCK queue token ensures only one thread checkpoints a specific PU queue at a time. (I told you this is complicated.) The paper admits this is a serious bottleneck: hot-spot objects updated thousands of times per second simply can’t scale under this design.
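The right-sibling trick can be sketched in a few lines (a simplified single-level model, not the paper's full B-link implementation): every node carries a high key and a pointer to its right sibling, so a reader that lands on a just-split node chases the link instead of blocking on the splitter.

```python
# Minimal B-link search sketch: nodes know their key upper bound (high_key)
# and their right sibling, so concurrent splits never strand a reader.
class Node:
    def __init__(self, keys, high_key, right=None):
        self.keys = keys          # sorted keys stored in this leaf
        self.high_key = high_key  # upper bound of keys this node may hold
        self.right = right        # pointer to the right sibling

def search(node, key):
    # If the key is beyond this node's range (a concurrent split moved it),
    # follow right-sibling pointers until the correct node is reached.
    while key > node.high_key and node.right is not None:
        node = node.right
    return key in node.keys

# A split moved keys >= 50 to a new sibling, but a reader still holds `left`.
right = Node([50, 70], high_key=100)
left = Node([10, 30], high_key=49, right=right)
print(search(left, 70))  # -> True: reader followed the link, no lock needed
```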


Isolation guarantees

In order to prioritize extreme availability, the system throws traditional isolation guarantees out the window. The paper says ANSI SQL-style isolation and strict consistency cannot survive at scale in this architecture. The atomicity protocol prevents dirty reads by ensuring only fully committed logs leave a client’s private queue, but commit-time read-write and write-write conflicts are ignored entirely! If two clients hit the same record, the last writer wins, so lost updates are common. To make this usable, the authors push consistency up to the client. To ensure monotonic reads, each client tracks the highest commit timestamp it has seen, and if S3 serves any older version it rejects it and rereads. For monotonic writes, the client stamps version counters on log records and page headers. Checkpoints sort logs and defer any out-of-order SQS messages so each client’s writes stay in order.
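The client-side monotonic-read check can be sketched as follows (the field names and the fetch callback are my assumptions): the client remembers the highest commit timestamp it has observed per key and rejects anything older that eventually consistent storage hands back, forcing a reread.

```python
# Sketch of client-enforced monotonic reads over an eventually consistent store.
class Client:
    def __init__(self):
        self.seen = {}   # key -> highest commit timestamp observed

    def read(self, key, fetch):
        """fetch() simulates an S3 GET returning (timestamp, value),
        possibly a stale version."""
        while True:
            ts, value = fetch()
            if ts >= self.seen.get(key, 0):
                self.seen[key] = ts
                return value
            # Stale: older than a version we already saw -- reject and reread.

versions = iter([(1, "old"), (2, "new")])   # first GET returns a stale version
c = Client()
c.seen["k"] = 2          # client has already observed commit timestamp 2
result = c.read("k", lambda: next(versions))
print(result)  # -> 'new' after one reread
```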

I was also surprised by the discussion of stronger isolation in the paper. The paper claims snapshot isolation hasn’t been implemented in distributed systems yet, because it strictly requires a centralized global counter to serialize transactions. This is flagged as a fatal bottleneck and single point of failure.

Looking back, we find this claim outdated. Global counters aren’t a bottleneck for Snapshot Isolation: Amazon Aurora stamps transactions with a Global Log Sequence Number (GLSN) via a primary writer, but still scales (vertically) cleanly without slowing disaggregated storage. More importantly, modern distributed database systems implement snapshot isolation, using loosely synchronized physical clocks (and hybrid logical clocks) to give global ordering with no centralized counter at all. Thank God for synchronized clocks!


Conclusion

While the paper had to work around the messy 2008 cloud environment, it remains impressive for showing how to build a serverless database architecture on dumb object storage. In recent years, S3 has become faster, and in 2020 it gained strong read-after-write consistency for all PUTs and DELETEs. This made it much easier to build databases (especially for analytical workloads) over S3 directly, and this led to the modern data lake and lakehouse paradigms. We can say this paper laid some groundwork for systems like Databricks (Delta Lake), Apache Iceberg, and Snowflake.