a curated list of database news from authoritative sources

November 14, 2025

A Tale of Two Databases: No-Op Updates in PostgreSQL and MySQL

“I’m lazy when I’m speakin’ / I’m lazy when I walk / I’m lazy when I’m dancin’ / I’m lazy when I talk” (X-Press 2 feat. David Byrne, “Lazy”)

While preparing a blog post to compare how PostgreSQL and MySQL handle locks, as part of a series covering the different approaches to MVCC for these databases, […]

November 13, 2025

Distributing Data in a Redis/Valkey Cluster: Slots, Hash Tags, and Hot Spots

When scaling Redis or its open source fork Valkey, a single instance can become a bottleneck. The solution is to create a sharded cluster that partitions data across multiple nodes. Understanding how this partitioning works is crucial for designing efficient, scalable applications. This article explores the mechanics of key distribution, the use of […]

November 12, 2025

PostgreSQL OIDC Authentication with pg_oidc_validator

Among the new features introduced in PostgreSQL 18 is support for OAuth-based authentication. This opened the door for the community to create extensions that integrate systems providing Single Sign-On (SSO) through OAuth 2.0 authentication with PostgreSQL. The reason this integration was not added directly to the core of PostgreSQL is due to the particularities found in those […]

November 11, 2025

Disaggregated Database Management Systems

This paper is based on a panel discussion from the TPC Technology Conference 2022. It surveys how cloud hardware and software trends are reshaping database system architecture around the idea of disaggregation.

For me, the core action is in Section 4: Disaggregated Database Management Systems. Here the paper discusses three case studies (Google AlloyDB, Rockset, and Nova-LSM) to give a taste of the software side of the movement. Of course there are many more. You can find reviews of Aurora, Socrates, Taurus, and TaurusMM on my blog. In addition, Amazon DSQL (which I worked on) is worth discussing soon. I’ll also revisit the PolarDB series of papers, which trace a fascinating arc from active log-replay storage toward simpler, compute-driven designs. Alibaba has been prolific in this space, but the direction they are ultimately advocating remains muddled across publications, reflecting conflicting goals and priorities.


AlloyDB

AlloyDB extends PostgreSQL with compute–storage disaggregation and HTAP support. Figure 4 in the paper shows its layered design: the primary node (RW node) handles writes, a set of read pool replicas (RO nodes) provide scalable reads, and a shared distributed storage engine persists data in Google's Colossus file system. The read pools can be elastically scaled up or down with no data movement, because the data lives in disaggregated storage.

AlloyDB's hybrid nature enables it to combine transactional and analytical processing by maintaining both a row cache and a pluggable columnar engine. The columnar engine vectorizes execution and automatically converts hot data into columnar format when it benefits analytic queries.

Under the covers, the database storage engine materializes pages from logs and stores blocks on Colossus. Logs are written to regional log storage; log-processing servers (LPS) continuously replay and materialize pages in the zones where compute nodes run. Durability and availability are decoupled: the logs are durable in regional log storage, while LPS workers ensure the blocks are always available near the compute.
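
As a rough illustration of the replay idea (the types and names below are mine, not AlloyDB's), an LPS keeps a block current by applying redo records in LSN order on top of the last materialized image:

#include <cstdint>
#include <vector>

// Illustrative sketch only: a block image plus the LSN up to which it has
// been materialized. The redo payload and its application are elided.
struct LogRecord { uint64_t lsn; /* redo payload */ };
struct Page      { uint64_t appliedLsn = 0; /* block image */ };

// Apply records in LSN order; records at or below the page's LSN are no-ops,
// which keeps replay idempotent after an LPS restart.
void replay(Page& page, const std::vector<LogRecord>& records) {
   for (const LogRecord& rec : records) {
      if (rec.lsn <= page.appliedLsn) continue;   // already reflected in the image
      // ... apply rec's redo payload to the block image ...
      page.appliedLsn = rec.lsn;
   }
}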

This is a nice example of disaggregation serving elasticity and performance: compute scales independently and HTAP workloads benefit from a unified, multi-format cache hierarchy.


Rockset

Rockset seems to be a poster child for disaggregation in real-time analytics. Rockset's architecture follows the Aggregator–Leaf–Tailer (ALT) pattern (Figure 6). ALT separates compute for writes (Tailers), compute for reads (Aggregators and Leaves), and storage. Tailers fetch new data from sources such as Kafka or S3. Leaves index that data into multiple index types (columnar, inverted, geo, document). Aggregators then run SQL queries on top of those indexes, scaling horizontally to serve high-concurrency, low-latency workloads.

The key insight is that real-time analytics demands strict isolation between writes and reads. Ingest bursts must not impact query latencies. Disaggregation makes that possible by letting each tier scale independently: more Tailers when ingest load spikes, more Aggregators when query demand surges, and more Leaves as data volume grows.

Rockset also shows why LSM-style storage engines (and append-only logs in general) are natural fits for disaggregation. RocksDB-Cloud never mutates SST files after creation. All SSTs are immutable and stored in cloud object stores like S3. This makes them safely shareable across servers. A compaction job can be sent from one server to another: server A hands the job to a stateless compute node B, which fetches SSTs, merges them, writes new SSTs to S3, and returns control. Storage and compaction compute are fully decoupled.
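
To make the "immutable SSTs make compaction relocatable" point concrete, here is a toy sketch (this is not RocksDB-Cloud code; the Sst type below just stands in for an immutable file in S3): compaction is a pure function from input files to an output file, so any stateless worker can execute it.

#include <map>
#include <string>
#include <vector>

// Toy stand-in for an immutable, sorted SST file stored in an object store.
using Sst = std::map<std::string, std::string>;   // key -> value, sorted

// Runs on a remote, stateless compute node: it only reads immutable inputs
// and produces a new immutable output, so no state from the original server
// is needed. Inputs are assumed to be ordered oldest to newest.
Sst compact(const std::vector<Sst>& inputs) {
   Sst merged;
   for (const Sst& sst : inputs)             // oldest first
      for (const auto& kv : sst)
         merged[kv.first] = kv.second;       // newer files overwrite older keys
   return merged;                            // uploaded as a new SST, e.g. to S3
}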


Memory Disaggregation

The panel also discussed disaggregated memory as an emerging frontier. Today's datacenters waste over half their DRAM capacity due to static provisioning. It's shocking, no? RDMA-based systems like Redy have shown that remote memory can be used elastically to extend caches. The paper looks ahead to CXL as the next step, since its coherent memory fabric can make remote memory behave like local memory. CXL promises fine-grained sharing and coherence.


Hardware Disaggregation

On the hardware side, the paper surveys how storage, GPUs, and memory are being split from servers and accessed via high-speed fabrics. An interesting case study here is Fungible's DPU-based approach. The DPU offloads data-centric tasks (networking, storage, security) from CPUs, enabling server cores to focus solely on application logic. In a way, the DPU is a hardware embodiment of disaggregation.


Future Directions

Disaggregated databases are already here. Yet there are still many open questions.

  • How do we automatically assemble microservice DBMSs on demand, choosing the right compute, memory, and storage tiers for a workload?
  • How do we co-design software and hardware across fabrics like CXL to avoid data movement while preserving performance isolation?
  • How do we verify the correctness of such dynamic compositions?
  • Can a DBMS learn to reconfigure itself (rebalancing compute and storage) to stay optimal under changing workload patterns?
  • How do we deal with fault-tolerance and availability issues, and develop new distributed systems protocols that exploit the opportunities that open up in the disaggregated model?

As Swami said at the SIGMOD 2023 panel: "The customer value is here, and the technical problems will be solved in time. Thanks to the complexities of disaggregation problems, every database/systems assistant professor is going to get tenure figuring how to solve them."

How does it scale? The most basic benchmark on MongoDB

Choosing a database requires ensuring that performance remains fast as your data grows. For example, if a query takes 10 milliseconds on a small dataset, it should still be quick as the data volume increases and should never approach the 100ms threshold that users perceive as waiting. Here’s a simple benchmark: we insert batches of 1,000 operations into random accounts, then query the account with the most recent operation in a specific category—an OLTP scenario using filtering and pagination. As the collection grows, a full collection scan would slow down, so secondary indexes are essential.

We create an accounts collection, where each account belongs to a category and holds multiple operations—a typical one-to-many relationship, with an index for our query on operations per category:

db.accounts.createIndex({
  category: 1,
  "operations.date": 1,
  "operations.amount": 1,
});

To increase data volume, this function inserts operations into accounts (randomly distributed to ten million accounts over three categories):

function insert(num) {
  const ops = [];
  for (let i = 0; i < num; i++) {
    const account  = Math.floor(Math.random() * 10_000_000) + 1;
    const category = Math.floor(Math.random() * 3);
    const operation = {
      date: new Date(),
      amount: Math.floor(Math.random() * 1000) + 1,
    };
    ops.push({
      updateOne: {
        filter: { _id: account },
        update: {
          $set: { category: category },
          $push: { operations: operation },
        },
        upsert: true,
      }
    });
  }
  db.accounts.bulkWrite(ops);
}

This adds 1,000 operations and should take less than one second:

let time = Date.now();
insert(1000);
console.log(`Elapsed ${Date.now() - time} ms`);

A typical query fetches the account, in a category, that had the latest operation:

function query(category) {
  return db.accounts.find(
    { category: category },
    { "operations.amount": 1 , "operations.date": 1 }
  )
    .sort({ "operations.date": -1 })
    .limit(1);
}

Such a query should take a few milliseconds:

let time = Date.now();
print(query(1).toArray());
console.log(`Elapsed ${Date.now() - time} ms`);

I repeatedly insert new operations in batches of one thousand, and measure the time taken by the query while the collection grows, stopping once I reach one billion operations randomly distributed across the accounts:

for (let i = 0; i < 1_000_000; i++) { 
  // more data  
  insert(1000);  
  // same query
  const start = Date.now();  
  const results = query(1).toArray();  
  const elapsed = Date.now() - start;  
  print(results);  
  console.log(`Elapsed ${elapsed} ms`);  
}  
console.log(`Total accounts: ${db.accounts.countDocuments()}`);  

In a scalable database, the response time should not increase significantly while the collection grows. I've run this on MongoDB, and the response time stays in single-digit milliseconds. I've also run it on an Oracle Autonomous Database with the MongoDB emulation, but I can't publish those results, as Oracle Corporation forbids the publication of database benchmarks (the DeWitt Clause).

You can copy/paste this test and watch the elapsed time while the data is growing, on your own infrastructure.

November 10, 2025

The Future of Fact-Checking is Lies, I Guess

Last weekend I was trying to pull together sources for an essay and kept finding “fact check” pages from factually.co. For instance, a Kagi search for “pepper ball Chicago pastor” returned this Factually article as the second result:

Fact check: Did ice agents shoot a pastor with pepperballs in October in Chicago

The claim that “ICE agents shot a pastor with pepperballs in October” is not supported by the available materials supplied for review; none of the provided sources document a pastor being struck by pepperballs in October, and the only closely related reported incident involves a CBS Chicago reporter’s vehicle being hit by a pepper ball in late September [1][2]. Available reports instead describe ICE operations, clergy protests, and an internal denial of excessive force, but they do not corroborate the specific October pastor shooting allegation [3][4].

Here’s another “fact check”:

Fact check: Who was the pastor shot with a pepper ball by ICE

No credible reporting in the provided materials identifies a pastor who was shot with a pepper‑ball by ICE; multiple recent accounts instead document journalists, protesters, and community members being hit by pepper‑ball munitions at ICE facilities and demonstrations. The available sources (dated September–November 2025) describe incidents in Chicago, Los Angeles and Portland, note active investigations and protests, and show no direct evidence that a pastor was targeted or injured by ICE with a pepper ball [1] [2] [3] [4].

These certainly look authoritative. They’re written in complete English sentences, with professional diction and lots of nods to neutrality and skepticism. There are lengthy, point-by-point explanations with extensively cited sources. The second article goes so far as to suggest “who might be promoting a pastor-victim narrative”.

The problem is that both articles are false. This story was broadly reported, as in this October 8th Fox News article unambiguously titled “Video shows federal agent shoot Chicago pastor in head with pepper ball during Broadview ICE protest”. DHS Assistant Secretary Tricia McLaughlin even went on X to post about it. This event definitely happened, and it would not have been hard to find coverage at the time these articles were published. It was, quite literally, all over the news.

Or maybe they’re sort of true. Each summary disclaims that its findings are based on “the available materials supplied for review”, or “the provided materials”. This is splitting hairs. Source selection is an essential part of the fact-checking process, and Factually selects its own sources in response to user questions. Instead of finding authoritative sources, Factually selected irrelevant ones and spun them into a narrative which is the opposite of true. Many readers will not catch this distinction. Indeed, I second-guessed myself when I saw the Factually articles—and I read the original reporting when it happened.

“These conversations matter for democracy,” says the call-to-action at the top of every Factually article. The donation button urges readers to “support independent reporting.”

But this is not reporting. Reporters go places and talk to people. They take photographs and videos. They search through databases, file FOIA requests, read court transcripts, evaluate sources, and integrate all this with an understanding of social and historical context. People go to journalism school to do this.

What Factually does is different. It takes a question typed by a user and hands it to a Large Language Model, or LLM, to generate some query strings. It performs up to three Internet search queries, then feeds the top nine web pages it found to an LLM, and asks a pair of LLMs to spit out some text shaped like a fact check. This text may resemble the truth, or—as in these cases—utterly misrepresent it.

Calling Factually’s articles “fact checks” is a category error. A fact checker diligently investigates a contentious claim, reasons about it, and ascertains some form of ground truth. Fact checkers are held to a higher evidentiary standard; they are what you rely on when you want to be sure of something. The web pages on factually.co are fact-check-shaped slurry, extruded by a statistical model which does not understand what it is doing. They are fancy Mad Libs.

Sometimes the Mad Libs are right. Sometimes they’re blatantly wrong. Sometimes it is clear that the model simply has no idea what it is doing, as in this article where Factually is asked whether it “creates fake fact-checking articles”, and in response turns to web sites like Scam Adviser, which evaluates site quality based on things like domain age and the presence of an SSL certificate, or Scam Detector, which looks for malware and phishing. Neither of these sources has anything to do with content accuracy. When asked if Factually is often incorrect (people seem to ask this a lot), Factually’s LLM process selects sources like DHS Debunks Fake News Media Narratives from June, Buzzfeed’s 18 Science Facts You Believed in the 1990s That Are Now Totally Wrong, and vocabulary.com’s definition of the word “Wrong”. When asked about database safety, Factually confidently asserts that “Advocates who state that MongoDB is serializable typically refer to the database’s support for snapshot isolation,” omitting that Snapshot Isolation is a completely different, weaker property. Here’s a Factually article on imaginary “med beds” which cites this incoherent article claiming to have achieved quantum entanglement via a photograph. If a real fact checker shows you a paper like this with any degree of credulousness, you can safely ignore them.

The end result of this absurd process is high-ranking, authoritative-sounding web pages which sometimes tell the truth, and sometimes propagate lies. Factually has constructed a stochastic disinformation machine which exacerbates the very problems fact-checkers are supposed to solve.

Please stop doing this.

Comparing Integers and Doubles

During automated testing we stumbled upon a problem that boiled down to transitive comparisons: if a=b and a=c, we assumed that b=c. Unfortunately that is not always the case, at least not in all systems. Consider the following SQL query:

select a=b, a=c, b=c
from (values(
   1234567890123456789.0::double precision,
   1234567890123456788::bigint,
   1234567890123456789::bigint)) s(a,b,c)

If you execute that in Postgres (or DuckDB, or SQL Server, or ...) the answer is (true, true, false). That is, the comparison is not transitive! Why does that happen? When these systems compare a bigint and a double, they promote the bigint to double and then compare. But a double has only 52 bits of mantissa, which means it will lose precision when promoting large integers to double, producing false positives in the comparison.

This behavior is highly undesirable, first because it confuses the optimizer, and second because (at least in our system) joins work very differently: Hash joins promote to the most restrictive type and discard all values that cannot be represented, as they can never produce a join partner. For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.

How should we compare correctly? Conceptually the situation is clear: an IEEE 754 floating point with sign s, mantissa m, and exponent e represents the value (-1)^s*m*2^e, and we just have to compare the integer with that value. But there is no easy way to do that: if we do an int/double comparison in, e.g., C++, the compiler does the same promotion to double, messing up the comparison.

We can get the logic right by doing two conversions: we first convert the int to double and compare that. If the values are not equal, the order is clear and we can use it. Otherwise, we convert the double back to an integer, check whether the conversion rounded up or down, and handle the result accordingly. Add some extra checks to avoid undefined behavior (the conversion of intmax64->double->int64 is not defined) and to handle non-finite values, and we get:

int cmpDoubleInt64(double a, int64_t b) {
   // handle intmax and nan
   if (!(a<=0x1.fffffffffffffp+62)) return 1;

   // fast path comparison
   double bd = b;
   if (a!=bd) return (a<bd)?-1:1;

   // handle loss of precision
   int64_t ai = a;
   if (ai!=b) return (ai<b)?-1:1;
   return 0;
}

Which is the logic that we now use. Who else does it correctly? Perhaps somewhat surprisingly, Python and SQLite. The other database systems (and programming languages) that we tested all lost precision during the comparison, leading to tons of problems. IMHO a proper int/double comparison should be available in every programming language, at least as a library function. But in most languages (and DBMSes) it isn't. You can use the code above if you ever run into this problem.
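
For a quick sanity check, here is a minimal usage example, compiled together with the function above (note that the double literal 1234567890123456789.0 is actually stored as 1234567890123456768, the nearest representable double):

#include <cstdint>
#include <cstdio>

int cmpDoubleInt64(double a, int64_t b);   // the function from above

int main() {
   double  a = 1234567890123456789.0;      // stored as 1234567890123456768
   int64_t b = 1234567890123456788;
   int64_t c = 1234567890123456789;

   // Naive promotion claims a==b and a==c even though b!=c.
   std::printf("naive: %d %d\n", (int)(a == (double)b), (int)(a == (double)c));  // 1 1

   // The two-step comparison orders them correctly: a < b < c.
   std::printf("cmp:   %d %d\n", cmpDoubleInt64(a, b), cmpDoubleInt64(a, c));    // -1 -1
   return 0;
}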

Taurus MM: A Cloud-Native Shared-Storage Multi-Master Database

This VLDB'23 paper presents Taurus MM, Huawei's cloud-native, multi-master OLTP database built to scale write throughput in clusters of 2 to 16 masters. It extends the single-master TaurusDB design (which we reviewed yesterday) to multiple masters while keeping its shared-storage architecture with separate compute and storage layers. Each master maintains its own write-ahead log (WAL) and executes transactions independently; there are no distributed transactions. All masters share the same Log Stores and Page Stores, and data is coordinated through new algorithms that reduce network traffic and preserve strong consistency.

The system uses pessimistic concurrency control to avoid frequent aborts on contended workloads. Consistency is maintained through two complementary mechanisms: a new clock design that makes causal ordering efficient, and a new hybrid locking protocol that cuts coordination cost.


Vector-Scalar (VS) Clocks

A core contribution is the Vector-Scalar (VS) clock, a new type of logical clock that combines the compactness of Lamport clocks with the causal precision/completeness of vector clocks.

Ordinary Lamport clocks are small but they fail to capture causality fully, in both directions. Vector clocks capture causality fully, but scale poorly. An 8-node vector clock adds 64 bytes to every message or log record, which turns into a prohibitive cost when millions of short lock and log messages per second are exchanged in a cluster. Taurus MM solves this by letting the local component of each node's VS clock behave like a Lamport clock, while keeping the rest of the vector to track other masters' progress. This hybrid makes the local counter advance faster (it reflects causally related global events, not just local ones) yet still yields vector-like ordering when needed.

VS clocks can stamp messages either with a scalar or a vector timestamp depending on context. Scalar timestamps are used when causality is already known, such as for operations serialized by locks or updates to the same page. Vector timestamps are used when causality is uncertain, such as across log flush buffers or when creating global snapshots.

I really like the VS clocks algorithm, and how it keeps most timestamps compact while still preserving ordering semantics. It's conceptually related to Hybrid Logical Clocks (HLC) in that it keeps per-node clock values close and comparable, but VS clocks are purely logical, driven by Lamport-style counters instead of synchronized physical time. The approach enables rapid creation of globally consistent snapshots and reduces timestamp size and bandwidth consumption by up to 60%.
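
To make the idea concrete, here is a rough sketch of how I picture the update rules (my reconstruction, not the paper's pseudocode): the local slot advances Lamport-style on every event and receipt, while the remaining slots are only merged when a full vector timestamp arrives.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of a Vector-Scalar (VS) clock for master `id` in an n-node cluster.
struct VSClock {
   int id;                        // this master's index
   std::vector<uint64_t> v;       // v[id] behaves like a Lamport clock

   VSClock(int id_, int n) : id(id_), v(n, 0) {}

   // Local event (e.g., generating a log record): bump the scalar slot.
   uint64_t tick() { return ++v[id]; }

   // Message stamped with only a scalar timestamp: used when causality is
   // already known (e.g., operations serialized by a lock or on one page).
   void recvScalar(uint64_t ts) { v[id] = std::max(v[id], ts) + 1; }

   // Message stamped with a full vector timestamp: used when causality is
   // uncertain (e.g., across log flush buffers or for global snapshots).
   void recvVector(const std::vector<uint64_t>& ts) {
      uint64_t hi = v[id];
      for (size_t i = 0; i < v.size(); ++i) {
         v[i] = std::max(v[i], ts[i]);
         hi = std::max(hi, v[i]);
      }
      v[id] = hi + 1;   // the local slot stays ahead of everything it has seen
   }
};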

I enjoyed the paper's pedagogical style in Section 5, as it walks the reader through deciding whether each operation needs scalar or vector timestamps. This makes it clear how we can enhance efficiency by applying the right level of causality tracking to each operation.


Hybrid Page-Row Locking

The second key contribution is a hybrid page-row locking protocol. Taurus MM maintains a Global Lock Manager (GLM) that manages page-level locks (S and X) across all masters. Each master also runs a Local Lock Manager (LLM) that handles row-level locks independently once it holds the covering page lock.

The GLM grants page locks, returning both the latest page version number and any row-lock info. Once a master holds a page lock, it can grant compatible row locks locally without contacting the GLM. When the master releases a page, it sends back the updated row-lock state so other masters can reconstruct the current state lazily.

Finally, row-lock changes don't need to be propagated immediately and are piggybacked on the page lock release flow. This helps reduce lock traffic dramatically. The GLM only intervenes when another master requests a conflicting page lock.
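
Here is how I picture the flow on a single master, as a hedged sketch rather than the paper's protocol (the GLM calls are stubbed out and all names are mine):

#include <cstdint>
#include <map>
#include <set>

enum class Mode { S, X };

// Row locks this master may grant locally while it holds the covering page lock.
using RowLocks = std::map<uint64_t, Mode>;

struct Master {
   std::set<uint64_t> heldPages;             // page locks granted by the GLM
   std::map<uint64_t, RowLocks> localRows;   // LLM state, keyed by page

   void lockRow(uint64_t pageId, uint64_t rowId, Mode m) {
      if (!heldPages.count(pageId)) {
         // One round trip to the Global Lock Manager; the reply carries the
         // latest page version number and the current row-lock state.
         acquirePageFromGLM(pageId, m);
         heldPages.insert(pageId);
      }
      // Covered by the page lock: grant locally, no GLM traffic.
      localRows[pageId][rowId] = m;
   }

   void releasePage(uint64_t pageId) {
      // Row-lock changes are piggybacked on the page-lock release.
      sendRowLockStateToGLM(pageId, localRows[pageId]);
      localRows.erase(pageId);
      heldPages.erase(pageId);
   }

   // Hypothetical RPC stubs; the real messages are not modeled here.
   void acquirePageFromGLM(uint64_t /*pageId*/, Mode /*m*/) {}
   void sendRowLockStateToGLM(uint64_t /*pageId*/, const RowLocks& /*rows*/) {}
};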

This separation of global page locks and local row locks resembles our 2014 Panopticon work, where we combined global visibility and local autonomy to limit coordination overhead.


Physical and Logical Consistency

Taurus MM distinguishes between physical and logical consistency. Physical consistency ensures structural correctness of pages. The master groups log records into log flush buffers (LFBs) so that each group ends at a physically consistent point (e.g., a B-tree split updates parent and children atomically within LFB bounds). Read replicas apply logs only up to group boundaries, avoiding partial structural states without distributed locks.

Logical consistency ensures isolation-level correctness for user transactions (Repeatable Read isolation). Row locks are held until commit, while readers can use consistent snapshots without blocking writers.


Ordering and Replication

Each master periodically advertises the location of its latest log records to all others in a lightweight, peer-to-peer fashion. This mechanism is new in Taurus MM. In single-master TaurusDB, the metadata service (via Metadata PLogs) tracked which log segments were active, but not the current write offsets within them (the master itself notified read replicas of the latest log positions). In Taurus MM, with multiple masters generating logs concurrently, each master broadcasts its current log positions to the others, avoiding a centralized metadata bottleneck.

To preserve global order, each master groups its recent log records (updates from multiple transactions and pages) into a log flush buffer (LFB) before sending it to the Log and Page Stores. Because each LFB may contain updates to many pages, different LFBs may touch unrelated pages. It becomes unclear which buffer depends on which, so the system uses vector timestamps to capture causal relationships between LFBs produced on different masters. Each master stamps an LFB with its current vector clock and also includes the timestamp of the previous LFB, allowing receivers to detect gaps or missing buffers. When an LFB reaches a Page Store, though, this global ordering is no longer needed. The Page Store processes each page independently, and all updates to a page are already serialized by that page's lock and carry their own scalar timestamps (LSNs). The Page Store simply replays each page's log records in increasing LSN order, ignoring the vector timestamp on the LFB. In short, vector timestamps ensure causal ordering across masters before the LFB reaches storage, and scalar timestamps ensure correct within-page ordering after.
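
A minimal sketch of the metadata an LFB would carry under this scheme (the field names are mine, not the paper's):

#include <cstdint>
#include <vector>

struct PageLogRecord {
   uint64_t pageId;
   uint64_t lsn;        // scalar timestamp: orders updates within one page
   // ... redo payload ...
};

struct LogFlushBuffer {
   std::vector<uint64_t> vts;       // vector timestamp: causal order across masters
   std::vector<uint64_t> prevVts;   // timestamp of this master's previous LFB,
                                    // letting receivers detect gaps or missing LFBs
   std::vector<PageLogRecord> records;
};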

For strict transaction consistency, a background thread exchanges full vector (VS) timestamps among masters to ensure that every transaction sees all updates committed before it began. A master waits until its local clock surpasses this merged/pairwise-maxed timestamp before serving the read in order to guarantee a globally up-to-date view. If VS were driven by physical rather than purely logical clocks, these wait times could shrink further.


Evaluation and Takeaways

Experiments on up to eight masters show good scaling on partitioned workloads and performance advantages over both Aurora Multi-Master (shared-storage, optimistic CC) and CockroachDB (shared-nothing, distributed commit).

The paper compares Taurus MM with CockroachDB using TPC-C–like OLTP workloads. CockroachDB follows a shared-nothing design, with each node managing its own storage and coordinating writes through per-key Raft consensus. Since Taurus MM uses four dedicated nodes for its shared storage layer, while CockroachDB combines compute and storage on the same nodes, the authors matched configurations by comparing 2 and 8 Taurus masters with 6- and 12-node CockroachDB clusters, respectively. For CockroachDB, they used its built-in TPC-C–like benchmark; for Taurus MM, the Percona TPC-C variant with zero think/keying time. Results for 1000 and 5000 warehouses show Taurus MM delivering 60% to 320% higher throughput and lower average and 95th-percentile latencies. The authors also report scaling efficiency, showing both systems scaling similarly on smaller datasets (1000 warehouses), but CockroachDB scaling slightly more efficiently on larger datasets with fewer conflicts. They attribute this to CockroachDB’s distributed-commit overhead, which dominates at smaller scales but diminishes once transactions touch only a subset of nodes, whereas Taurus MM maintains consistent performance by avoiding distributed commits altogether.

Taurus MM shows that multi-master can work in the cloud if coordination is carefully scoped. The VS clock is a general and reusable idea, as it provides a middle ground between Lamport and vector clocks. I think VS clocks are useful for other distributed systems that need lightweight causal ordering across different tasks/components.

But is the additional complexity worth it for these workloads? Few workloads may truly demand concurrent writes across primaries. Amazon Aurora famously abandoned its own multi-master mode. Still, from a systems-design perspective, Taurus MM contributes a nice architectural lesson.

November 09, 2025

Taurus Database: How to be Fast, Available, and Frugal in the Cloud

This SIGMOD’20 paper presents TaurusDB, Huawei's disaggregated MySQL-based cloud database. TaurusDB refines the disaggregated architecture pioneered by Aurora and Socrates, and provides a simpler and cleaner separation of compute and storage. 

In my writeup on Aurora, I discussed how the "log is the database" approach reduces network load, since the compute primary only sends logs and the storage nodes apply them to reconstruct pages. But Aurora did conflate durability and availability somewhat, using quorum-based replication across six replicas for both logs and pages.

In my review of Socrates, I explained how Socrates (Azure SQL Cloud) separates durability and availability by splitting the system into four layers: compute, log, page, and storage. Durability (logs) ensures data is not lost after a crash. Availability (pages/storage) ensures data can still be served while some replicas or nodes fail. Socrates stores pages separately from logs to improve performance but the excessive layering introduces significant architectural overhead.

Taurus takes this further and uses different replication and consistency schemes for logs and pages, exploiting their distinct access patterns. Logs are append-only and used for durability. Log records are independent, so they can be written to any available Log Store nodes. As long as three healthy Log Stores exist, writes can proceed without quorum coordination. Pages, however, depend on previous versions. A Page Store must reconstruct the latest version by applying logs to old pages. To leverage this asymmetry, Taurus uses synchronous, reconfigurable replication for Log Stores to ensure durability, and asynchronous replication for Page Stores to improve scalability, latency, and availability.


But hey, why do we disaggregate in the first place?

Traditional databases were designed for local disks and dedicated servers. In the cloud, this model wastes resources as shown in Figure 1. Each MySQL replica keeps its own full copy of the data, while the underlying virtual disks already store three replicas for reliability. Three database instances mean nine copies in total, and every transactional update is executed three times. This setup is clearly redundant, costly, and inefficient.

Disaggregation fixes this and also brings true elasticity! Compute and storage are separated because they behave differently. Compute is expensive and variable; storage is cheaper and grows slowly. Compute can be stateless and scaled quickly, while storage must remain durable. Separating them allows faster scaling, shared I/O at storage, better resource use, and the capability of scaling compute to zero and restarting quickly when needed.


Architecture overview

Taurus has two physical layers, compute and storage, and four logical components: Log Stores, Page Stores, the Storage Abstraction Layer (SAL), and the database front end. Keeping only two layers minimizes cross-network hops.

The database front end (a modified MySQL) handles queries, transactions, and log generation. The master handles writes; read replicas serve reads.

Log Store stores (well duh!) write-ahead logs as fixed-size, append-only objects called PLogs. These are synchronously replicated across three nodes. Taurus favors reconfiguration-based replication: if one replica fails or lags, a new PLog is created. Metadata PLogs track active PLogs.

Page Store materializes/manages 10 GB slices of page data. Each page version is identified by an LSN, and the Page Store can reconstruct any version. Pages are written append-only, which is 2–5x faster than random writes and gentler on flash. Each slice maintains a lock-free Log Directory mapping (page, version) to log offset. Consolidation of logs into pages happens in memory. Taurus originally prioritized by longest chain first, but then switched to oldest unapplied write first to prevent metadata buildup. A local buffer pool accelerates log application. For cache eviction, Taurus finds that LFU (least frequently used) performs better than LRU (least recently used), because it keeps hot pages in cache longer, reducing I/O and improving consolidation throughput.

Storage Abstraction Layer (SAL) hides the storage complexity from MySQL by serving as an intermediary. It coordinates between Log Stores and Page Stores, manages slice placement, and tracks the Cluster Visible LSN, the latest globally consistent point. SAL advances CV-LSN only when the logs are durable in Log Stores and at least one Page Store has acknowledged them. SAL also batches writes per slice to reduce small I/Os.
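
For concreteness, here is one way to read that advancement rule as code; the per-slice granularity is my assumption, the rest follows the sentence above:

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: CV-LSN may advance only to a point that is durable in the Log
// Stores and acknowledged by at least one Page Store replica of every slice.
uint64_t clusterVisibleLsn(uint64_t durableInLogStores,
                           const std::vector<std::vector<uint64_t>>& ackedPerSliceReplica) {
   uint64_t cv = durableInLogStores;
   for (const auto& replicas : ackedPerSliceReplica) {
      uint64_t best = 0;                     // the most caught-up replica of this slice
      for (uint64_t lsn : replicas) best = std::max(best, lsn);
      cv = std::min(cv, best);               // need at least one ack per slice
   }
   return cv;
}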


Write path and replication

Did you notice the lack of LogStore-to-PageStore communication in Figure 2 and Figure 3? The paper does not address this directly, but yes, there is no direct LogStore-to-PageStore communication. The SAL in the master mediates this instead. SAL first writes logs to the Log Stores for durability. Once acknowledged, SAL forwards the same logs to the relevant Page Stores. This ensures that Page Stores only see durable logs and lets SAL track exactly what each replica has received. SAL monitors per-slice persistent LSNs for Page Stores, and resends missing logs from the Log Stores if it detects regressions.

I think this choice adds coupling and complexity. A chain-replication design, where LogStores streamed logs directly to PageStores, would simplify the system. SAL wouldn't need to track every PageStore’s persistent LSN, and log truncation could be driven by the LogStores once all replicas confirmed receipt, instead of being tracked by SAL.


Read path

Database front ends read data at page granularity. A dirty page in the buffer pool cannot be evicted until its logs have been written to at least one Page Store replica. This ensures that the latest version is always recoverable.

As mentioned above, SAL maintains the last LSN sent per slice. Reads are routed to the lowest-latency Page Store replica. If one is unavailable or behind, SAL retries with others.

Read replicas don't stream WAL directly from the master. Instead, the master publishes which PLog holds new updates. Replicas fetch logs from the Log Stores, apply them locally, and track their visible LSN. They don't advance past the Page Stores' persisted LSNs, keeping reads consistent. This design keeps replica lag below 20 ms even under high load and prevents the master from becoming a bandwidth bottleneck.


Recovery model

If a Log Store fails temporarily, writes to its PLogs pause. For long failures, the cluster re-replicates its data to healthy nodes.

Page Store recovery is more involved. After short outages, a Page Store gossips with peers to catch up. For longer failures, the system creates a new replica by copying another's data. If recent logs were lost before replication, SAL detects gaps in persistent LSNs and replays the missing records from Log Stores. Gossip runs periodically but can be triggered early when lag is detected.

If the primary fails, SAL ensures all Page Stores have every log record persisted in Log Stores. This is the redo phase (similar to ARIES). Then the database front end performs undo for in-flight transactions.


Nitpicks

I can't refrain from bringing up a couple of issues.

First, RDMA appears in Figure 2 as part of the storage network but then disappears entirely until a brief mention in the final "future work" paragraph.

Second, the evaluation section feels underdeveloped. It lacks the depth expected from a system of this ambition. I skipped detailed discussion of this section in my review, as it adds little insight beyond what is discussed in the protocols. 

November 08, 2025

I Want You to Understand Chicago

I want you to understand what it is to live in Chicago.

Every day my phone buzzes. It is a neighborhood group: four people were kidnapped at the corner drugstore. A friend a mile away sends a Slack message: she was at the scene when masked men assaulted and abducted two people on the street. A plumber working on my pipes is upset, and I find out that two of his employees were kidnapped that morning. A week later it happens again.

An email arrives. Agents with guns have chased a teacher into the school where she works. They did not have a warrant. They dragged her away, ignoring her and her colleagues’ pleas to show proof of her documentation. That evening I stand a few feet from the parents of Rayito de Sol and listen to them describe, with anguish, how good Ms. Diana was to their children. What it is like to have strangers with guns traumatize your kids. For a teacher to hide a three-year-old child for fear they might be killed. How their relatives will no longer leave the house. I hear the pain and fury in their voices, and I wonder who will be next.

Understand what it is to pray in Chicago. On September 19th, Reverend David Black, lead pastor at First Presbyterian Church of Chicago, was praying outside the ICE detention center in Broadview when a DHS agent shot him in the head with pepper balls. Pepper balls are never supposed to be fired at the head, because, as the manufacturer warns, they could seriously injure or even kill. “We could hear them laughing as they were shooting us from the roof,” Black recalled. He is not the only member of the clergy ICE has assaulted. Methodist pastor Hannah Kardon was violently arrested on October 17th, and Baptist pastor Michael Woolf was shot by pepper balls on November 1st.

Understand what it is to sleep in Chicago. On the night of September 30th, federal agents rappelled from a Black Hawk helicopter to execute a raid on an apartment building on the South Shore. Roughly three hundred agents deployed flashbangs, busted down doors, and took people indiscriminately. US citizens, including women and children, were grabbed from their beds, marched outside without even a chance to dress, zip-tied, and loaded into vans. Residents returned to find their windows and doors broken, and their belongings stolen.

Understand what it is to lead Chicago. On October 3rd, Alderperson Jesse Fuentes asked federal agents to produce a judicial warrant and allow an injured man at the hospital access to an attorney. The plainclothes agents grabbed Fuentes, handcuffed her, and took her outside the building. Her lawsuit is ongoing. On October 21st, Representative Hoan Huynh was going door-to-door to inform businesses of their immigration rights when he was attacked by six armed CBP agents, who boxed in his vehicle and pointed a gun at his face. Huynh says the agents tried to bash open his car window.

Understand what it is to live in Chicago. On October 9th, Judge Ellis issued a temporary restraining order requiring that federal agents refrain from deploying tear gas or shooting civilians without an imminent threat, and requiring two audible warnings. ICE and CBP have flouted these court orders. On October 12th, federal agents shoved to the ground an attorney who tried to help a man being detained in Albany Park. Agents refused to identify themselves or produce a warrant, then deployed tear gas without warning. On October 14th, agents rammed a car on the East Side, then tear-gassed neighbors and police.

On October 23rd, federal agents detained seven people, including two U.S. citizens and an asylum seeker, in Little Village. Two worked for Alderperson Michael Rodriguez: his chief of staff Elianne Bahena, and police district council member Jacqueline Lopez. Again in Little Village, agents tear-gassed and pepper-sprayed protestors, seizing two high school students and a security guard, among others. Alderperson Byron Sigcho-Lopez reported that agents assaulted one of the students, who had blood on his face. On October 24th, agents in Lakeview emerged from unmarked cars, climbed a locked fence to enter a private yard, and kidnapped a construction worker. As neighbors gathered, they deployed four tear-gas canisters. That same day, a few blocks away, men with rifles jumped out of SUVs and assaulted a man standing at a bus stop.

“They were beating him,” said neighbor Hannah Safter. “His face was bleeding”.

They returned minutes later and attacked again. A man from the Laugh Factory, a local comedy club, had come outside with his mother and sister. “His mom put her body in between them, and one of the agents kicked her in the face”.

Understand what it is to raise a family in Chicago. The next day, October 25th, federal agents tear-gassed children in Old Irving Park. Again, no warnings were heard. On October 26th, agents arrested a 70-year-old man and threw a 67-year-old woman to the ground in Old Irving Park, then tear-gassed neighbors in Avondale. That same day, federal agents deployed tear gas at a children’s Halloween parade in Old Irving Park.

“Kids dressed in Halloween costumes walking to a parade do not pose an immediate threat to the safety of a law enforcement officer. They just don’t. And you can’t use riot control weapons against them,” Judge Ellis said to Border Patrol chief Gregory Bovino.

Understand how the government speaks about Chicago. On November 3rd, paralegal Dayanne Figueroa, a US citizen, was driving to work when federal agents crashed into her car, drew their guns, and dragged her from the vehicle. Her car was left behind, coffee still in the cup holder, keys still in the car. The Department of Homeland Security blamed her, claiming she “violently resisted arrest, injuring two officers.” You can watch the video for yourself.

“All uses of force have been more than exemplary,” Bovino stated in a recent deposition. He is, as Judge Ellis has stated, lying. Bovino personally threw a tear-gas canister in Little Village. He claimed in a sworn deposition that he was struck in the head by a rock before throwing the canister, and when videos showed no rock, admitted that he lied about the event. When shown video of himself tackling peaceful protestor Scott Blackburn, Bovino refused to acknowledge that he tackled the man. Instead, he claimed, “That’s not a reportable use of force. The use of force was against me.”

“I find the government’s evidence to be simply not credible,” said Judge Ellis in her November 6th ruling. “The use of force shocks the conscience.”

Understand what it is to be Chicago. To carry a whistle and have the ICIRR hotline in your phone. To wake up from nightmares about shouting militiamen pointing guns at your face. To rehearse every day how to calmly refuse entry, how to identify a judicial warrant, how to film and narrate an assault. To wake to helicopters buzzing your home, to feel your heart rate spike at the car horns your neighbors use to alert each other to ICE and CBP enforcement. To know that perhaps three thousand of your fellow Chicagoans have been disappeared by the government, but no one really knows for sure. To know that many of those seized were imprisoned a few miles away, as many as a hundred and fifty people in a cell, denied access to food, water, sanitation, and legal representation. To know that many of these agents—masked, without badge numbers or body cams, and refusing to identify themselves—will never face justice. To wonder what they tell their children.

The masked thugs who attack my neighbors, who point guns at elected officials and shoot pastors with pepper balls, who tear-gas neighborhoods, terrify children, and drag teachers and alderpeople away in handcuffs are not unprecedented. We knew this was coming a year ago, when Trump promised mass deportations. We knew it was coming, and seventy-seven million of us voted for it anyway.

This weight presses upon me every day. I am flooded with stories. There are so many I cannot remember them all; cannot keep straight who was gassed, beaten, abducted, or shot. I write to leave a record, to stare at the track of the tornado which tears through our city. I write to leave a warning. I write to call for help.

I want you to understand, regardless of your politics, the historical danger of a secret police. What happens when a militia is deployed in our neighborhoods and against our own people. Left unchecked, their mandate will grow; the boundaries of acceptable identity and speech will shrink. I want you to think about elections in this future. I want you to understand that every issue you care about—any hope of participatory democracy—is downstream of this.

I want you to understand what it is to love Chicago. To see your neighbors make the heartbreaking choice between showing up for work or staying safe. To march two miles long, calling out: “This is what Chicago sounds like!” To see your representatives put their bodies on the line and their voices in the fight. To form patrols to walk kids safely to school. To join rapid-response networks to document and alert your neighbors to immigration attacks. For mutual aid networks to deliver groceries and buy out street vendors so they can go home safe. To talk to neighbor after neighbor, friend after friend, and hear yes, yes, it’s all hands on deck.

I want you to understand Chicago.