A curated list of database news from authoritative sources.

January 08, 2025

I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades

I recently came across the Occult paper (NSDI'17) during my series on "The Use of Time in Distributed Databases." I had high expectations, but my in-depth reading surfaced significant concerns about its contributions and claims. Let me share my analysis, as there are still many valuable lessons to learn from Occult about causality maintenance and distributed systems design.


The Core Value Proposition

Occult (Observable Causal Consistency Using Lossy Timestamps) positions itself as a breakthrough in handling causal consistency at scale. The paper's key claim is that it's "the first scalable, geo-replicated data store that provides causal consistency without slowdown cascades."

The problem they address is illustrated in Figure 1, where a slow/failed shard A (with delayed replication from master to secondary) can create cascading delays across other shards (B and C) due to dependency-waiting during write replication. This is what the paper means by "slowdown cascades". Occult's solution shifts this write-blocking to read-blocking: it eliminates dependency-blocking during write replication, and instead, at read time, it delays reads served by lagging shards until they reflect everything the client has already observed, thereby preserving causal consistency.


Questionable Premises

The paper presents dependency-blocking across shards for writes as a common problem, yet I struggle to identify any production systems that implement causal consistency (or any stronger consistency) this way. The cited examples are all academic systems like COPS, not real-world databases.

More importantly, the paper's claim of being "first" overlooks many existing solutions. Spanner (2012) had already demonstrated how to handle this challenge effectively using synchronized clocks, achieving even stronger consistency guarantees (strict-serializability) without slowdown cascades. Spanner already does what Occult proposes to do: it shifts write-blocking to read-blocking, and also uses synchronized clocks to help for reads.

The paper acknowledges Spanner only briefly in related work, noting its "heavier-weight mechanisms" and, perhaps as a result, its tendency to abort more often - but this comparison feels incomplete given Spanner's superior consistency model and production-proven success.


The Secondary Reads Trade-off

A potential justification for Occult's approach is enabling reads from nearby secondaries without client stickiness. However, the paper doesn't analyze this trade-off at all. Why not just read from the masters? When are secondary reads sufficiently beneficial to justify the added complexity? The paper itself notes in the introduction that inconsistency at secondaries is extremely rare - citing Facebook's study showing that fewer than six out of every million reads violate causal consistency even in an eventually-consistent data store.

In examining potential justifications for secondary reads, I see only two viable scenarios. First, secondary reads might help when the primary is CPU-constrained - but this only applies under very limited circumstances, typically with the smallest available VM instances, which is unlikely for production systems. Second, there's the latency consideration: while cross-availability-zone latencies wouldn't justify secondary reads in this case, cross-region latencies might make them worthwhile for single-key operations. However, even this advantage diminishes for transactional reads within general transactions, where reading from masters is more sensible to avoid transaction aborts due to conflicts, which would waste all the work done as part of the transaction.


The Client Session Problem

The paper's handling of client sessions (or rather the lack thereof) reveals another limitation. Their example (from the conference presentation) wholesale couples unrelated operations - like linking a social media post to an academic document share - into the same dependency chain. Modern systems like MongoDB's causal consistency tokens (introduced in 2017, the same year) provide a better approach to session management.

Comparing Occult with Spanner reveals some interesting nuances. While Spanner can read from secondaries, it requires them to catch up to the current time (T_now). Occult takes a different approach by maintaining client causal shardstamps, allowing reads from secondaries without waiting for T_now synchronization. This theoretically enables causal consistency using earlier clock values than Spanner's current-time requirement.

However, this theoretical advantage comes with significant practical concerns, as discussed above in the secondary reads trade-off. Occult shifts substantial complexity to the client side, but the paper inadequately addresses the overhead and coordination requirements this imposes. The feasibility of expecting clients (which are typically frontend web proxies in the datacenter) to handle such sophisticated coordination with servers remains questionable.


The Fatal Flaw

The most concerning aspect of Occult emerges in Section 6 on fault tolerance. The paper reveals that correctness under crash failures requires synchronous replication of the master using Paxos or similar protocols - before the "asynchronous" replication to secondaries. Say what!? This requirement fundamentally undermines the system's claimed benefits from asynchronous replication.

Let me quote from the paper: "Occult exhibits a vulnerability window during which writes executed at the master may not yet have been replicated to slaves and may be lost if the master crashes. These missing writes may cause subsequent client requests to fail: if a client c’s write to object o is lost, c cannot read o without violating causality. This scenario is common to all causal systems for which clients do not share fate with the servers to which they write.

Occult's client-centric approach to causal consistency, however, creates another dangerous scenario: as datacenters are not themselves causally consistent, writes can be replicated out of order. A write y that is dependent on a write x can be replicated to another datacenter despite the loss of x, preventing any subsequent client from reading both x and y.

Master failures can be handled using well-known techniques: individual machine failures within a datacenter can be handled by replicating the master locally using chain-replication or Paxos, before replicating asynchronously to other replicas."

Unfortunately, the implementation ignores fault-tolerance, and the evaluation omits crash scenarios entirely, focusing only on node slowdown. This significant limitation isn't adequately addressed in the paper's discussion; it is mentioned in just one paragraph of the fault-tolerance section.


Learnings

Despite these criticisms, studying this paper has been valuable. It prompts important discussions about:

  • Trade-offs in causality maintenance
  • The role of synchronized clocks in distributed systems
  • The importance of evaluating academic proposals against real-world requirements

For those interested in learning more, I recommend watching the conference presentation, which provides an excellent explanation of the protocol mechanics.


The protocol

At its core, Occult builds upon vector clocks with some key modifications. Servers attach vector-timestamps to objects and track shard states using "shardstamps", and clients also maintain vector-timestamps to keep tabs on their interactions with the servers. Like vector clocks, shardstamp updates occur through pairwise maximum operations between corresponding entries.
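
As a rough illustration, the entrywise-maximum merge and the read-consistency check might look like this (a minimal Python sketch with made-up data shapes, not the paper's implementation):

```python
# Illustrative sketch of Occult-style shardstamp maintenance. A shardstamp
# here is a dict {shard_id: counter}; clients and servers merge them
# entrywise, exactly like vector clocks.

def merge_shardstamps(a, b):
    """Entrywise pairwise maximum of two shardstamps."""
    merged = dict(a)
    for shard, stamp in b.items():
        merged[shard] = max(merged.get(shard, 0), stamp)
    return merged

def read_is_consistent(client_stamp, replica_stamp):
    """A replica may serve a read only if it has caught up to everything
    the client has already observed; otherwise the client must wait,
    retry, or fail over to the master."""
    return all(replica_stamp.get(s, 0) >= v for s, v in client_stamp.items())
```

The consistency check is where the read-blocking happens: a lagging secondary simply fails the check until it catches up.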



This approach works straightforwardly for single-key updates but becomes more complex for transactions, where a commit timestamp must be chosen and applied consistently across all objects involved. Occult implements transactions using Optimistic Concurrency Control (OCC), but with specific validation requirements. The validation phase must verify two critical properties: first, that the transaction's read timestamp represents a consistent cut across the system, and second, that no conflicting updates occurred before commit. Transaction atomicity is preserved by making writes causally dependent, ensuring clients see either all or none of a transaction's writes - all without introducing write delays that could trigger slowdown cascades.
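
A minimal sketch of those two validation checks, under assumed data shapes (the real protocol validates against shardstamps rather than plain version numbers):

```python
# Rough sketch of the two OCC validation checks (my own simplified shapes,
# not Occult's actual code).

def validate(read_set, latest_committed, claimed_cut):
    """read_set: {key: version the transaction observed}
    latest_committed: {key: newest committed version at validation time}
    claimed_cut: {key: newest version the transaction claims to depend on}"""
    # Check 1: the reads form a consistent cut - no observed version lies
    # above the cut the transaction claims to have seen.
    if any(v > claimed_cut.get(k, 0) for k, v in read_set.items()):
        return False
    # Check 2: no conflicting update committed after the transaction read.
    return all(latest_committed.get(k, 0) == v for k, v in read_set.items())
```

If either check fails, the transaction aborts and retries, as in standard OCC.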

The protocol employs several optimizations. To manage timestamp overhead, it uses both structural and temporal compression techniques. It also leverages loosely synchronized clocks to avoid spurious dependencies between shards, as I discuss below. A key contribution is PC-PSI (Per-Client Parallel Snapshot Isolation), a relaxed version of PSI that requires total ordering only within client sessions rather than globally. This modification enables more efficient transaction processing while maintaining essential consistency guarantees.


Discussion on the use of synchronized clocks

A critical aspect of Occult is the requirement for monotonic clocks - clock regression would break causality guarantees for reads. While Lamport logical clocks could provide this monotonicity, synchronized physical clocks serve a more sophisticated purpose: they prevent artificial waiting periods caused by structural compression of timestamps, which maps multiple shards' timestamps onto the same entry of the compressed shardstamp via a modulo-n mapping.

The paper illustrates this problem in Section 5 with a clear example: when two shards (i and j) map to the same compressed entry and have vastly different shardstamps (100 and 1000), a client writing to j would fail the consistency check when reading from a slave of i until i has received at least 1000 writes. If shard i never reaches 1000 writes, the client must perpetually fail over to reading from i's master shard. Rather than requiring explicit coordination between shards to solve this problem, Occult leverages loosely synchronized physical clocks to bound these differences naturally. This guarantees that shardstamps differ by no more than the relative offset between their clocks, independent of the write rate on different master shards.
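
The modulo-n mapping can be sketched as follows (illustrative code; the shard ids and entry counts are hypothetical):

```python
# Structural compression sketch: fold a full shardstamp into n entries via
# shard_id mod n, keeping the max of any colliding shards (as in Section 5).

def compress(shardstamp, n):
    """shardstamp: {shard_id: counter} -> list of n compressed entries."""
    compressed = [0] * n
    for shard, stamp in shardstamp.items():
        slot = shard % n
        compressed[slot] = max(compressed[slot], stamp)
    return compressed
```

With n = 4, shards 1 and 5 collide in slot 1; a secondary of shard 1 sitting at shardstamp 100 then appears to lag behind a client whose compressed entry reads 1000, forcing exactly the spurious wait the paper describes.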

This approach resonates with my earlier work on Hybrid Logical Clocks (2013), where we argued for combining synchronized physical clocks with Lamport clock-based causality tracking. The effectiveness of this strategy is demonstrated dramatically in Occult's results - they achieve compression from 16,000 timestamps down to just 5 timestamps through synchronized clocks.

Occult's use of synchronized clocks for causality compression ties into our broader theme about the value of shared temporal reference frames in distributed systems. This connection between physical time and distributed system efficiency deserves deeper exploration.

January 07, 2025

Convex Cookbook: Dynamic Query Builders

You can write a Convex query whose structure -- which index/order/filters to apply, if any -- depends on runtime factors. This article gives a recipe for building queries dynamically.

January 06, 2025

Read data from dropped columns

It’s hard to destroy data. Even when a column is dropped, the data is still physically there on the data page. We can use the undocumented/unsupported command DBCC PAGE to look at it. Example: First, create a table based on English values from sys.messages: use tempdb; go /* create table based on english values from […]

The post Read data from dropped columns first appeared on Michael J. Swart.


December 31, 2024

How I run a coffee club

I started the NYC Systems Coffee Club in December of 2023. It's gone pretty well! I regularly get around 20 people each month. You bring a drink if you feel like it and you hang out with people for an hour or two.

There is no agenda, there is no speaker, there is no structure. The only "structure" is that when the circle of people talking to each other gets too big, I break it up into two smaller circles so we can get more conversations going.

People tend to talk in a little circle and then move around over time. It's basically no different than a happy hour except it is over a non-alcoholic drink and it's in the morning.

All I have to do as the organizer is periodically tell people about the Google Form to fill out. I got people to sign up to the list by posting about this on Twitter and LinkedIn. And then once a month I send an email bcc-ing everyone on the list and ask them to respond for an invite.

The first 20 people to respond get a calendar invite.

I mention all of this because people ask how they can start a coffee club in their city. They ask how it works. But it's very simple! It's one of the least-effortful ways to bring people together in your city.

If your city does not have indoor public spaces, you could use a food court, or a cafe, or a park during months where it is warm.

For example, the Cobble Hill Computer Coffee Club is one that meets outdoors at a park.

Good luck! :)

December 30, 2024

Use of Time in Distributed Databases (part 3): Synchronized clocks in databases

This is part 3 of our "Use of Time in Distributed Databases" series. In this post, we explore how synchronized physical clocks enhance database systems, focusing on research and prototype databases. Discussion of time's role in production databases will follow in our next post.

To begin, let's revisit the utility of synchronized clocks in distributed systems. As highlighted in Part 1, synchronized clocks provide a shared time reference across distributed nodes and partitions. For simple, single-key replication tasks, such precision is often unnecessary, and leader-based approaches such as MultiPaxos or Raft are much more appropriate. Even WPaxos might be considered if you need a WAN deployment. Of course, if you want to go very fancy with a leaderless design, such as those in the EPaxos family/Tempo/Accord, then dependency graphs and time synchronization re-enter the picture.

The true value of synchronized clocks becomes apparent in distributed multi-key operations. By aligning nodes to a shared reference frame, these clocks eliminate the need for some coordination message exchanges across nodes/partitions, and help cut down latency and boost throughput.

There are two key enhancements that catalyze/facilitate the use of synchronized time in databases.

The first is Multiversion Concurrency Control. MVCC allows each key to maintain multiple versions over time, enabling operations like reading "at a timestamp" or writing "at a timestamp." This simplifies transactional reads by offering consistent snapshots of the database at a specific moment. While MVCC enhances efficiency, it is not strictly required. Bernstein & Goodman’s groundbreaking basic Timestamp Ordering (TSO) algorithm (VLDB'80) operated without MVCC, relying instead on single-version storage and timestamping. MVCC, however, reduces contention and improves performance, making it a valuable enhancement employed by several of the systems (Clock-SI, GentleRain, Scalable OLTP) surveyed in this post.

The second is more tightly synchronized clocks. Tighter bounds on clock precision mean less time spent waiting to account for potential skew. Of course, if you have tightly synchronized clocks as Spanner does, you can choose to provide strictly serializable transactions (which we will discuss in our next post). But tightly synchronized clocks were not publicly available before the 2020s, so most of the systems we discuss today make do with loosely synchronized clocks, and in order not to impose too much wait-time, they go with snapshot isolation (SI). This is a very smart tradeoff to make because, despite the prevalence of serializability in academia, read-committed, repeatable-read, and snapshot isolation are dominantly used in practice/industry.
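
To make the "waiting to account for skew" concrete, here is a toy sketch (my own illustration, with a made-up uncertainty bound): a server delays an operation with timestamp ts until its local clock passes ts plus the uncertainty bound, so no correctly synchronized clock elsewhere can still read a time below ts.

```python
# Toy commit-wait sketch. EPS is an assumed clock-uncertainty bound; the
# clock and sleep functions are injected so the sketch is deterministic.

EPS = 0.005  # assumed uncertainty bound, in seconds

def wait_out_uncertainty(ts, clock, sleep):
    """Block until the local clock is past ts + EPS."""
    while clock() < ts + EPS:
        sleep(ts + EPS - clock())
```

Tighter synchronization shrinks EPS and hence the wait, which is precisely why better clocks translate into lower latency.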

In this post, we explore research and prototype systems that employ synchronized clocks, ummm...,  in chronological order. Early systems leveraged synchronized clocks primarily for read-only transactions and snapshots, reaping low-hanging fruit. Over time, these systems evolved to tackle read-write transactions and employ more advanced techniques. As we progress through this timeline, you’ll see how synchronized clocks take on increasingly critical roles in database design. 

We cover the following:

  • Granola: Low overhead distributed transaction coordination (ATC'12)
  • Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks (SRDS'13)
  • GentleRain: Cheap and Scalable Causal Consistency with Physical Clocks (SOCC'14)
  • Scalable Causal Consistency with No Slowdown Cascades (NSDI'17)
  • Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks (VLDB'23)
  • Scalable OLTP in the Cloud: What’s the BIG DEAL? (CIDR'24)

As the titles hint, we'll see below that synchronized clocks have been employed to reduce coordination and achieve scalability in these distributed databases.


Granola: Low overhead distributed transaction coordination (ATC'12)

The Granola paper aimed to provide a low-overhead approach to distributed transaction coordination tailored for one-shot (non-interactive) transactions. The system uses loosely synchronized clocks to enhance throughput without relying on them for correctness. (After all, Barbara Liskov is an author on this paper, and remember what she said in her PODC 1991 paper.)

Granola operates in two distinct modes, Timestamp Mode and Locking Mode, switching between them on-the-fly based on the characteristics of the transactions being processed.

In Timestamp Mode, the system eschews locking to enable timestamp-based serializability, excelling at handling single-repository and independent (local-read) distributed transactions with high throughput and minimal overhead.

However, when coordinated transactions requiring remote reads or cross-node dependencies arrive, Granola transitions the affected repositories to Locking Mode. This ensures serializability through traditional locking mechanisms. Once these coordinated transactions are completed, repositories can revert to Timestamp Mode, restoring efficiency.


Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks (SRDS'13)

The Clock-SI paper implements snapshot isolation (SI) in partitioned multi-version data stores using loosely synchronized clocks. They ensure that read-only transactions always observe consistent snapshots by leveraging local physical clocks for assigning snapshot and commit timestamps. They compensate for clock skew by introducing response delays that wait out the clock uncertainty bounds and account for the pending commit of an update transaction.


For read operations, transactions observe the version with the highest version number smaller than their snapshot timestamp. This ensures consistent reads while allowing read-only transactions to commit unconditionally. Clock-SI also delays reads to account for pending updates from concurrent transactions.
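
The version-selection rule can be sketched as follows (hypothetical data layout: versions kept as (commit_ts, value) pairs; not Clock-SI's actual code):

```python
# Clock-SI-style read rule sketch: return the version with the largest
# commit timestamp strictly below the transaction's snapshot timestamp.

def read_version(versions, snapshot_ts):
    """versions: list of (commit_ts, value) pairs for one key.
    Returns the value visible at snapshot_ts, or None if none exists."""
    candidates = [(ts, v) for ts, v in versions if ts < snapshot_ts]
    if not candidates:
        return None
    return max(candidates)[1]  # highest commit_ts below the snapshot
```

In the full protocol the read is additionally delayed if a concurrent transaction with a smaller commit timestamp is still mid-commit, so that the snapshot never misses a version it should include.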

Employing Hybrid Logical Clocks (HLC) would help avoid the delay in Figure 1 because HLC also encodes/integrates happened-before information in addition to physical clocks.


GentleRain: Cheap and Scalable Causal Consistency with Physical Clocks (SOCC'14)  

GentleRain is a follow-up to the ORBE (SOCC'13) multi-version database we reviewed in Part 2. ORBE used a matrix of vector clocks for dependency checking. GentleRain aims to reduce the metadata piggybacked on update propagation and to eliminate complex dependency checking procedures for causal consistency. It does this by employing synchronized physical clocks to encode/compress/replace complex dependency tracking. Unlike its predecessor, GentleRain uses a single physical clock timestamp for updates. The tradeoff is that updates are delayed until all partitions in a data center have seen all previous updates (updates with smaller timestamps), but this ensures causality without the need for explicit dependency checks or extra metadata.
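
A toy sketch of this visibility rule (the stabilization point below is GST-like; the names are illustrative, not GentleRain's actual code):

```python
# GentleRain-style visibility sketch: an update with timestamp t becomes
# visible only once every partition in the data center has advanced its
# clock to at least t, i.e. has applied all updates with smaller timestamps.

def stable_time(partition_clocks):
    """The minimum clock across partitions bounds what is safely visible."""
    return min(partition_clocks)

def is_visible(update_ts, partition_clocks):
    return update_ts <= stable_time(partition_clocks)
```

A single slow partition holds back the stable time for everyone, which is exactly the visibility-delay tradeoff the paper accepts in exchange for dropping explicit dependency metadata.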

The delay in PUT operations that GentleRain requires can affect the write throughput of the key-value store. CausalSpartan solves this problem by replacing GentleRain's physical clocks with Hybrid Logical Clocks (HLC).


Scalable Causal Consistency with No Slowdown Cascades (NSDI'17)

Occult (Observable Causal Consistency Using Lossy Timestamps) introduces a novel approach to implementing causal consistency in geo-replicated data stores by shifting enforcement to the client side. Rather than attempting to enforce causal consistency within the data store itself, Occult ensures clients observe a causally consistent view of the system. This strategic shift to client-centric specification of causal consistency must have seeded the later more general treatment of client-centric isolation levels.

Another key innovation of Occult is its relaxation of the Parallel Snapshot Isolation (PSI) requirements. While PSI demands a total ordering of transactions committed at the same replica, PC-PSI (Per-Client Parallel Snapshot Isolation) only requires total ordering per client session. This relaxation, implemented through a combination of loosely synchronized clocks and hybrid logical clocks (HLC), enables lightweight dependency tracking without sacrificing consistency guarantees. Combined with the client-centric specification of causal consistency, PC-PSI enables Occult to avoid slowdown cascades, removing a significant barrier to deploying causal consistency at scale.

Occult also provides comprehensive support for read/write transactions, moving beyond the limited read-only and write-only transactions common in earlier approaches. Occult guarantees that all transactions read from causally consistent snapshots without requiring coordination during asynchronous replication; instead, the client either retries the read locally or reads from the master. Occult achieves atomicity by making writes causally dependent on each other, ensuring that causality is used to enforce stronger consistency properties.


Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks (VLDB'23)

Nezha is not a database per se, but its approach to using time synchronization for consensus and state-machine replication is noteworthy and could be useful in distributed database systems. The protocol leverages synchronized clocks to decrease latency and increase throughput by offloading traditional leader or sequencer-based ordering to synchronized clocks. This enables decentralized coordination without relying on network routers or sequencers, while using time synchronization on a best-effort basis without impacting correctness.

At the core of Nezha is the Deadline-Ordered Multicast (DOM) primitive, which assigns deadline timestamps to requests using synchronized clocks and only delivers them after the deadline is reached, in timestamp order. This creates a buffer that helps maintain consistent ordering across receivers. The system operates with a dedicated stable leader involved in both fast and slow path operations, where each replica follows the leader's log rather than attempting to piece together logs across multiple leaderless nodes. In the fast path, when time synchronization and message delivery work well, Nezha achieves one-RTT consensus.
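
A minimal sketch of the DOM buffering rule (illustrative only; a real receiver would consult its synchronized clock rather than take an explicit `now` argument):

```python
import heapq

# Deadline-Ordered Multicast sketch: hold each request until its deadline
# has passed, then deliver requests in deadline-timestamp order.

class DomReceiver:
    def __init__(self):
        self._buf = []  # min-heap of (deadline, request)

    def receive(self, deadline, request):
        heapq.heappush(self._buf, (deadline, request))

    def deliverable(self, now):
        """Pop and return all requests whose deadline has passed, ordered
        by deadline timestamp."""
        out = []
        while self._buf and self._buf[0][0] <= now:
            out.append(heapq.heappop(self._buf)[1])
        return out
```

Because every receiver applies the same rule against the same synchronized timeline, requests that arrive in different network orders are still delivered in a consistent deadline order.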

The protocol's design allows for high scalability as multiple proxies can send their DOM requests using local clocks without inter-proxy communication, with time synchronization ensuring consistent request ordering at the replicas. The leader executes requests speculatively while replicas initially just acknowledge message delivery, executing requests later after confirming the leader's order. If the fast path conditions aren't met (when a super-majority quorum doesn't have the same value as the leader), the system falls back to a more traditional asynchronous slow path where replicas stream the log from the leader. The evaluation suggests that Nezha significantly outperforms previous protocols, achieving order-of-magnitude improvements in throughput.


Scalable OLTP in the Cloud: What’s the BIG DEAL? (CIDR'24)

Pat Helland's prototype database architecture aims to show how we can build scalable OLTP systems by leveraging time as a primary organizing principle. The design moves away from traditional multi-version concurrency control (MVCC) databases where reads and writes contend for access to a "current" value at a home location, and instead organizes data primarily by creation time to achieve better scaling. This temporal-first approach eliminates the need for pre-assigned record homes, allowing the database to seamlessly adapt to workload changes.

The system uses a combination of worker servers and owner servers to manage transactions. Workers execute transactions and maintain their own transaction logs, while owners handle concurrency control by verifying that concurrent transactions don't create conflicting updates. The architecture uses time extensively in its operation: workers guess future commit times for transactions, owner servers align commit times for records, and all record versions are organized first by time and then by key in the LSM (log structured merge tree) storage.

Time also plays a crucial role in providing external consistency and snapshot isolation in this architecture. By using current time (T-now) as the snapshot time, the system ensures that new incoming requests see all previously exposed data, even across different database connections. Everything in the database is versioned by record-version commit time, with reads accessing old record versions as of a past snapshot and row-locks ensuring locked records remain unchanged until commit time. This time-based organization allows the database to scale without requiring coordination across disjoint transactions that are reading and updating different records.