a curated list of database news from authoritative sources

December 26, 2024

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases

This is part 2 of our "Use of Time in Distributed Databases" series. We talk about the use of logical clocks in databases in this post. We consider three different approaches:

  • vector clocks
  • dependency graph maintenance
  • epoch service 

In the upcoming posts we will bring physical clocks into the picture for timestamping; the systems covered in part 2 involve no (or almost no) physical clocks.



1. Vector clocks

Dynamo: Amazon's highly available key-value store (SOSP'07)

Dynamo employs sloppy quorums and hinted handoff, and uses version vectors (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to feed the conflict reconciliation function. Dynamo provides eventual consistency thanks to this reconciliation function and the conflict detection done by version vectors.
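
To make this concrete, here is a minimal sketch (not Dynamo's actual code) of how version vectors detect conflicting updates: one version dominates another if every entry is at least as large; if neither dominates, the versions are concurrent and must be handed to the reconciliation function.

```python
# Minimal sketch of version-vector conflict detection (illustrative, not Dynamo's code).
# A version vector maps replica id -> counter for one key's replication group.

def dominates(a: dict, b: dict) -> bool:
    """True if version vector `a` has seen everything `b` has seen."""
    return all(a.get(r, 0) >= c for r, c in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent"   # conflicting updates: reconcile, e.g., via application-level merge

# Replica X and replica Y both updated the key independently:
v1 = {"X": 2, "Y": 1}
v2 = {"X": 1, "Y": 2}
print(compare(v1, v2))    # -> "concurrent"
```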

Cassandra, which provided an open-source implementation of Dynamo, decided to forgo vector clocks in favor of using physical time supplied by the client and a Last-Writer-Wins rule for updating replicas.

So, yeah, somehow the use of vector clocks in datastores fizzled out over time. Maybe the size of the vector clocks to be included in messages was the headache. Or maybe the use of synchronized physical clocks offered more advantages in addition to a single scalar timestamp. Nevertheless, vector clocks may still have applications in version control systems and event logging in distributed systems. And, below we talk about two more systems that use some form of vector clocks.


ORBE (SOCC'13): Causal consistency

ORBE uses vector clocks, organized as a matrix, to represent dependencies. The vector clock has an entry per partition and data center. Physical clocks are used for generating read snapshot times, and ORBE can complete read-only transactions in one round by relying on these loosely synchronized physical clocks. A drawback with ORBE is the large size of timestamps, which the follow-up work GentleRain aimed to address.


The end of a myth: Distributed transactions can scale (VLDB'17)

NAM-DB aims to address scalability challenges in distributed transactions through innovative use of RDMA (Remote Direct Memory Access) and a timestamp oracle design. The timestamp oracle uses a partitionable vector clock approach to manage commit timestamps without contention. It is implemented as a software-based solution where each transaction execution thread maintains its own counter in a timestamp vector, allowing for distributed timestamp management without a central point of contention. Transactions obtain read timestamps by reading the vector and commit timestamps by incrementing their specific vector entry through efficient RDMA operations.

Let's dive into how the commit protocol achieves consistency. When committing, transactions create new timestamps by incrementing their counter, verify and lock their write-sets using RDMA operations, and update the timestamp vector upon success. This design offers several advantages: transaction threads operate independently without synchronization overhead, long-running transactions don't block others, and the system maintains monotonic timestamp progression when stored on a single memory server (though this property may not hold with partitioned storage).
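
A rough sketch of the partitioned timestamp-vector idea follows (structure and names are mine; NAM-DB performs these steps over one-sided RDMA against a memory server rather than in local memory). Each transaction thread owns one slot of the vector: a read timestamp is a snapshot of the whole vector, and a commit timestamp is produced by bumping only the thread's own slot, so threads never contend on a shared counter.

```python
# Sketch of a partitioned timestamp vector (illustrative; not NAM-DB's actual code).

class TimestampVector:
    def __init__(self, num_threads: int):
        self.slots = [0] * num_threads   # one commit counter per transaction thread

    def read_timestamp(self) -> tuple:
        # A read snapshot is simply a copy of the whole vector.
        return tuple(self.slots)

    def commit_timestamp(self, thread_id: int) -> tuple:
        # Only the owning thread ever increments its slot, so no locking is needed.
        self.slots[thread_id] += 1
        return tuple(self.slots)

tv = TimestampVector(num_threads=4)
rts = tv.read_timestamp()        # e.g. (0, 0, 0, 0)
cts = tv.commit_timestamp(2)     # e.g. (0, 0, 1, 0): thread 2 committed
print(rts, cts)
```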



2. Dependency graph maintenance

Scalable Causal Consistency for Wide-Area Storage with COPS (SOSP'11)

COPS introduced a dependency tracking approach for achieving causal consistency in geo-replicated datastores. The system assigns scalar version numbers to objects and maintains causality by having clients track the versions of all objects read in their causal past. When updates are propagated between data centers, they carry their dependencies, and receiving data centers only make updates visible once all dependencies are satisfied. A key feature of COPS is its support for causally consistent read-only transactions, which provide a consistent snapshot of the system. These transactions are implemented through a two-round protocol.
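
A toy sketch of the visibility rule (names are mine, not the COPS API): an update replicated to a remote data center carries its dependency list, and the receiving data center holds it back until every dependency is already visible locally.

```python
# Toy sketch of COPS-style dependency checking at a receiving data center
# (illustrative names; the real system shards keys and uses a two-round read protocol).

visible = {}          # key -> highest version visible in this data center
pending = []          # replicated updates waiting for their dependencies

def deps_satisfied(deps):
    return all(visible.get(k, 0) >= v for k, v in deps)

def apply_replicated_update(key, version, value, deps):
    pending.append((key, version, value, deps))
    progress = True
    while progress:                       # drain until nothing else becomes applicable
        progress = False
        for upd in list(pending):
            k, ver, val, d = upd
            if deps_satisfied(d):
                visible[k] = max(visible.get(k, 0), ver)
                pending.remove(upd)
                progress = True

# An update to "photo" that causally depends on "album" version 3:
apply_replicated_update("photo", 7, b"...", deps=[("album", 3)])
print(pending)   # stays pending until ("album", 3) becomes visible here
```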

COPS chose to perform explicit dependency tracking over using vector clocks. They justified their choice against vector clocks by citing scalability concerns, particularly the O(N) size growth with the number of nodes. They argued that in a datacenter with thousands of nodes, the metadata overhead would become prohibitive. I think they overindexed on N, the number of nodes. N doesn't grow to very large numbers in deployments, and especially not for replication. As another reason, they noted that vector clocks only provide happens-before relationships, and there would still be a need for additional mechanisms like serialization points or explicit dependency checking to enforce causal consistency across the datacenter. I don't get this argument, either. I think they wanted to take a stance for explicit dependency checking rather than the implicit/wholesale causality we get from logical/vector clocks.

This explicit dependency tracking approach influenced later systems, including the EPaxos family of consensus protocols. The principle is the same: each operation carries its dependencies, replicas check those dependencies as operations are replicated, and a value is applied at a node only once its dependencies are satisfied there. Unfortunately, the dependency graphs can grow significantly in pathological cases, and these systems can experience significant slowdowns when dependency lists grow large. Subsequent systems like Occult and Accord/Cassandra (as we will cover in upcoming posts in this series) have shown that combining the dependency tracking approach with loosely synchronized physical clocks can help manage the complexity.


Kronos: The design and implementation of an event ordering service (Eurosys'14)

Kronos introduces a centralized event ordering service for distributed systems that tracks happens-before relationships through a dedicated API. Rather than having individual nodes maintain and propagate dependency information as in COPS, here the applications explicitly register events and define causal relationships with the Kronos service. This approach allows for cross-system dependency management and fine-grained concurrency detection, with the system binding events to a time order as late as possible. While this provides more flexibility in capturing application-specific causality compared to Logical/Vector Clocks (which automatically assume causal dependence between consecutive events on the same node), it comes with the overhead of communicating with the service and searching dependency graphs.
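
To illustrate the shape of such a service (with hypothetical method names, not Kronos's actual API): applications create events, declare happens-before edges explicitly, and ask the service for the order between any two events.

```python
# Hypothetical mock of a centralized event-ordering service in the spirit of Kronos
# (method names are made up for illustration; the real service is a separate process
# reached over RPC, and it searches its dependency graph to answer queries).

class OrderingService:
    def __init__(self):
        self.edges = {}                      # event -> set of events that happen after it

    def create_event(self):
        e = object()
        self.edges[e] = set()
        return e

    def must_happen_before(self, a, b):
        self.edges[a].add(b)                 # application-declared causality, not implicit

    def query(self, a, b):
        def reachable(x, y, seen=None):
            seen = seen or set()
            if x is y:
                return True
            seen.add(x)
            return any(reachable(n, y, seen) for n in self.edges[x] if n not in seen)
        if reachable(a, b):
            return "a happens before b"
        if reachable(b, a):
            return "b happens before a"
        return "concurrent"

svc = OrderingService()
e1, e2 = svc.create_event(), svc.create_event()
svc.must_happen_before(e1, e2)
print(svc.query(e1, e2))    # -> "a happens before b"
```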


 

3. Epoch server

Chardonnay (OSDI'23): use of a centralized epoch service

Chardonnay is an in-memory distributed database that employs a logically-centralized (3 MultiPaxos nodes under the trenchcoat) epoch service, whose sole job is to maintain a monotonic epoch counter. The magic of the epoch counter enters the picture for read-only transactions, but let's first cover the read-write transactions.

For read-write transactions, Chardonnay uses a two-phase approach: first running transactions in "dry run" mode to discover and pin read/write sets in memory, then executing them definitively using 2PC+2PL in-memory for speed. This approach leverages modern datacenter networking being significantly faster than disk I/O, allowing Chardonnay to achieve strictly serializable transactions efficiently by keeping relevant data in memory and avoiding deadlocks through ordered lock acquisition. In that sense, this architecture builds on ideas from deterministic databases like Calvin.

For read-only transactions, Chardonnay implements snapshot isolation within epochs (10ms intervals), enabling contention-free queries. A transaction can get a consistent snapshot as of the beginning of the current epoch ec by ensuring it observes the effects of all committed transactions that have a lower epoch. That is realized by waiting for all the transactions with an epoch e < ec to release their write locks. Hence, the snapshot read algorithm would simply work by reading the epoch ec, then reading the appropriate key versions. It is a neat trick, no?
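
Here is a rough sketch of that snapshot-read trick (simplified, with invented names; the real system shards data and serves fast in-memory reads): read the current epoch, wait until every read-write transaction from earlier epochs has released its write locks, then return the latest version of each key from before the snapshot epoch.

```python
import time

# Simplified sketch of Chardonnay-style epoch snapshot reads (invented names).

class EpochService:
    def __init__(self):
        self.epoch = 0            # advanced roughly every 10 ms by the epoch service

    def read(self):
        return self.epoch

def snapshot_read(keys, store, epoch_service, active_txns):
    ec = epoch_service.read()
    # Wait for all read-write transactions with epoch < ec to release their write locks,
    # so the snapshot "as of the beginning of ec" is complete.
    while any(t.epoch < ec and t.holds_write_locks for t in active_txns):
        time.sleep(0.001)
    result = {}
    for k in keys:
        # store[k] is a list of (epoch, value) versions; read the latest version below ec.
        versions = [(e, v) for e, v in store[k] if e < ec]
        result[k] = max(versions, key=lambda ev: ev[0])[1] if versions else None
    return result
```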

This algorithm does not guarantee strict serializability, because a transaction T would not observe the effects of transactions in epoch ec that committed before T started. If desired, ensuring linearizability is easy at the cost of some latency: after T starts, wait for the epoch to advance once and then use the new epoch for reads. Another neat trick, trading off latency for efficiency/throughput.

The system has been extended to multi-datacenter deployments through Chablis (CIDR '24), which introduces global epochs for cross-datacenter consistency while maintaining local epoch efficiency.

Picking up volleyball in NYC with Goodrec and New York Urban

I was so intimidated to go at first, but it is in fact easy and fun to start playing beginner volleyball in New York. The people are so friendly and welcoming that it has been easy to keep playing consistently every week since I started for the first time this August. It's been a great workout and a great way to make friends!

The two platforms I've used to find volleyball games are Goodrec and New York Urban. While these platforms may also offer classes and leagues, I mostly use them to play "pickup" games. Pickup games are where you show up and join (or get assigned to) a team to play for an hour or two. Easy to go on your own or with friends.

I'm not an expert! My only hope with this post is that maybe it makes trying out volleyball in New York feel a little less intimidating for you!

Goodrec

With Goodrec you have to use their mobile app. Beginner tier is called "social" on Goodrec. So browse available games until you find one at the level you want to play. You enroll in (buy a place in) sessions individually.

Sessions are 90 to 120 minutes long.

They ask you not to arrive more than 10 minutes early at the gym. When you arrive you tell the gym managers (usually at a desk up front somewhere) that you're there for Goodrec and which tier (in case the gym has multiple levels of games going on at the same time). Then you wait until the Goodrec "host" arrives and they will organize everyone into teams.

Goodrec hosts are players who volunteer to organize the games. They'll explain the rules of the game (which makes Goodrec very good for beginners) and otherwise help you out.

Always say thank you to your host!

New York Urban

With New York Urban, pickup sessions are called "open play".

There is no mobile app, you just use the website to purchase a spot in a session. The sessions are longer and cheaper than Goodrec. But there is no host; players self-organize.

The options are more limited too. You play at one of four high schools on either a Friday night or on Sunday. And session slots tend to sell out much more quickly than with Goodrec.

Big City Volleyball

You can also check out Big City Volleyball but I haven't used it yet.

Volo

I haven't ever done Volo, but I think I've heard it described as a "beer league", and that even some of the beginner-tier sessions with Goodrec and New York Urban are more competitive.

But also, Volo is built around leagues so you have to get the timing right. Goodrec's and New York Urban's pickup games make it easy to get started playing any time of year.

Making friends

It was super awkward to go at first! I went by myself. I didn't know what I was doing. I couldn't remember, and didn't know, many rules. I didn't have court shoes or knee pads.

But the Goodrec host system is particularly great for bringing beginners in and making them feel welcome. You have a great time even if you're terrible.

The first game I went to, I tried to hang out afterward to meet people. But people either came with their SO or with their friends or by themselves so they all just left immediately or hung out in their group.

So you can't just go once and expect to make friends immediately. But if you keep going at the same place and time regularly week over week, you'll see familiar faces. Maybe half the people I play with each week are regulars. If you're friendly you'll start making friends with these people and eventually start going out to bars with them after the games.

Improving

Even if you find yourself embarrassingly bad at first, just keep going! I'm 29, 6'1", 190 lbs, and from my observation over the past 5 months, age, height, and weight have only a very indirect relation to playing ability.

Most of the people who play are self-taught, especially at the lower tiers I've played at. But some people played for the school team in high school or college. These people are fun to play with and you can learn a lot from them.

Most people who are self-taught seem to watch YouTube videos like Coach Donny, helpful for learning how to serve, set, block, etc. Or they take "clinics" (classes) with Goodrec or other platforms. (I have no idea about these, I've never done them before.)

At first I played 2 hours a week and I was completely exhausted after the session. Over time it got easier so I started playing 2-3 sessions a week (6-9-ish hours). With practice and consistency (after about 3-4 months), I started playing Intermediate tier with Goodrec and New York Urban. And I don't think I'll play Beginner/Social at all anymore.

I still primarily play for fun and for the workout and to meet people. But it's also fun to get better!

I played with one person much better than myself in an Intermediate session one time and he mentioned he will probably stop playing Intermediate and only play High Intermediate. He mentioned you get better when you keep pushing yourself to play with better and better players. Good advice!

December 25, 2024

Seconds Since the Epoch

This is not at all news, but it comes up often enough that I think there should be a concise explanation of the problem. People, myself included, like to say that POSIX time, also known as Unix time, is the number of seconds since the Unix epoch, which was 1970-01-01 at 00:00:00.

This is not true. Or rather, it isn’t true in the sense most people think. For example, it is presently 2024-12-25 at 18:51:26 UTC. The POSIX time is 1735152686. It has been 1735152713 seconds since the POSIX epoch. The POSIX time number is twenty-seven seconds lower.

This is because POSIX time is derived in IEEE 1003.1 from Coordinated Universal Time. The standard assumes that every day is exactly 86,400 seconds long. Specifically:

The time() function returns the value of time in seconds since the Epoch.

Which is defined as:

seconds since the Epoch. A value to be interpreted as the number of seconds between a specified time and the Epoch. A Coordinated Universal Time name (specified in terms of seconds (tm_sec), minutes (tm_min), hours (tm_hour), days since January 1 of the year (tm_yday), and calendar year minus 1900 (tm_year)) is related to a time represented as seconds since the Epoch according to the expression below.

If year < 1970 or the value is negative, the relationship is undefined. If year ≥ 1970 and the value is non-negative, the value is related to a Coordinated Universal Time name according to the expression:

tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400 + (tm_year-70) * 31536000 + ((tm_year - 69) / 4) * 86400

The length of the day is not 86,400 seconds, and in fact changes over time. To keep UTC days from drifting too far from solar days, astronomers periodically declare a leap second in UTC. Consequently, every few years POSIX time jumps backwards, wreaking utter havoc. Someday it might jump forward.
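
Plugging the example from above into the quoted expression shows the discrepancy concretely: the formula reproduces 1735152686 for 2024-12-25 18:51:26 UTC, while the count of actual elapsed SI seconds is 27 higher, because 27 leap seconds have been inserted since 1970 and the formula ignores all of them.

```python
# The POSIX expression quoted above, applied to 2024-12-25 18:51:26 UTC.
tm_sec, tm_min, tm_hour = 26, 51, 18
tm_yday = 359          # days since January 1 (2024 is a leap year; Dec 25 is the 360th day)
tm_year = 124          # calendar year minus 1900

posix_time = (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
              + (tm_year - 70) * 31536000 + ((tm_year - 69) // 4) * 86400)
print(posix_time)                    # 1735152686

leap_seconds_since_epoch = 27        # inserted by astronomers, ignored by the formula
print(posix_time + leap_seconds_since_epoch)   # 1735152713 actual elapsed seconds
```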

Archaeology

Appendix B of IEEE 1003 has a fascinating discussion of leap seconds:

The concept of leap seconds is added for precision; at the time this standard was published, 14 leap seconds had been added since January 1, 1970. These 14 seconds are ignored to provide an easy and compatible method of computing time differences.

I, too, love to ignore things to make my life easy. The standard authors knew “seconds since the epoch” were not, in fact, seconds since the epoch. And they admit as much:

Most systems’ notion of “time” is that of a continuously-increasing value, so this value should increase even during leap seconds. However, not only do most systems not keep track of leap seconds, but most systems are probably not synchronized to any standard time reference. Therefore, it is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch.

It is sufficient to require that applications be allowed to treat this time as if it represented the number of seconds between the referenced time and the Epoch. It is the responsibility of the vendor of the system, and the administrator of the system, to ensure that this value represents the number of seconds between the referenced time and the Epoch as closely as necessary for the application being run on that system….

I imagine there was some debate over this point. The appendix punts, saying that vendors and administrators must make time align “as closely as necessary”, and that “this value should increase even during leap seconds”. The latter is achievable, but the former is arguably impossible: the standard requires POSIX clocks be twenty-seven seconds off.

Consistent interpretation of seconds since the Epoch can be critical to certain types of distributed applications that rely on such timestamps to synchronize events. The accrual of leap seconds in a time standard is not predictable. The number of leap seconds since the Epoch will likely increase. The standard is more concerned about the synchronization of time between applications of astronomically short duration and the Working Group expects these concerns to become more critical in the future.

In a sense, the opposite happened. Time synchronization is always off, so systems generally function (however incorrectly) when times drift a bit. But leap seconds are rare, and the linearity evoked by the phrase “seconds since the epoch” is so deeply baked in to our intuition, that software can accrue serious, unnoticed bugs. Until a few years later, one of those tiny little leap seconds takes down a big chunk of the internet.

What To Do Instead

If you just need to compute the duration between two events on one computer, use CLOCK_MONOTONIC, or better yet, CLOCK_BOOTTIME. If you don’t need to exchange timestamps with other systems that assume POSIX time, use TAI, GPS, or maybe LORAN. If you do need rough alignment with other POSIX-timestamp systems, smear leap seconds over a longer window of time. Libraries like qntm’s t-a-i can convert back and forth between POSIX and TAI.
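
For the first suggestion, measuring a duration on one machine, here is a minimal Python sketch: a monotonic clock is unaffected by leap seconds and by the wall clock being stepped, unlike subtracting two wall-clock readings.

```python
import time

def do_work():
    return sum(range(1_000_000))           # stand-in for the work being timed

# Duration between two events on one machine: use a monotonic clock, not the
# POSIX wall clock, which can be stepped backwards (NTP corrections, leap seconds).
start = time.monotonic()
do_work()
elapsed = time.monotonic() - start
print(f"took {elapsed:.6f} s")

# On Linux, CLOCK_BOOTTIME also keeps counting across system suspend (if exposed):
if hasattr(time, "CLOCK_BOOTTIME"):
    print(time.clock_gettime(time.CLOCK_BOOTTIME))
```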

There’s an ongoing effort to end leap seconds, hopefully by 2035. It’ll require additional work to build conversion tables into everything that relies on the “86,400 seconds per day” assumption, but it should also make it much simpler to ask questions like “how many seconds between these two times”. At least for times after 2035!

December 24, 2024

Helping Christmas Elves Count Presents (or: Vectorized Overflow Checking)

Vectorized Overflow Checking

In our earlier post on overflow handling we explained how to use the CPU flags to detect integer overflows and wrote: “this doesn’t have zero overhead as the compiler can’t vectorize the function using checked arithmetic.” Not satisfied, we took matters into our own hands: if the compiler can’t help us, we have to help ourselves!

This post explains in very low-level detail how you can very quickly sum up (signed) integers on modern x86 CPUs using specialized vector instructions such as vpternlogd. We also show some numbers comparing the hand-written vectorized sum with the compiler-assisted checked arithmetic that demonstrate that you can gain a lot of performance even when checking for overflows.

December 23, 2024

Tutorial: How I added GitHub and npm stat counters to TanStack.com

The tutorial "How I added GitHub and npm stat counters to TanStack.com" by Convex Champion Shawn Erquhart details how Convex automates the integration of live GitHub and npm statistics on the TanStack.com website. It showcases data fetching, database optimization, and real-time updates through APIs, web scraping, and scheduled jobs.

Use of Time in Distributed Databases (part 1)

Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes is needed to satisfy some correctness properties, but since coordination requires message communication, there is a performance tradeoff that prevents nodes from communicating/coordinating frequently.


Timestamping and ordering

Why do the nodes, in particular database nodes, need coordination anyway? The goal of coordination among nodes is to perform event ordering and align the independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is very important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages, and this information by definition comes from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages, all the while the system is moving along, advancing and changing state. This is like trying to compose a panoramic snapshot of a parade moving down the street using only small 4x6 photos shot by the participants at arbitrary times.

In 1978, Leslie Lamport captured the ordering relationship between events in a distributed system by introducing a neat abstraction called logical clocks. Logical clocks timestamp events based on causality (happened-before) which comes from either precedence within the same thread of execution at the node or via message communication received from another node.
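
As a refresher, here is a minimal Lamport clock sketch: each node increments its counter on a local event, attaches the counter to outgoing messages, and on receive takes the max of its own counter and the message's timestamp before incrementing.

```python
# Minimal Lamport logical clock sketch.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time                  # timestamp carried on the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()            # a: 1
b.receive(t)            # b: 2  (causally after a's send, so it gets a larger timestamp)
```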

Unfortunately, logical clocks are disconnected from real time (wall clocks). Logical clocks would not be able to order event2 at node2, occurring a full 24 hours after event1 at node1, if there had not been a chain of communication linking event1 to event2 with a causal precedence relationship in that period. Logical clocks ignore all back-channel communication, and require strict enforcement that all communication be logical-clock timestamped. Moreover, they don't let you query the physical time at which an event took place.

Their cousin vector clocks can get you consistent snapshots across nodes (well, with O(n) vector clock overhead), but vector clocks are also vulnerable to the same drawbacks above. To address these drawbacks, we need physical clocks and real-time affinity. To drive this point home, and learn about even more benefits synchronized clocks can give us, here is another round of motivation for the importance of time in distributed systems and databases.


What are synchronized clocks good for? 

Asymmetry of information is the curse of distributed systems. The coordinated attack problem provides a great demonstration of this. The two generals in that problem are stuck in a Zeno's paradox of "You may not know that I know that you know that I know" style information chaining. A classic paper to check out on this topic is "Knowledge and common knowledge in a distributed environment".

Synchronized clocks create a consistent global time reference, seeding the buildup of common knowledge. Having a globally consistent time reference (i.e., synchronized clocks/time) is a big boon for common knowledge. Without a common reference time, even with a same-rate local clock at each node, you lose considerable information about the other party.

Consider this example where node2 is trying to learn how much longer node1's lease will remain valid. Without synchronized time, unilateral communication from node1 to node2 does not suffice to convey this information, because the message from node1 to node2 could have been arbitrarily delayed in the channel. This makes the timer on node2 useless for coordinating between the two nodes, since by the time the message arrives the lease might have long expired. With unilateral communication, node2 can only know for sure when node1's lease has been invalidated, but cannot know whether node1 is still holding the lease. It can only reason about the absence of node1's lease, not about its presence.

Without synchronization, node2 requires two-way communication to estimate lease validity. In this scheme, node2 initiates the communication and starts its timer when it sends the message. When it receives a response back from node1 containing the remaining time on node1's lease, it needs to subtract the elapsed round-trip time (RTT) to get a lower bound on the validity of node1's lease. It cannot subtract just delta/2 from the lease timer, because the communication delay may be asymmetric.

But, if we have synchronized clocks, this information can be conveyed with just a unilateral communication from node1 to node2. Node1's message would say "I have the lease till T_end", and when node2 receives the message, this T_end information is instantly usable by node2 without suffering the RTT rounding above, thanks to the synchronized clocks. Synchronized clocks simplify coordination and reduce uncertainty in distributed systems.
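
A small sketch contrasting the two schemes (function names are mine, for illustration): without synchronized clocks node2 must round-trip and subtract the whole RTT to get a safe lower bound, whereas with synchronized clocks node1's one-way "valid until T_end" message is directly usable.

```python
import time

# Without synchronized clocks: node2 must initiate, measure the round trip, and
# conservatively subtract the full RTT (delays can be asymmetric, so delta/2 is unsafe).
def lease_lower_bound_unsynchronized(request_lease_remaining):
    t0 = time.monotonic()
    remaining = request_lease_remaining()     # two-way exchange with node1
    rtt = time.monotonic() - t0
    return remaining - rtt                    # guaranteed lower bound on lease validity

# With synchronized clocks: node1 just announces an absolute expiry time, and
# node2 compares it against its own (synchronized) wall clock on receipt.
def lease_remaining_synchronized(t_end_from_node1):
    return t_end_from_node1 - time.time()
```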


Loosely synchronized clocks

Ok, let's discuss physical clocks, and clock synchronization at last! Each node has a local clock, which is just a crystal/quartz oscillator. A circuit counts the number of times the crystal oscillates and declares that a millisecond has passed because it counted, say, 1000 oscillations. This is not precise, of course. Atomic clocks use rubidium, which has an oscillation/clock error of one microsecond a day. But quartz is not like that. It is dirt cheap and small, but also inaccurate, so it drifts too much. No two quartz crystals are identical; they have big variations. Another thing about quartz crystals is that they are very temperature sensitive: when the temperature gets colder they oscillate more, and when it gets hotter they oscillate less. Therefore frequent time synchronization is needed when using crystal-oscillator-based clocks. Otherwise each node's clock would drift apart from the others.

In 1985, the Network Time Protocol (NTP) entered the game. NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is by far the most popular time sync distribution protocol on the Internet. However, NTP clock sync errors can amount to tens of milliseconds. Where do the biggest nondeterministic synchronization errors in NTP come from? From the network. The switches and routers are the source of nondeterministic errors for NTP. Asymmetry in the links is a big error source. Consider a 100 Mbps link feeding into a 1 Gbps link: one way there is no delay, but coming back the other way there is queuing delay.

Unfortunately, using loosely synchronized clocks via NTP, with several tens of milliseconds of uncertainty, violates causality and monotonicity under errors. This is problematic for ordering events and for our consistent snapshots across nodes. Hybrid Logical Clocks (HLC) provide a remedy, by integrating Lamport's logical time with loosely synchronized physical clocks. This makes them resilient to synchronization errors and enables progress in degraded conditions.
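
A compact sketch of the HLC update rules (simplified from the HLC paper): the logical part l never falls behind the local physical clock, and a counter c breaks ties when l cannot advance.

```python
import time

# Sketch of Hybrid Logical Clock update rules (simplified from the HLC paper).
class HLC:
    def __init__(self):
        self.l = 0      # logical clock, tracks the max physical time seen so far
        self.c = 0      # counter to break ties when l cannot advance

    def now(self):
        return int(time.time() * 1000)      # local physical clock, in ms

    def send_or_local(self):
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, l_msg, c_msg):
        pt = self.now()
        new_l = max(self.l, l_msg, pt)
        if new_l == self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```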

Of course, using tightly synchronized clocks is even better, and gets you better performance with more certainty about other nodes' states/events/orderings. Luckily we have had great progress in this area in the last decade.


The advent of tightly synchronized clocks

In 2018, I wrote this short survey on the advent of tightly synchronized clocks.

One big development has been the availability of clocks that are much cheaper than atomic clocks but almost as precise. Atomic clocks use rubidium and drift about 1 µs/day; this is so small that relativistic-effect considerations kick in. OCXO ovenized oscillators provide the next best way to have precise clocks in a box, in a much cheaper manner. They achieve a drift of ~25 µs/day. Pair these bad boys with a GPS antenna and you get to within less than 100 nanoseconds of true time.

The rest is ensuring high-quality distribution of time information and keeping sources of nondeterminism from creeping in. Precision Time Protocol (PTP) with hardware timestamping helps a lot here and gets you to ~50 µs clock uncertainty, as AWS Time Sync provides for public access.

I would be remiss if I didn't talk about the clockbound API here. Google TrueTime (TT) introduced these bounded uncertainty intervals. A call to the TT.now() API returns earliest and latest bounds that the true time could, in the worst case, fall between. This is awesome because if you ensure non-overlapping intervals on events, you achieve definite event ordering. This allows you to wait out the latest bound on time to strictly order/serialize your commit event. This also provides another level of safety for correctness of timestamp-based ordering. For ordering to go awry, both clock synchronization and the clock bound around it need to go bad concurrently. If your clock synchronization goes bad, the bounds around it would grow, and you would know something is going wrong.
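
Here is a sketch of how the interval API turns into ordering (the helper names are hypothetical, not TrueTime's or ClockBound's actual interface): pick the commit timestamp as the latest bound of now(), then wait until the earliest bound has passed that timestamp, so any event that starts afterwards is guaranteed a strictly larger timestamp.

```python
import time

# Hypothetical clock-bound interface: returns (earliest, latest) bounds on true time.
def clock_bound_now():
    err = 0.0005                      # assume a 0.5 ms error bound for illustration
    t = time.time()
    return (t - err, t + err)

def commit_wait():
    _, latest = clock_bound_now()
    commit_ts = latest                # timestamp the commit with the latest bound
    while clock_bound_now()[0] <= commit_ts:
        time.sleep(0.0001)            # wait out the uncertainty interval
    return commit_ts                  # no later-starting event can get a smaller timestamp
```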


Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (VLDB'80)

Ok, this introduction already got long. Let's finish Part 1 with a banger! This paper is from VLDB 1980, and predates even NTP. It was written with a typewriter, for God's sake. But Bernstein and Goldman's vision was ahead of its time by many decades. They assume synchronized clocks and propose Timestamp Ordering (TSO) algorithms for concurrency control in distributed database systems. The banger is that, more than four decades later, Amazon DynamoDB (ATC'23) uses a TSO-based algorithm for its one-shot transactions.

Here is how the TSO-based optimistic concurrency control algorithm works. Transactions are serialized by their start time, ts_ti. Transactions don't hold any locks, and resort to optimistic validation. Instead, every object x is tagged with the timestamp of the last transaction(s) that successfully read/wrote x. So, object x has two pieces of metadata associated with it: read_ts(x) and write_ts(x). To guarantee the serializability of transactions, it is crucial that these timestamps be monotonically increasing (hence the need for synchronized clocks!). The update protocol checks and ensures this for each access. If a transaction tries to access (read/write) an object from its future, it is aborted. For example, if ti finds that it is reading x which has write_ts(x) > ts_ti, then this would be a future read, and ti aborts itself. It may later be retried with a new/higher ts, and hopefully it will succeed then. That is it. That is the whole protocol.
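
Here is a small sketch of those per-object checks (a simplification of the TSO rule described above, with invented structure): reads and writes compare the transaction's timestamp against read_ts(x) and write_ts(x), and a transaction that would touch an object "from its future" aborts and retries with a newer timestamp.

```python
# Sketch of timestamp-ordering (TSO) checks on a single object (simplified).

class Aborted(Exception):
    pass

class Obj:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0      # timestamp of the latest transaction that read it
        self.write_ts = 0     # timestamp of the latest transaction that wrote it

def tso_read(txn_ts, x):
    if x.write_ts > txn_ts:             # x was written in this transaction's "future"
        raise Aborted("retry with a newer timestamp")
    x.read_ts = max(x.read_ts, txn_ts)
    return x.value

def tso_write(txn_ts, x, value):
    if x.read_ts > txn_ts or x.write_ts > txn_ts:   # a future transaction already saw/overwrote x
        raise Aborted("retry with a newer timestamp")
    x.write_ts = txn_ts
    x.value = value
```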

After 1980, the next splash on the use of synchronized clocks arrived at PODC 1991, with Barbara Liskov's "Practical Uses of Synchronized Clocks in Distributed Systems" paper. Interestingly, this paper does not cite the Bernstein-Goldman banger. It proposed several distributed algorithms that use synchronized clocks:

  • Leases for cache consistency and leader-local reads. Technically, these are realized only via same-rate local timers and don't need synchronized clocks. 
  • SCMP protocol for at-most-once message delivery
  • Kerberos authentication tickets

A noteworthy principle the paper advocated was to use clocks for performance, not for correctness. Notice that in the Bernstein-Goldman paper a decade earlier, the correctness of serialization of OCC transactions did not rely on time synchronization, but the performance would improve with better time synchronization.

Then came two decades of silence. The next big update on the practical use of synchronized clocks came 21 years later, in 2012, with Google Spanner.


Outline of Use of Time in Distributed Databases

So, this is what I am thinking of covering in the next posts in this series. I already have summaries of these works on my blog from the last several years. It will be nice to organize this information and draw some trends/conclusions from it. One thing is clear: we see an accelerating trend of adoption of synchronized clocks in distributed databases.

Part 2: Use of logical clocks in database design: Discusses use of vector clocks (e.g., Dynamo (SOSP'07), ORBE (SOCC'13), NAM-DB (VLDB'17)), dependency graph maintenance (COPS (SOSP'11), Kronos (EuroSys'14)), and epoch server (e.g., Chardonnay (OSDI'23), Chablis (CIDR'24)).

Part 3: Use of synchronized physical clocks in database design: Granola (ATC'12), Clock-SI (SRDS'13), GentleRain (SOCC'14), Occult (NSDI'17), Nezha (VLDB'23)

Part 4: Use of clocks in production databases: Spanner (OSDI'12), MongoDB (SIGMOD'19), CockroachDB (SIGMOD'20), DynamoDB (ATC'23), Cassandra/Accord (2023), Aurora Limitless (2023), Aurora DSQL (2024) 

Part 5: Lessons learned: Takeaways, correctness of synchronized clock-based algorithms, tradeoffs, and trends

December 22, 2024

How bloom filters made SQLite 10x faster

This is the fascinating story of how researchers used Bloom filters cleverly to make SQLite 10x faster for analytical queries. These are my five-minute notes on the paper "SQLite: Past, Present, and Future".
