a curated list of database news from authoritative sources

December 23, 2024

Use of Time in Distributed Databases (part 1)

Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes is needed to satisfy some correctness properties, but since coordination requires message communication, there is a performance tradeoff that discourages nodes from communicating/coordinating frequently.


Timestamping and ordering

Why do the nodes, in particular database nodes, need coordination anyway? The goal of coordination among nodes is to perform event ordering and align the independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is very important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages, and this information by definition comes from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages, all the while the system keeps advancing and changing state. This is like trying to compose a panoramic snapshot of a parade moving down the street using only small 4x6 photos shot by the participants at arbitrary times.

In 1978, Leslie Lamport captured the ordering relationship between events in a distributed system by introducing a neat abstraction called logical clocks. Logical clocks timestamp events based on causality (happened-before), which comes either from precedence within the same thread of execution at a node or from message communication received from another node.
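To make the happened-before bookkeeping concrete, here is a minimal Lamport clock sketch in Python (the class and method names are mine, for illustration only):

```python
class LamportClock:
    """Minimal Lamport logical clock: a counter advanced on local events and message receipt."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Every local event advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # Timestamp the outgoing message with the sender's advanced clock.
        self.time += 1
        return self.time

    def receive(self, msg_timestamp):
        # On receipt, jump ahead of the sender's timestamp to preserve happened-before.
        self.time = max(self.time, msg_timestamp) + 1
        return self.time
```

If event e1 happened-before e2, then e1's timestamp is smaller than e2's; the converse does not hold, which is exactly the limitation discussed next.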

Unfortunately, logical clocks are disconnected from real time (wall clocks). Logical clocks would not be able to order event2 at node2 with respect to event1 at node1, even if event2 occurs a full 24 hours later, unless there is a chain of communication linking event1 to event2 with a causal precedence relationship in that period. Logical clocks ignore all back-channel communication, and require strict enforcement that all communication be logical-clock timestamped. Moreover, they don't allow you to query the physical time at which an event took place.

Their cousin vector clocks can get you consistent snapshots across nodes (well, with O(n) vector clock overhead), but they suffer from the same drawbacks listed above. To address these drawbacks, we need physical clocks and real-time affinity. To drive this point home, and to see even more benefits synchronized clocks can give us, here is another round of motivation for the importance of time in distributed systems and databases.


What are synchronized clocks good for?

Asymmetry of information is the curse of distributed systems. The coordinated attack problem provides a great demonstration of this. The two generals in that problem are stuck in a Zeno's paradox of "You may not know that I know that you know that I know" information chaining. A classic paper to check out on this topic is "Knowledge and common knowledge in a distributed environment".

Synchronized clocks create a consistent global time reference, seeding the build-up of common knowledge. Having a globally consistent time reference (i.e., synchronized clocks/time) is a big boon for common knowledge. Without a common time reference, even with same-rate local clocks at each node, you lose considerable information about the other party.

Consider this example where node2 is trying to learn how much longer node1's lease will remain valid. Without synchronized time, unilateral communication from node1 to node2 does not suffice to convey this information, because the message from node1 to node2 could have been arbitrarily delayed in the channel. This makes the timer on node2 useless for coordinating between the two nodes, since by the time the message arrives the lease might have long expired. With unilateral communication, node2 can only know for sure when node1's lease has been invalidated; it cannot know whether node1 still holds the lease. It can reason about the absence of node1's lease, but not about its presence.

Without synchronization, node2 requires two-way communication to estimate lease validity. In this scheme, node2 initiates the communication and starts its timer when it sends the message. When it receives a response from node1 containing the remaining time on node1's lease, it needs to subtract the elapsed round-trip time (RTT) to get a lower bound on the validity of node1's lease. It cannot subtract just delta/2 from the lease timer, because the communication delay may be asymmetric.

But if we have synchronized clocks, this information can be conveyed with just a unilateral communication from node1 to node2. Node1's message would say "I have the lease till T_end", and when node2 receives the message, this T_end is instantly usable by node2 without suffering the RTT discounting above, thanks to the synchronized clocks. Synchronized clocks simplify coordination and reduce uncertainty in distributed systems.
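Here is a toy sketch of the two estimates node2 can compute, one per scheme (the function names and the 1 ms clock-error figure are my illustrative assumptions, not from any real system):

```python
import time

# Two-way scheme without synchronized clocks: node2 measures the round trip on its
# own timer and must discount the full RTT, because the delay split may be asymmetric.
def lease_lower_bound_rtt(send_time, recv_time, remaining_reported_by_node1):
    rtt = recv_time - send_time
    # Conservative: assume the whole RTT elapsed before node1 reported the remaining time.
    return remaining_reported_by_node1 - rtt

# One-way scheme with synchronized clocks: node1 just sends the absolute expiry time T_end.
def lease_lower_bound_synced(t_end, clock_error=0.001):
    # T_end is directly usable; only the residual clock error needs to be discounted.
    return t_end - time.time() - clock_error
```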


Loosely synchronized clocks

Ok, let's discuss physical clocks, and clock synchronization at last! Each node has a local clock, which is just a crystal/quartz oscillator. A circuit counts the number of times the crystal oscillates and declares that a millisecond has passed because it counted, say, 1000 oscillations. This is not precise of course. Atomic clocks use rubidium, which has an oscillation/clock error of about one microsecond a day. Quartz is not like that. It is dirt cheap and small, but also inaccurate, so it drifts too much. No two quartz crystals are identical; they have big variations. Quartz crystals are also very temperature sensitive: when the temperature gets colder they oscillate more, and when it gets hotter they oscillate less. Therefore, frequent time synchronization is needed when using crystal-oscillator-based clocks. Otherwise the nodes' clocks would drift apart from each other.
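For a rough sense of scale (assuming a typical commodity crystal tolerance of around 20 parts per million, a figure I am supplying for illustration): a 20 ppm frequency error means the clock gains or loses 20 µs every second, about 1.7 seconds per day, so two unsynchronized nodes can drift milliseconds apart within a minute or two.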

In 1985, the Network Time Protocol (NTP) entered the game. NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is by far the most popular time sync distribution protocol on the Internet. However, NTP clock sync errors can amount to tens of milliseconds. Where do NTP's nondeterministic synchronization errors come from? From the network: the switches and routers are the main source of nondeterminism, and asymmetry in the links is a big error source. Consider a 100 Mbps link feeding into a 1 Gbps link. One way there is no delay, but coming back the other way there is queuing delay.

Unfortunately, using loosely synchronized clocks via NTP, with several tens of milliseconds of uncertainty, violates causality and monotonicity under errors. This is problematic for ordering events and for our consistent snapshots across nodes. Hybrid Logical Clocks (HLC) provide a remedy by integrating Lamport's logical time with loosely synchronized physical clocks. This makes them resilient to synchronization errors and enables progress in degraded conditions.
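A minimal sketch of the HLC update rules, following the Kulkarni et al. algorithm (physical_now is a stand-in I am assuming for the node's NTP-synchronized clock):

```python
class HLC:
    """Hybrid Logical Clock: timestamps are (l, c) pairs, where l tracks the
    highest physical time seen and c is a logical counter that breaks ties."""

    def __init__(self, physical_now):
        self.physical_now = physical_now  # callable returning the local physical clock
        self.l = 0
        self.c = 0

    def send_or_local(self):
        l_old = self.l
        self.l = max(l_old, self.physical_now())
        # If physical time did not move forward, advance the logical counter instead.
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, l_msg, c_msg):
        l_old = self.l
        self.l = max(l_old, l_msg, self.physical_now())
        if self.l == l_old and self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        return (self.l, self.c)
```

Comparing (l, c) pairs lexicographically respects causality even when the underlying physical clocks are off by tens of milliseconds, while l stays close to physical time.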

Of course, using tightly synchronized clocks is even better, and gets you better performance with more certainty about other nodes' states/events/orderings. Luckily, there has been great progress in this area in the last decade.


The advent of tightly synchronized clocks

In 2018, I wrote this short survey on the advent of tightly synchronized clocks.

One big development has been the availability of clocks that are much cheaper than atomic clocks but almost as precise. Atomic clocks use rubidium and drift about 1 µs/day. This is so small that relativistic effect considerations kick in here. Ovenized oscillators (OCXOs) provide the next best way to have precise clocks in a box in a much cheaper manner. They achieve a drift of ~25 µs/day. Pair these bad boys with a GPS antenna and they get to within less than 100 nanoseconds of true time.

The rest is ensuring high-quality distribution of time information and keeping sources of nondeterminism from creeping in. Precision Time Protocol (PTP) with hardware timestamping helps a lot here and gets you to ~50 µs clock uncertainty, which is what AWS Time Sync provides for public access.

I would be remiss if I didn't talk about the clock-bound API here. Google TrueTime (TT) introduced these bounded uncertainty intervals. A call to the TT.now() API returns earliest and latest bounds that the true clock could, in the worst case, fall between. This is awesome because if you ensure non-overlapping intervals on events, you achieve definite event ordering. This allows you to wait out the latest bound on time to strictly order/serialize your commit event. It also provides another level of safety for the correctness of timestamp-based ordering. For ordering to go awry, both the clock synchronization and the clock bound around it need to go bad concurrently. If your clock synchronization goes bad, the bounds around it would grow, and you would know something is going wrong.
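A minimal sketch of how such an interval can be used to wait out uncertainty before exposing a commit (now_bounds and the 0.5 ms error figure are hypothetical stand-ins for a TrueTime/ClockBound-style call, not any real API):

```python
import time

def now_bounds(assumed_error=0.0005):
    # Hypothetical stand-in for TT.now() / ClockBound: returns (earliest, latest)
    # such that true time is guaranteed to lie inside the interval.
    t = time.time()
    return (t - assumed_error, t + assumed_error)

def commit_wait(commit_ts):
    # Wait until every correctly synchronized clock has surely passed commit_ts,
    # so any transaction that starts afterwards gets a strictly larger timestamp.
    while now_bounds()[0] <= commit_ts:
        time.sleep(0.0001)

# Usage: take the latest bound as the commit timestamp, then wait it out.
commit_ts = now_bounds()[1]
commit_wait(commit_ts)
```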


Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (VLDB'80)

Ok, this introduction already got long. Let's finish this Part 1 with a banger! This paper is from VLDB 1980, and predates even NTP. It is written with a typewriter, for God's sake. But Bernstein and Goodman's vision was ahead of its time by many decades. They assume synchronized clocks and propose Timestamp Ordering (TSO) algorithms for concurrency control in distributed database systems. The banger is that more than four decades later, Amazon DynamoDB (ATC'23) uses a TSO-based algorithm for its one-shot transactions.

Here is how the TSO-based optimistic concurrency control algorithm works. Transactions are serialized by their start time T_si. Transactions don't hold any locks, and resort to optimistic validation. Instead, every object x is tagged with the timestamp of the last transaction(s) that successfully read/wrote x. So, object x has two pieces of metadata associated with it: read_ts(x) and write_ts(x). To guarantee the serializability of transactions, it is crucial that these timestamps be monotonically increasing (hence the need for synchronized clocks!). The protocol checks and ensures this on each access. If a transaction tries to access (read/write) an object from its future, it is aborted. For example, if ti finds that it is reading x with write_ts(x) > ts_ti, then this would be a future read, and ti aborts itself. It may later be retried with a new/higher ts, and hopefully it will succeed then. That is it. That is the whole protocol.
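A toy sketch of these per-object checks (a simplified single-node model with names of my own choosing; the paper's setting is of course distributed):

```python
class Aborted(Exception):
    """Raised when a transaction touches an object 'from its future'."""

class TSOObject:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0   # timestamp of the latest transaction that read this object
        self.write_ts = 0  # timestamp of the latest transaction that wrote this object

    def read(self, txn_ts):
        # Reading a version written by a later-timestamped transaction would be a
        # future read; abort and retry with a fresh, higher timestamp.
        if txn_ts < self.write_ts:
            raise Aborted("future read")
        self.read_ts = max(self.read_ts, txn_ts)
        return self.value

    def write(self, txn_ts, value):
        # A write must not invalidate a read or write already done at a later timestamp.
        if txn_ts < self.read_ts or txn_ts < self.write_ts:
            raise Aborted("future write")
        self.write_ts = txn_ts
        self.value = value
```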

After 1980, the next splash on the use of synchronized clocks arrived at PODC 1991, with Barbara Liskov's "Practical Uses of Synchronized Clocks in Distributed Systems" paper. Interestingly, this paper does not cite Bernstein and Goodman's groundbreaking banger of a paper. It proposed several distributed algorithms that use synchronized clocks:

  • Leases for cache consistency and leader-local reads. Technically, these can be realized with same-rate local timers alone and don't need synchronized clocks.
  • SCMP protocol for at-most-once message delivery
  • Kerberos authentication tickets

A noteworthy principle the paper advocated was to use clocks for performance, not for correctness. Notice that in the Bernstein-Goodman paper a decade earlier, the correctness of serialization of OCC transactions did not rely on time synchronization, but the performance would improve with better time synchronization.

Then comes two decades of silence. The next big update on the practical use of synchronized clocks came 21 years later in 2012 with Google Spanner.


Outline of Use of Time in Distributed Databases

So, this is what I am thinking of covering in the next posts in this series. I already have summaries of these works on my blog from the last several years. It will be nice to organize this information and draw some trends/conclusions from these works. One thing is clear: we see an accelerating trend of adoption of synchronized clocks in distributed databases.

Part 2: Use of logical clocks in database design: Discusses use of vector clocks (e.g., Dynamo (SOSP'07), ORBE (SOCC'13), NAM-DB (VLDB'17)), dependency graph maintenance (COPS (SOSP'11), Kronos (EuroSys'14)), and epoch server (e.g., Chardonnay (OSDI'23), Chablis (CIDR'24)).

Part 3: Use of synchronized physical clocks in database design: Granola (ATC'12), Clock-SI (SRDS'13), GentleRain (SOCC'14), Occult (NSDI'17), Nezha (VLDB'23)

Part 4: Use of clocks in production databases: Spanner (OSDI'12), MongoDB (SIGMOD'19), CockroachDB (SIGMOD'20), DynamoDB (ATC'23), Cassandra/Accord (2023), Aurora Limitless (2023), Aurora DSQL (2024) 

Part 5: Lessons learned: Takeaways, correctness of synchronized clock-based algorithms, tradeoffs, and trends

December 22, 2024

How bloom filters made SQLite 10x faster

This is the fascinating story of how researchers used Bloom filters cleverly to make SQLite 10x faster for analytical queries. These are my five-minute notes on the paper SQLite: Past, Present, and Future.

December 16, 2024

Utilizing highly synchronized clocks in distributed databases

This master's thesis from Lund University, Sweden, explores how CockroachDB's transactional performance can be improved by using tightly synchronized clocks. The thesis addresses two questions: how to integrate high-precision clock synchronization into CockroachDB, and the resulting impact on performance. Given publicly available clock synchronization technologies like Amazon Time Sync Service and ClockBound, the researchers (Jacob and Fabian) argue that the traditional skepticism around the reliability of clocks in distributed systems is outdated.


CockroachDB vs Spanner approaches

CockroachDB uses loosely synchronized NTP clocks, and achieves linearizability by using Hybrid Logical Clocks (HLC) and relying on a static maximum offset (max_offset = 500 ms) to account for clock skew. However, this approach has limitations, particularly in handling transactional conflicts within a predefined uncertainty interval. When a transaction reads a value with a timestamp falling within this interval, it triggers an uncertainty restart to ensure consistency. These restarts, categorized as necessary (caused by actual clock skew) or unnecessary (caused by latency), increase transaction latency. By using tight clock synchronization and reducing uncertainty intervals, it is possible to achieve performance gains.
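As a toy illustration of the restart rule (my own simplified model, not CockroachDB's actual code):

```python
MAX_OFFSET = 0.5  # CockroachDB's default static bound: 500 ms

class UncertaintyRestart(Exception):
    """Signals that the read must retry at a higher timestamp."""

def check_read(read_ts, value_ts, max_offset=MAX_OFFSET):
    # A value with a timestamp inside (read_ts, read_ts + max_offset] may in fact
    # have been written before the read in real time, so the read cannot safely
    # ignore it and must restart above value_ts.
    if read_ts < value_ts <= read_ts + max_offset:
        raise UncertaintyRestart(f"restart at or above {value_ts}")
    # Otherwise: value_ts <= read_ts means the value is visible to this read,
    # and a value_ts beyond the interval is a genuinely concurrent/future write.
    return value_ts <= read_ts
```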

To investigate this further, the paper compares CockroachDB's approach with Google's Spanner, which uses the TrueTime API. TrueTime provides bounded timestamps that guarantee the actual time falls within an interval. This enables Spanner to ensure linearizability using commit-wait, which delays transaction commits until timestamp bounds guarantee consistency. That is, a transaction must wait until tmax of the transaction’s timestamp has passed before committing and releasing its locks. As tmax provides an upper bound for timestamps generated across all nodes, it is guaranteed that the changes made by the transaction will be visible across the entire cluster, achieving linearizability.

To adopt tightly synchronized clocks for CockroachDB, the paper considers two main approaches:

  1. Adopting a commit-wait model: This approach, inspired by Google Spanner, involves waiting for a certain period after a transaction acquires all necessary locks to ensure that its changes are visible across the entire cluster. However, the implementation complexity was deemed too significant for this project, so this approach was not pursued.
  2. Dynamically adjusting uncertainty intervals: This approach focuses on dynamically reducing the size of uncertainty intervals by leveraging bounded timestamps provided by high-precision clock synchronization services, like AWS Time Sync. By dynamically adjusting these intervals based on the actual clock skew, the number of unnecessary transaction restarts can be significantly reduced, leading to improved performance (see the sketch after this list).
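A minimal sketch of the second idea, reusing the hypothetical now_bounds() stand-in from earlier: derive the uncertainty window from the live clock-bound error instead of the static max_offset.

```python
def dynamic_uncertainty_interval(now_bounds):
    # now_bounds() returns (earliest, latest) bracketing true time.
    earliest, latest = now_bounds()
    dynamic_offset = latest - earliest      # actual worst-case skew right now (e.g. ~1 ms)
    read_ts = latest                        # timestamp assigned to the read
    # Only values in (read_ts, read_ts + dynamic_offset] can still trigger a restart,
    # instead of the full 500 ms window under the static configuration.
    return read_ts, read_ts + dynamic_offset
```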


TrueClock and ClockBound   

The authors implemented the second approach by modifying CockroachDB to utilize a TrueTime-like API and ClockBound. They ran into practical challenges during implementation. Initial tests revealed that retrieving bounded timestamps from ClockBound introduced significant latency (50 microseconds) compared to standard system clock readings (60 nanoseconds). To address this, they ported the open-source ClockBound daemon to a Go library, called TrueClock, which allowed them to include it directly in CockroachDB, which is also written in Go. This in turn removed the need for a datagram request for each clock reading, reducing clock reading latency from 50 microseconds to 250 nanoseconds, a negligible overhead for database operations.


Evaluation

To test their hypothesis and evaluate the transactional performance gains, the researchers modified CockroachDB to replace the static max_offset with dynamic values calculated based on real-time clockbound synchronization precision. They deployed a CockroachDB cluster consisting of three replicas across three availability zones in the eu-north-1 (Stockholm) region. As predicted, this change significantly reduced the uncertainty interval size, improving the database's ability to process transactions efficiently. The uncertainty intervals shrank by a factor of over 500 compared to the standard configuration, with maximum bounds of 1.4 milliseconds versus the default 500 milliseconds.

Experiments showed significant performance improvements, with read latency reduced by up to 47% and write latency by up to 43%. The authors attribute these gains partly to reduced latch contention. In CockroachDB, latches are used to serialize operations, even for the read operations. Shorter uncertainty intervals allow latches to release sooner, reducing contention and the time writes must wait for reads to complete. Interestingly, although only reads are directly affected by uncertainty restarts, reducing their latency also indirectly benefited write performance due to lower contention.


Conclusion

This work emphasizes the importance of embracing high-precision clocks as they become increasingly accessible in production environments. By dynamically adapting to real-time synchronization precision in place of static/pessimistic assumptions about clock skew, the experiments showed improved performance even with the latches for reads still being in place.

This highlights the potential of integrating modern clock synchronization methods into distributed databases. The results suggest that tighter synchronization not only improves transactional throughput but also offers a feasible path to achieving stronger consistency models like linearizability without significant overhead. The guiding principle should be to use clock synchronization for performance, and not for correctness. 



December 14, 2024

In search of a faster SQLite

Researchers at the University of Helsinki and Cambridge attempted to build a faster SQLite using modern programming paradigms like io_uring and disaggregated storage. They demonstrate up to a 100x reduction in tail latency. These are my notes.

December 13, 2024

What are important data systems problems, ignored by research?

In November, I had the pleasure of attending the Dutch-Belgian DataBase Day, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some papers that can serve as motivation and starting points for research in these areas.

One significant yet understudied problem raised by multiple panelists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are ubiquitous. For instance, Amazon Redshift recently reported that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings make up a substantial majority of real-world data. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, surprisingly little research has been dedicated to efficient database-specific string compression. Given the impact of strings on real-world query performance and storage consumption, it is surprising how little research there is on the topic (there are some exceptions).

Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks.

While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.

While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as network connection handling (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though communication bottlenecks frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.

One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.