a curated list of database news from authoritative sources

December 23, 2024

Use of Time in Distributed Databases (part 1)

Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes is needed to satisfy correctness properties, but since coordination requires message communication, there is a performance tradeoff that keeps nodes from communicating/coordinating too frequently.


Timestamping and ordering

Why do nodes, in particular database nodes, need coordination anyway? The goal of coordination among nodes is to order events and to align the independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is so important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages, and this information by definition comes from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages, all the while the system keeps moving along, advancing and changing state. This is like trying to compose a panoramic snapshot of a parade moving down the street using only small 4x6 photos shot by the participants at arbitrary times.

In 1978, Leslie Lamport captured the ordering relationship between events in a distributed system by introducing a neat abstraction called logical clocks. Logical clocks timestamp events based on causality (the happened-before relation), which comes either from precedence within the same thread of execution at a node or from message communication received from another node.
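
To make the happened-before bookkeeping concrete, here is a minimal Lamport clock sketch in Go; the type and method names are illustrative, not from any particular system. The counter ticks on local events and jumps past the sender's timestamp on message receipt.

```go
package main

import "fmt"

// LamportClock is a minimal sketch of Lamport's logical clock rules.
// Names and structure are illustrative.
type LamportClock struct {
	counter uint64
}

// Tick advances the clock before a local event.
func (c *LamportClock) Tick() uint64 {
	c.counter++
	return c.counter
}

// Send timestamps an outgoing message with a freshly ticked clock value.
func (c *LamportClock) Send() uint64 {
	return c.Tick()
}

// Receive merges the timestamp carried by an incoming message: the clock
// jumps past the sender's value, preserving happened-before.
func (c *LamportClock) Receive(msgTS uint64) uint64 {
	if msgTS > c.counter {
		c.counter = msgTS
	}
	c.counter++
	return c.counter
}

func main() {
	a, b := &LamportClock{}, &LamportClock{}
	ts := a.Send()             // event on node A
	fmt.Println(b.Receive(ts)) // causally later event on node B
}
```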

Unfortunately, logical clocks are disconnected from real time (wall clocks). Logical clocks cannot order event2 at node2 with respect to event1 at node1, even if event2 occurs a full 24 hours later, unless a chain of communication with a causal precedence relationship links event1 to event2 in that period. Logical clocks ignore all back-channel communication, and require strict enforcement that all communication be logical-clock timestamped. Moreover, they don't let you query the physical time at which an event took place.

Their cousin, vector clocks, can get you consistent snapshots across nodes (albeit with O(n) vector clock overhead), but vector clocks are vulnerable to the same drawbacks. To address these drawbacks, we need physical clocks and real-time affinity. To drive this point home, and to see even more benefits synchronized clocks can give us, here is another round of motivation for the importance of time in distributed systems and databases.
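
For completeness, here is a minimal vector clock sketch in Go, again with illustrative names. Note that every timestamp carries one counter per node, which is the O(n) overhead mentioned above.

```go
package main

import "fmt"

// VectorClock maps node IDs to counters; one entry per node is the O(n) cost.
type VectorClock map[string]uint64

// Tick records a local event at node id.
func (vc VectorClock) Tick(id string) { vc[id]++ }

// Merge folds in a received clock by taking the element-wise maximum,
// then ticks the local entry.
func (vc VectorClock) Merge(id string, other VectorClock) {
	for node, ts := range other {
		if ts > vc[node] {
			vc[node] = ts
		}
	}
	vc.Tick(id)
}

// HappenedBefore reports whether a causally precedes b: a is element-wise
// less than or equal to b, and strictly less in at least one entry.
func HappenedBefore(a, b VectorClock) bool {
	strictly := false
	for node, ts := range a {
		if ts > b[node] {
			return false
		}
		if ts < b[node] {
			strictly = true
		}
	}
	for node := range b {
		if _, ok := a[node]; !ok && b[node] > 0 {
			strictly = true // entries missing from a count as zero
		}
	}
	return strictly
}

func main() {
	a := VectorClock{"n1": 1}
	b := VectorClock{"n1": 1, "n2": 2}
	fmt.Println(HappenedBefore(a, b)) // true
}
```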


What are synchronized clocks good for?

Asymmetry of information is the curse of distributed systems. The coordinated attack problem provides a great demonstration of this. The two generals in that problem are stuck in a Zeno's paradox of "you may not know that I know that you know that I know" information chaining. A classic paper to check out on this topic is "Knowledge and common knowledge in a distributed environment".

Synchronized clocks create a consistent global time reference, seeding the buildup of common knowledge. Having a globally consistent time reference (i.e., synchronized clocks/time) is a big boon for common knowledge. Without a common reference time, even with same-rate local clocks at each node, you lose considerable information about the other party.

Consider this example, where node2 is trying to learn how much longer node1's lease will remain valid. Without synchronized time, unilateral communication from node1 to node2 does not suffice to convey this information, because the message from node1 to node2 could have been arbitrarily delayed in the channel. This makes node2's timer useless for coordinating between the two nodes, since by the time the message arrives the lease might have long expired. With unilateral communication, node2 can only know for sure when node1's lease has been invalidated; it cannot know whether node1 is still holding the lease. It can reason about the absence of node1's lease, but not about its presence.

Without synchronization, node2 requires two-way communication to estimate lease validity. In this scheme, node2 initiates the communication and starts its timer when it sends the message. When it receives a response from node1 containing the remaining time on node1's lease, it needs to subtract the elapsed round-trip time (RTT) to get a lower bound on the validity of node1's lease. It cannot subtract just RTT/2 from the lease timer, because the communication delay may be asymmetric.

But if we have synchronized clocks, this information can be conveyed with just a unilateral communication from node1 to node2. Node1's message would say "I have the lease till T_end", and when node2 receives the message, this T_end is instantly usable by node2, without suffering the RTT discounting above, thanks to the synchronized clocks. Synchronized clocks simplify coordination and reduce uncertainty in distributed systems.
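
Here is a small sketch in Go contrasting the two estimates; the function names and numbers are made up for illustration. Without synchronized clocks, node2 must discount the full RTT to get a safe lower bound; with synchronized clocks, the announced T_end is directly comparable against node2's own clock.

```go
package main

import (
	"fmt"
	"time"
)

// Without synchronized clocks: node2 asks node1 for the remaining lease time,
// measures the round trip locally, and must discount the full RTT (not RTT/2),
// because the delay may have been entirely on the return path.
func leaseLowerBoundUnsynced(remainingReported, rtt time.Duration) time.Duration {
	lb := remainingReported - rtt
	if lb < 0 {
		return 0 // the lease may already have expired
	}
	return lb
}

// With synchronized clocks: node1 announces "valid until tEnd" in a one-way
// message, and node2 compares it against its own (synchronized) clock.
func leaseRemainingSynced(tEnd, now time.Time) time.Duration {
	if remaining := tEnd.Sub(now); remaining > 0 {
		return remaining
	}
	return 0
}

func main() {
	fmt.Println(leaseLowerBoundUnsynced(800*time.Millisecond, 30*time.Millisecond))
	fmt.Println(leaseRemainingSynced(time.Now().Add(800*time.Millisecond), time.Now()))
}
```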


Loosely synchronized clocks

Ok, let's discuss physical clocks, and clock synchronization at last! Each node has a local clock, which is just a crystal/quartz oscillator. A circuit counts the number of times the crystal oscillates and declares that a millisecond has passed because it counted, say, 1000 oscillations. This is not precise, of course. Atomic clocks use rubidium, which has an oscillation/clock error of about one microsecond a day. Quartz is not like that. It is dirt cheap and small, but also inaccurate, so it drifts too much. No two quartz crystals are identical; they have big variations. Quartz crystals are also very temperature sensitive: when the temperature gets colder they oscillate more, and when it gets hotter they oscillate less. Therefore, frequent time synchronization is needed when using crystal-oscillator-based clocks. Otherwise, the nodes' clocks would drift apart from each other.

In 1985, the Network Time Protocol (NTP) entered the game. NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is by far the most popular time-sync distribution protocol on the Internet. However, NTP clock sync errors can amount to tens of milliseconds. Where do NTP's nondeterministic synchronization errors come from? From the network: the switches and routers along the path. Asymmetry in the links is a big error source. Consider a 100 Mbps link feeding into a 1 Gbps link: one way there is no delay, but coming back the other way there is queuing delay.

Unfortunately, using loosely synchronized clocks via NTP, with tens of milliseconds of uncertainty, violates causality and monotonicity under errors. This is problematic for ordering events and for our consistent snapshots across nodes. Hybrid Logical Clocks (HLC) provide a remedy by integrating Lamport's logical time with loosely synchronized physical clocks. This makes them resilient to synchronization errors and enables progress in degraded conditions.
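
As a concrete illustration, here is a sketch of the HLC update rules in Go (field and method names are illustrative): each timestamp pairs the largest physical time observed with a logical counter that breaks ties when the physical clock does not advance.

```go
package main

import (
	"fmt"
	"time"
)

// HLC sketches a Hybrid Logical Clock timestamp: l tracks the maximum
// physical time observed, c is a logical counter for tie-breaking.
type HLC struct {
	l int64 // max physical time observed (unix nanos)
	c int64 // logical counter
}

func physNow() int64 { return time.Now().UnixNano() }

// Now is used for local and send events.
func (h *HLC) Now() (int64, int64) {
	prev := h.l
	h.l = max64(prev, physNow())
	if h.l == prev {
		h.c++ // physical clock did not advance; bump the logical counter
	} else {
		h.c = 0
	}
	return h.l, h.c
}

// Update merges the timestamp (ml, mc) carried on a received message.
func (h *HLC) Update(ml, mc int64) (int64, int64) {
	prev := h.l
	h.l = max64(max64(prev, ml), physNow())
	switch {
	case h.l == prev && h.l == ml:
		h.c = max64(h.c, mc) + 1
	case h.l == prev:
		h.c++
	case h.l == ml:
		h.c = mc + 1
	default:
		h.c = 0
	}
	return h.l, h.c
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	var h HLC
	fmt.Println(h.Now())
	fmt.Println(h.Update(physNow()+1_000_000, 3)) // message from a slightly-ahead node
}
```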

Of course, using tightly synchronized clocks is even better, and gets you better performance with more certainty about other nodes' states/events/orderings. Luckily, we have had great progress in this area over the last decade.


The advent of tightly synchronized clocks

In 2018, I wrote this short survey on the advent of tightly synchronized clocks.

One big development has been the availability of clocks that are much cheaper than atomic clocks but almost as precise. Atomic clocks use rubidium and drift about 1 µs/day; this is so small that relativistic effects start to matter. OCXOs (oven-controlled crystal oscillators) provide the next best way to have precise clocks in a box in a much cheaper manner. They achieve a drift of ~25 µs/day. Pair these bad boys with a GPS antenna and they get to within 100 nanoseconds of true time.

The rest is ensuring high-quality distribution of time information and keeping sources of nondeterminism from creeping in. Precision Time Protocol (PTP) with hardware timestamping helps a lot here and gets you to ~50 µs clock uncertainty, which is what AWS Time Sync provides for public access.

I would be remiss if I didn't talk about the clock-bound API here. Google TrueTime (TT) introduced these bounded uncertainty intervals. A call to the TT.now() API returns earliest and latest bounds that the true time could, in the worst case, fall between. This is awesome, because if you ensure non-overlapping intervals on events, you achieve definite event ordering. This allows you to wait out the latest bound on time to strictly order/serialize your commit event. It also provides another level of safety for the correctness of timestamp-based ordering: for ordering to go awry, both the clock synchronization and the clock-bound around it need to go bad concurrently. If your clock synchronization goes bad, the bounds around it grow, and you would know something is going wrong.
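
Here is a rough sketch in Go of how a clock-bound API enables this kind of commit-wait; the Bound type and nowBound function are stand-ins for TrueTime/ClockBound, and the numbers are arbitrary. The commit timestamp is taken as the latest bound, and the commit is held until a fresh reading guarantees that timestamp is in the past everywhere.

```go
package main

import (
	"fmt"
	"time"
)

// Bound is a TrueTime/ClockBound-style uncertainty interval: the true time
// is guaranteed to lie within [Earliest, Latest]. nowBound is a stand-in;
// a real system would query TrueTime or a ClockBound daemon.
type Bound struct {
	Earliest, Latest time.Time
}

func nowBound(uncertainty time.Duration) Bound {
	n := time.Now()
	return Bound{Earliest: n.Add(-uncertainty), Latest: n.Add(uncertainty)}
}

// commitWait picks the commit timestamp as the latest bound at commit time,
// then waits until a fresh reading's Earliest has passed it. After the wait,
// any later transaction anywhere will see a strictly larger timestamp.
func commitWait(uncertainty time.Duration) time.Time {
	commitTS := nowBound(uncertainty).Latest
	for nowBound(uncertainty).Earliest.Before(commitTS) {
		time.Sleep(100 * time.Microsecond)
	}
	return commitTS
}

func main() {
	fmt.Println("committed at", commitWait(1*time.Millisecond))
}
```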


Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (VLDB'80)

Ok, this introduction already got long. Let's finish Part 1 with a banger! This paper is from VLDB 1980, and predates even NTP. It is written with a typewriter, for God's sake. But Bernstein and Goodman's vision was ahead of its time by many decades. They assume synchronized clocks and propose Timestamp Ordering (TSO) algorithms for concurrency control in distributed database systems. The banger is that more than four decades later, Amazon DynamoDB (ATC'23) uses a TSO-based algorithm for its one-shot transactions.

Here is how the TSO-based optimistic concurrency control algorithm works. Transactions are serialized by their start timestamps T_si. Transactions don't hold any locks and resort to optimistic validation. Instead, every object x is tagged with the timestamps of the last transactions that successfully read/wrote x. So, object x has two pieces of metadata associated with it: read_ts(x) and write_ts(x). To guarantee the serializability of transactions, it is crucial that these timestamps be monotonically increasing (hence the need for synchronized clocks!). The protocol checks and ensures this for each access. If a transaction tries to access (read/write) an object from its future, it is aborted. For example, if ti finds that it is reading x with write_ts(x) > ts_ti, then this would be a read from the future, and ti aborts itself. It may later be retried with a new/higher timestamp, and hopefully it will succeed then. That is it. That is the whole protocol.
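
A minimal sketch of these checks in Go, with illustrative names (this is not the paper's notation or DynamoDB's code): reads abort if a higher-timestamped write has already landed, and writes abort if a higher-timestamped read or write has.

```go
package main

import "fmt"

// objMeta carries the per-object timestamps used by timestamp ordering (TSO):
// the largest transaction timestamps that successfully read or wrote it.
type objMeta struct {
	readTS, writeTS int64
	value           string
}

// errFutureAccess signals that the transaction touched an object already
// accessed by a "future" transaction and must abort (retry with a higher ts).
var errFutureAccess = fmt.Errorf("accessing object from the future: abort and retry")

// read succeeds only if no transaction with a larger timestamp has written x.
func read(x *objMeta, ts int64) (string, error) {
	if x.writeTS > ts {
		return "", errFutureAccess
	}
	if ts > x.readTS {
		x.readTS = ts
	}
	return x.value, nil
}

// write succeeds only if no transaction with a larger timestamp has read or written x.
func write(x *objMeta, ts int64, v string) error {
	if x.readTS > ts || x.writeTS > ts {
		return errFutureAccess
	}
	x.value, x.writeTS = v, ts
	return nil
}

func main() {
	x := &objMeta{}
	fmt.Println(write(x, 10, "a")) // ok
	fmt.Println(read(x, 5))        // aborts: x was written by a future transaction
}
```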

After 1980, the next splash on the use of synchronized clocks arrived at PODC 1991, with Barbara Liskov's "Practical Uses of Synchronized Clocks in Distributed Systems" paper. Interestingly, this paper does not cite Bernstein and Goodman's groundbreaking paper. It proposed several distributed algorithms that use synchronized clocks:

  • Leases for cache consistency and leader-local reads. Technically, these are realized only via same-rate local timers and don't need synchronized clocks. 
  • SCMP protocol for at-most-once message delivery
  • Kerberos authentication tickets

A noteworthy principle the paper advocated was to use clocks for performance, not correctness. Notice that in the Bernstein-Goodman paper a decade earlier, the correctness of serializing OCC transactions did not rely on time synchronization, but performance improved with better time synchronization.

Then came two decades of silence. The next big update on the practical use of synchronized clocks came 21 years later, in 2012, with Google Spanner.


Outline of Use of Time in Distributed Databases

So, this is what I am thinking of covering in the next posts in this series. I already have summaries of these works in my blog from the last several years. It will be nice to organize this information and draw some trends/conclusions from it. One thing is clear: we see an accelerating trend of adoption of synchronized clocks in distributed databases.

Part 2: Use of logical clocks in database design: Discusses use of vector clocks (e.g., Dynamo (SOSP'07), ORBE (SOCC'13), NAM-DB (VLDB'17)), dependency graph maintenance (COPS (SOSP'11), Kronos (EuroSys'14)), and epoch servers (e.g., Chardonnay (OSDI'23), Chablis (CIDR'24)).

Part 3: Use of synchronized physical clocks in database design: Granola (ATC'12), Clock-SI (SRDS'13), GentleRain (SOCC'14), Occult (NSDI'17), Nezha (VLDB'23)

Part 4: Use of clocks in production databases: Spanner (OSDI'12), MongoDB (SIGMOD'19), CockroachDB (SIGMOD'20), DynamoDB (ATC'23), Cassandra/Accord (2023), Aurora Limitless (2023), Aurora DSQL (2024) 

Part 5: Lessons learned: Takeaways, correctness of synchronized clock-based algorithms, tradeoffs, and trends

December 22, 2024

How bloom filters made SQLite 10x faster

This is the fascinating story of how researchers used Bloom filters cleverly to make SQLite 10x faster for analytical queries. These are my five-minute notes on the paper "SQLite: Past, Present, and Future".

December 18, 2024

Stop Nesting Database Systems

The idea of embedding fast data processing engines into other systems to speed up analytics is not new. In fact, DuckDB started out as SQL analytics on Pandas. Meta had a similar vision for its Velox project, which takes a slightly different approach by building an embeddable execution engine distributed as a C++ library. If you’re considering building your own analytics engine, embedding an existing engine is almost always a better choice. Don’t roll your own crypto analytics engine.

Recently, however, there has been a trend to embed database systems in other database systems for various reasons, which I find surprising and worrying. It started with pg_duckdb, and now there are omni-sqlite, pg_clickhouse, and, just this week, a PlanetScale to MotherDuck pipeline over pg_duckdb. With this pipeline, you end up nesting four systems: PlanetScale packs PostgreSQL, which packs the MotherDuck-built extension pg_duckdb, which packs DuckDB and MotherDuck.

Solutions like pg_duckdb and pg_clickhouse promise to solve your PostgreSQL analytics issues with HTAP-like capabilities. They allow you to run transactions in PostgreSQL and claim to seamlessly hand over analytical workloads to Clickhouse and DuckDB with a sleek interface directly from PostgreSQL. But only at first glance. To me, this approach is similar to opening a random closet and throwing all your clutter in there to hide it before guests arrive. Does it look clean? Sure! But does it solve your issues? Probably not.

To find out if these scientists were too preoccupied with whether they could, let’s take a look at how these extensions work and what they can and can’t do for you.

How the Nesting of Database Systems Works

We’ll use pg_duckdb as an example because it is the oldest and most popular of the bunch. However, the points discussed here are not specific to pg_duckdb, but rather to the concept of nesting database systems in general.

Among the available solutions, pg_duckdb has the greatest potential to speed up analytics within a transactional host like PostgreSQL. pg_duckdb has several features that enable elegant, nearly seamless integration.

  • DuckDB was designed from the beginning to be embedded within other processes.
  • DuckDB supports the PostgreSQL SQL dialect.
  • DuckDB has an excellent query planner, joins, and analytical algorithms.
  • Your queries run within the same transaction on transactionally consistent data within the bounds of PostgreSQL.

With the extension enabled, pg_duckdb controls query execution for all queries you send to PostgreSQL. It knows what it can handle itself and delegates the rest, mainly scanning the data on disk, to PostgreSQL. All join and aggregation processing then takes place in DuckDB before the result is handed back to PostgreSQL. The diagram below provides a rough overview and illustrates some optimizations that we will discuss below.

What They Can Do Better than the Outer Engine

Analytical query processing has certainly come a long way since PostgreSQL was designed. Consequently, using an analytical system provides a wealth of functionality not available in PostgreSQL. Unlike PostgreSQL, DuckDB’s optimizer was designed for complex analytical workloads, so it can find much more efficient plans. This allows DuckDB to fully parallelize queries and utilize system resources much more effectively. While PostgreSQL assumes memory is scarce, analytical engines can productively saturate powerful servers. This has the potential to reduce execution times significantly. You also gain access to features of the embedded engine that are not available in PostgreSQL. For example, you can query remote Parquet files and combine them with local subresults.

Where They Are On-Par

Of course, nesting another engine does not magically solve all performance issues. There are parts where the embedded engine can only be as fast as the outer system allows. In order to run queries on data managed by the host system, the embedded engine must retrieve the data from the host. Retrieving this data requires a regular scan, which PostgreSQL must perform. pg_duckdb employs additional optimizations to reduce the impact of this process. It pushes filters to PostgreSQL scans where possible and runs scans in parallel. However, it still depends on PostgreSQL to fetch rows from disk using its buffer manager. It has no access to columnar layout or advanced techniques such as scan pruning. Therefore, while the upper parts of the query plan can leverage DuckDB’s full performance, all scans are expected to run at PostgreSQL speeds.

Where They Hold You Back

Creating a Frankenstein system comes at a cost. For one thing, you increase the complexity of your stack: rather than running PostgreSQL in isolation, you now have to run two engines that weren’t designed to work together. There are other limitations as well.

While you can perform many analytical tasks, some things fundamentally do not work. The most obvious issue is that you cannot mix and match functionality. For instance, you can’t run DuckDB plans that require PostgreSQL functions in the query plan. However, this is not only a surface-level issue of available functions. What if the systems have different sort orders or comparisons, e.g., through collations? Suddenly, the results could differ depending on which system executes the query, and you would not know.

The last downside is the missing upsides in the current state of these solutions. While there are obvious theoretical opportunities for performance improvements, neither pg_duckdb nor pg_clickhouse can reliably outperform PostgreSQL on TPC-H in their current forms.1 So, you need to carefully decide which queries to use them for, as well as when sticking to PostgreSQL is the better choice.

Nesting Systems vs ETL

Although nesting systems have their drawbacks, there are definitely use cases where they are preferable to typical ETL setups.

First, if you need to perform analytics, they must occur somewhere. If you don’t want to adopt a full HTAP engine like CedarDB right away, nesting systems can be a good stopgap solution. For many PostgreSQL users, the alternative to such integrations is setting up and maintaining a full ETL pipeline for infrequent, analytical “heavy hitter” queries, such as monthly reporting. In that case, having the option to spin up an embedded engine at the click of a button and have it disappear without a trace is ideal. However, if you need to continuously run analytics, the benefits of a dedicated analytical or HTAP solution will quickly outweigh the cost.

This does not hold true for approaches like pg_clickhouse or the MotherDuck component for pg_duckdb, which act more like foreign data wrappers. Rather than being fully embedded within the host, they provide access to query results on external systems through a single, convenient interface. However, they don’t help with moving data to the analytical system or keeping it up to date. Therefore, you still need to set up a data pipeline from PostgreSQL to Clickhouse or MotherDuck. While these solutions are more convenient, they are not fundamentally different from traditional ETL stacks.

These extensions’ main promise is HTAP, a workload that ETL pipelines will never be able to cover. Running analytics within a transaction and persisting results with transactional guarantees is only possible if a single system controls both analytical reads and transactional writes. These workloads are much more common than one might think. 2 To run HTAP workloads, your analytical queries require a consistent view of your transactional data and the ability to persist results with the same guarantees. However, is an analytical engine nested in a transaction system worthy of being called “HTAP”?

Is This HTAP?

The MotherDuck post that inspired this semi-rant gives a “maybe yes” but I want to stick to a clear “no.” The setup can be useful in the above scenarios, but the only part that manages to even scratch the surface of HTAP is the last one: a local, embedded analytical engine within a transactional host.

And even a full embedding still has to delegate scans to the host system, meaning it cannot overcome the most crucial bottleneck for analytics: scans. The majority of time in analytics is spent on table scans, even in systems that heavily optimize their scans for analytical workloads. 3 Both MotherDuck and Clickhouse have such optimized scans as well, but they are only useful on data they manage themselves, which does not come with the same transactional guarantees as your PostgreSQL query.

The only way to have your cake and eat it too is to use a system built for HTAP that can optimize the entire process, from scan to result, while still guaranteeing transactional integrity. Getting this right takes more than simply embedding an analytical engine.

Here is a short list of what we had to build at CedarDB to achieve full performance for a mix of analytics and transactions:

  • A storage layout that works for both point accesses and analytical scans.
  • Optimistically synchronized data structures to reduce lock contention between parallel write and read operations.
  • Optimistic snapshot isolation for transactions to prevent long reads from blocking writers.
  • A scheduler that prioritizes small writes without starving long reads.

These features are in addition to optimizations for the analytical and transactional parts in isolation, many of which we highlight in other blog posts. Building an HTAP system is definitely a much larger project than nesting two database systems, but I believe the results justify the effort.

If you want to experience the performance and convenience of a true HTAP solution for yourself, try out our Community Edition with an HTAP workload.


  1. Both pg_clickhouse and pg_duckdb report their own comparisons against PostgreSQL. See here for pg_duckdb and here for pg_clickhouse. ↩︎

  2. Even analytical warehouses, such as Redshift, experience a significant volume of mixed read and write workloads, many of which involve substantial updates. For more details, see the AWS Redshift study. ↩︎

  3. According to Snowflake, their queries spend an average of 50% of their time in scans in the published Snowset dataset. See Renen et al. ↩︎

December 16, 2024

Utilizing highly synchronized clocks in distributed databases

This master's thesis at Lund University, Sweden explores how CockroachDB's transactional performance can be improved by using tightly synchronized clocks. The thesis addresses two questions: how to integrate high-precision clock synchronization into CockroachDB, and what the resulting impact on performance is. Given publicly available clock synchronization technologies like the Amazon Time Sync Service and ClockBound, the researchers (Jacob and Fabian) argue that the traditional skepticism around the reliability of clocks in distributed systems is outdated.


CockroachDB vs Spanner approaches

CockroachDB uses loosely synchronized NTP clocks, and achieves linearizability by using Hybrid Logical Clocks (HLC) and relying on a static maximum offset (max_offset = 500 ms) to account for clock skew. However, this approach has limitations, particularly in handling transactional conflicts within the predefined uncertainty interval. When a transaction reads a value with a timestamp falling within this interval, it triggers an uncertainty restart to ensure consistency. These restarts, categorized as necessary (caused by actual clock skew) or unnecessary (caused by latency), increase transaction latency. By using tight clock synchronization and reducing uncertainty intervals, it is possible to achieve performance gains.
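
To make the mechanism concrete, here is an illustrative sketch in Go of the uncertainty-window check (not CockroachDB's actual code): shrinking max_offset from a static 500 ms to a dynamic clock-bound on the order of a millisecond shrinks the window, and with it the number of restarts.

```go
package main

import (
	"fmt"
	"time"
)

// needsUncertaintyRestart sketches the check described above: a value written
// at valueTS and observed by a read at readTS forces a restart if it falls in
// the read's uncertainty window (readTS, readTS + maxOffset]. Such a value may
// have been written before the read in real time even though its timestamp is
// larger, so the read cannot safely ignore it.
func needsUncertaintyRestart(readTS, valueTS time.Time, maxOffset time.Duration) bool {
	return valueTS.After(readTS) && !valueTS.After(readTS.Add(maxOffset))
}

func main() {
	readTS := time.Now()
	valueTS := readTS.Add(100 * time.Millisecond)
	fmt.Println(needsUncertaintyRestart(readTS, valueTS, 500*time.Millisecond))  // true: restart
	fmt.Println(needsUncertaintyRestart(readTS, valueTS, 1400*time.Microsecond)) // false with tight bounds
}
```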

To investigate this further, the paper compares CockroachDB's approach with Google's Spanner, which uses the TrueTime API. TrueTime provides bounded timestamps that guarantee the actual time falls within an interval. This enables Spanner to ensure linearizability using commit-wait, which delays transaction commits until timestamp bounds guarantee consistency. That is, a transaction must wait until the tmax of its timestamp has passed before committing and releasing its locks. Since tmax provides an upper bound on the timestamps generated across all nodes, it is guaranteed that the changes made by the transaction will be visible across the entire cluster, achieving linearizability.

To adopt tightly synchronized clocks for CockroachDB, the paper considers two main approaches:

  1. Adopting a commit-wait model: This approach, inspired by Google Spanner, involves waiting for a certain period after a transaction acquires all necessary locks, to ensure that its changes are visible across the entire cluster. However, the implementation complexity was deemed too significant for the current project, and this approach was not pursued.
  2. Dynamically adjusting uncertainty intervals: This approach focuses on dynamically reducing the size of uncertainty intervals by leveraging bounded timestamps provided by high-precision clock synchronization services, like AWS TimeSync. By dynamically adjusting these intervals based on the actual clock skew, the number of unnecessary transaction restarts can be significantly reduced, leading to improved performance.


TrueClock and ClockBound   

The authors implemented the second approach by modifying CockroachDB to utilize a TrueTime-like API and ClockBound. They ran into practical challenges during implementation. Initial tests revealed that retrieving bounded timestamps from ClockBound introduced significant latency (50 microseconds) compared to standard system clock readings (60 nanoseconds). To address this, they ported the open-source ClockBound daemon to a Go library, called TrueClock, which allowed them to include it directly in CockroachDB, which is also written in Go. This removed the need for a datagram request on each clock reading, and reduced the clock reading latency from 50 microseconds to 250 nanoseconds, a negligible overhead for database operations.


Evaluation

To test their hypothesis and evaluate the transactional performance gains, the researchers modified CockroachDB to replace the static max_offset with dynamic values calculated based on real-time clockbound synchronization precision. They deployed a CockroachDB cluster consisting of three replicas across three availability zones in the eu-north-1 (Stockholm) region. As predicted, this change significantly reduced the uncertainty interval size, improving the database's ability to process transactions efficiently. The uncertainty intervals shrank by a factor of over 500 compared to the standard configuration, with maximum bounds of 1.4 milliseconds versus the default 500 milliseconds.

Experiments showed significant performance improvements, with read latency reduced by up to 47% and write latency by up to 43%. The authors attribute these gains partly to reduced latch contention. In CockroachDB, latches are used to serialize operations, even reads. Shorter uncertainty intervals allow latches to be released sooner, reducing contention and the time writes must wait for reads to complete. Interestingly, although only reads are directly affected by uncertainty restarts, reducing their latency also indirectly benefited write performance due to lower contention.


Conclusion

This work emphasizes the importance of embracing high-precision clocks as they become increasingly accessible in production environments. By dynamically adapting to real-time synchronization precision instead of static/pessimistic assumptions about clock skew, the experiments showed improved performance even with the latches for reads still in place.

This highlights the potential of integrating modern clock synchronization methods into distributed databases. The results suggest that tighter synchronization not only improves transactional throughput but also offers a feasible path to achieving stronger consistency models like linearizability without significant overhead. The guiding principle should be to use clock synchronization for performance, and not for correctness. 



December 14, 2024

In search of a faster SQLite

Researchers at the University of Helsinki and Cambridge attempted to build a faster SQLite using modern programming paradigms like io_uring and disaggregated storage. They demonstrate up to a 100x reduction in tail latency. These are my notes.