a curated list of database news from authoritative sources

December 16, 2024

Utilizing highly synchronized clocks in distributed databases

This master's thesis from Lund University, Sweden, explores how CockroachDB's transactional performance can be improved by using tightly synchronized clocks. The paper addresses two questions: how to integrate high-precision clock synchronization into CockroachDB, and what the resulting impact on performance is. Given publicly available clock synchronization technologies such as Amazon Time Sync Service and ClockBound, the researchers (Jacob and Fabian) argue that the traditional skepticism about the reliability of clocks in distributed systems is outdated.


CockroachDB vs Spanner approaches

CockroachDB uses loosely synchronized NTP clocks and achieves linearizability by using Hybrid Logical Clocks (HLC) and relying on a static maximum offset (max_offset = 500 ms) to account for clock skew. However, this approach has limitations, particularly in handling transactional conflicts within a predefined uncertainty interval. When a transaction reads a value with a timestamp falling within this interval, it triggers an uncertainty restart to ensure consistency. These restarts, categorized as necessary (caused by actual clock skew) or unnecessary (caused by latency), increase transaction latency. By using tight clock synchronization to shrink the uncertainty intervals, it is possible to achieve performance gains.

To investigate this further, the paper compares CockroachDB's approach with Google's Spanner, which uses the TrueTime API. TrueTime provides bounded timestamps that guarantee the actual time falls within an interval. This enables Spanner to ensure linearizability using commit-wait, which delays transaction commits until the timestamp bounds guarantee consistency. That is, a transaction must wait until tmax, the upper bound of the transaction's timestamp, has passed before committing and releasing its locks. Since tmax is an upper bound on timestamps generated across all nodes, the changes made by the transaction are guaranteed to be visible across the entire cluster, achieving linearizability.
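To make the commit-wait rule concrete, here is a minimal Go sketch (my own illustration, not Spanner's or CockroachDB's code). It assumes a hypothetical boundedNow() that returns an interval guaranteed to contain the true time, and spins until the upper bound of the commit timestamp has certainly passed before the locks are released.

```go
package main

import "time"

// Bound is a bounded timestamp: the true time is guaranteed to lie in
// [Earliest, Latest]. Hypothetical type, for illustration only.
type Bound struct {
	Earliest, Latest time.Time
}

// boundedNow stands in for a TrueTime/ClockBound-style clock service.
func boundedNow() Bound {
	now := time.Now()
	eps := 500 * time.Microsecond // assumed clock error bound
	return Bound{Earliest: now.Add(-eps), Latest: now.Add(eps)}
}

// commitWait waits until the upper bound (tmax) of the commit timestamp has
// certainly passed, so the transaction's writes are visible cluster-wide
// before its locks are released.
func commitWait(commitTS Bound) {
	for boundedNow().Earliest.Before(commitTS.Latest) {
		time.Sleep(50 * time.Microsecond)
	}
	// safe to release locks and acknowledge the commit here
}

func main() {
	ts := boundedNow() // commit timestamp assigned at commit time
	commitWait(ts)     // typically a sub-millisecond wait with tight clocks
}
```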

To adopt tightly synchronized clocks for CockroachDB, the paper considers two main approaches:

  1. Adopting a commit-wait model: This approach, inspired by Google Spanner, involves waiting for a certain period after a transaction acquires all necessary locks, to ensure that its changes are visible across the entire cluster. However, the implementation complexity was deemed too significant for this project, and the approach was not pursued.
  2. Dynamically adjusting uncertainty intervals: This approach dynamically shrinks the uncertainty intervals by leveraging the bounded timestamps provided by high-precision clock synchronization services like Amazon Time Sync. By adjusting these intervals based on the actual clock error bound, the number of unnecessary transaction restarts can be significantly reduced, leading to improved performance (a rough sketch of this idea follows the list).
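As a rough sketch of the second approach (again my own illustration, not the thesis code): assume a ClockBound/TrueClock-style service that returns an earliest/latest pair, and derive each read's uncertainty window from that pair instead of from the static 500 ms max_offset.

```go
package main

import (
	"fmt"
	"time"
)

// Bound is a bounded timestamp from a ClockBound/TrueClock-style service:
// the true time lies within [Earliest, Latest]. Hypothetical type.
type Bound struct {
	Earliest, Latest time.Time
}

func boundedNow() Bound {
	now := time.Now()
	eps := 700 * time.Microsecond // assumed clock error bound, for illustration
	return Bound{Earliest: now.Add(-eps), Latest: now.Add(eps)}
}

const staticMaxOffset = 500 * time.Millisecond // CockroachDB's default max_offset

// uncertaintyUpperBound returns the timestamp below which a read may trigger
// an uncertainty restart. With the static offset this is readTS + 500 ms;
// with bounded timestamps it shrinks to the clock's current error bound.
func uncertaintyUpperBound(readTS time.Time, dynamic bool) time.Time {
	if !dynamic {
		return readTS.Add(staticMaxOffset)
	}
	b := boundedNow()
	return readTS.Add(b.Latest.Sub(b.Earliest)) // interval width = 2 * eps
}

func main() {
	readTS := time.Now()
	// A value whose timestamp falls within the uncertainty window forces a
	// restart; a smaller window means fewer unnecessary restarts.
	fmt.Println("static window: ", uncertaintyUpperBound(readTS, false).Sub(readTS))
	fmt.Println("dynamic window:", uncertaintyUpperBound(readTS, true).Sub(readTS))
}
```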


TrueClock and ClockBound   

The authors implemented the second approach by modifying CockroachDB to use a TrueTime-like API backed by ClockBound. They ran into practical challenges during implementation. Initial tests revealed that retrieving bounded timestamps from ClockBound introduced significant latency (50 microseconds) compared to standard system clock readings (60 nanoseconds). To address this, they ported the open-source ClockBound daemon to a Go library called TrueClock, which allowed them to include it directly in CockroachDB (itself written in Go). This removed the need for a datagram request on every clock reading and reduced clock-reading latency from 50 microseconds to 250 nanoseconds, a negligible overhead for database operations.


Evaluation

To test their hypothesis and evaluate the transactional performance gains, the researchers modified CockroachDB to replace the static max_offset with dynamic values calculated from ClockBound's real-time synchronization error bounds. They deployed a CockroachDB cluster of three replicas across three availability zones in the eu-north-1 (Stockholm) region. As predicted, this change significantly reduced the uncertainty interval size, improving the database's ability to process transactions efficiently. The uncertainty intervals shrank by a factor of more than 500 compared to the standard configuration, with maximum bounds of 1.4 milliseconds versus the default 500 milliseconds.

Experiments showed significant performance improvements, with read latency reduced by up to 47% and write latency by up to 43%. The authors attribute these gains partly to reduced latch contention. In CockroachDB, latches are used to serialize operations, even for reads. Shorter uncertainty intervals allow latches to be released sooner, reducing contention and the time writes must wait for reads to complete. Interestingly, although only reads are directly affected by uncertainty restarts, reducing their latency also indirectly benefited write performance due to lower contention.


Conclusion

This work emphasizes the importance of embracing high-precision clocks as they become increasingly accessible in production environments. By dynamically adapting to real-time synchronization precision instead of making static, pessimistic assumptions about clock skew, the experiments showed improved performance even with the read latches still in place.

This highlights the potential of integrating modern clock synchronization methods into distributed databases. The results suggest that tighter synchronization not only improves transactional throughput but also offers a feasible path to achieving stronger consistency models like linearizability without significant overhead. The guiding principle should be to use clock synchronization for performance, not for correctness.



December 14, 2024

In search of a faster SQLite

Researchers at the University of Helsinki and Cambridge attempted to build a faster SQLite using modern programming paradigms like io_uring and disaggregated storage. They demonstrate up to a 100x reduction in tail latency. These are my notes.

December 13, 2024

What are important data systems problems, ignored by research?

In November, I had the pleasure of attending the Dutch-Belgian DataBase Day, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some papers that can serve as motivation and starting points for research in these areas.

One significant yet understudied problem raised by multiple panelists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are ubiquitous. For instance, Amazon Redshift recently reported that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings make up a substantial majority of real-world data by volume. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, despite the impact of strings on real-world query performance and storage consumption, surprisingly little research has been dedicated to efficient database-specific string compression (there are some exceptions).

Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks.

While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.

While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as network connection handling (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though communication bottlenecks frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.

One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.

December 11, 2024

Best of Metadata in 2024

I can't believe we wasted another good year. It is time to reflect back on the best posts at Metadata blog in 2024. (I think you guys should tip me just because I didn't call this post "Metadata wrapped".)

Distributed systems posts

Transactional storage for geo-replicated systems (SOSP11): I like this paper because it asked the right questions and introduced parallel snapshot isolation. No individual part is novel (vector clocks, csets), but their composition and application to WAN web applications is. Walter showed how to limit WAN coordination while still supporting useful applications.

An Hourglass Architecture for Distributed Systems (SRDS24 Keynote): This work successfully bridges theoretical research and practical implementation of large-scale distributed systems in the Facebook/Meta control plane. The shared log abstraction separates the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the database (SMR) layer to focus on higher-level concerns without dealing with the complexities of consensus protocols.
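The shared log API mentioned above is small enough to sketch as a Go interface. The method names below are my own paraphrase of the four operations (append, check tail, read, trim), not the actual API from the keynote.

```go
package sharedlog

// Entry is an opaque record appended to the shared log.
type Entry []byte

// Log captures the shared-log abstraction: the database layer appends and
// reads entries, while the log layer handles consensus internally.
type Log interface {
	// Append durably adds an entry and returns its position in the log.
	Append(e Entry) (pos int64, err error)
	// Tail returns the position just past the last durable entry.
	Tail() (int64, error)
	// Read returns the entries in the half-open range [from, to).
	Read(from, to int64) ([]Entry, error)
	// Trim garbage-collects the log prefix before pos.
	Trim(pos int64) error
}
```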

Linearizability: A Correctness Condition for Concurrent Objects (TOPLAS90): This is a foundational paper on linearizability. I gave this paper a critical read to point out things it did well versus some shortcomings.

Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple (PAPOC24): This paper brings together the work and ideas around two-phase commit and consensus protocols. It is thought-provoking to consider the tradeoffs between the two.

FlexiRaft: Flexible Quorums with Raft (CIDR23): The paper describes how they applied Raft to MySQL replication and used flexible quorums in the process. This is not a technically deep paper, but it was interesting to see a practical application of the flexible quorums idea to Raft rather than Paxos. The most technically interesting part is that adopting flexible quorums in Raft (rather than Paxos) requires imposing an extra requirement on quorums in order to guarantee Leader Completeness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term."

Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database (SIGMOD24): Amazon MemoryDB is a fully managed in-memory database service that leverages Redis's performance strengths while overcoming its durability limitations. It uses Redis as the in-memory data processing engine but offloads persistence to an AWS-internal transaction log service (internally known as the journal). This decoupled architecture provides in-memory performance with microsecond reads and single-digit millisecond writes, while ensuring durability across availability zones (AZs), 99.99% availability, and strong consistency in the face of failures.

Fault-Tolerant Replication with Pull-Based Consensus in MongoDB (NSDI21): Raft provides fault-tolerant state machine replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication, but MongoDB uses a pull-based replication scheme, which caused challenges when integrating MongoDB's SMR with Raft. The paper focuses on examining and solving these challenges, and on explaining the resulting MongoSMR protocol.

Tunable Consistency in MongoDB (VLDB19): This paper discusses the tunable consistency models in MongoDB and how MongoDB's speculative execution model and data rollback protocol efficiently enable a spectrum of consistency levels.


Databases posts

I participated in reading groups that covered two database books in 2024, and blogged about the chapters. I had 7 posts related to the Transaction Processing book by Gray and Reuter and 14 posts related to the Designing Data Intensive Applications book. Both books were great for developing a good understanding of databases.

Scalable OLTP in the Cloud: What’s the BIG DEAL? (CIDR24): In this paper Pat Helland argues that the answer lies in the joint responsibility of the database and the application. The BIG DEAL splits the scaling responsibilities between the two: *scalable DBs* don’t coordinate across disjoint TXs updating different keys, and *scalable apps* don’t concurrently update the same key. So, snapshot isolation is a BIG DEAL!

Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases (OSDI23): Chardonnay provides strictly serializable general read-write transactions via 2PC+2PL in-memory quickly for a single-datacenter deployment. It also provides lock-free (contention-free) serializable read-only transactions from snapshots (Chardonnay is multiversion in that sense) that are taken every epoch (10ms).

Looking back at Postgres (2019): This article covers Postgres's origin story and provides a nice retrospective and context about its development and features.

DBSP: Automatic Incremental View Maintenance for Rich Query Languages (VLDB23): DBSP is a simplified version of differential dataflow. It assumes linear synchronous time, and in return provides powerful compositional properties. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This allows independent optimization of query components, enabling efficient and modular query processing. 

Understanding the Performance Implications of Storage-Disaggregated Databases (SIGMOD24): This paper conducts a comprehensive study of the performance implications of storage-disaggregated databases. The work addresses several critical performance questions that had been obscured by the closed-source nature of these systems.


AI posts

Auto-WLM: machine learning enhanced workload management in Amazon Redshift (SIGMOD23): Auto-WLM is a machine-learning-based automatic workload manager currently used in production in Amazon Redshift. This turned out to be a practical, applied data systems paper rather than a deep learning or advanced machine learning paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, possibly the first for a database system in production at scale.

The demise of coding is greatly exaggerated: This is my response to NVIDIA CEO Jensen Huang's remarks: "Over the course of the last 10 years, 15 years, almost everybody who sits on a stage like this would tell you that it is vital that your children learn computer science, and everybody should learn how to program. And in fact, it’s almost exactly the opposite."

Looming Liability Machines (LLMs): The use of LLMs for automatic root cause analysis (RCA) of cloud incidents spooked me viscerally. I am not suggesting a Butlerian Jihad against LLMs. But I am worried that we are too enticed by them. OK, let's use them, but maybe we shouldn't open the fort doors to let them in. I worry about automation surprise, systemic failures, and getting lazy by trading thinking for superficial answers.


TLA+ posts

Exploring the NaiadClock TLA+ model in TLA-Web: I have been impressed by the usability of TLA-Web from Will Schultz. Recently I have been using it for my TLA+ modeling of MongoDB catalog protocols internally, and found it very useful to explore and understand behavior. This got me thinking that TLA-Web would be really useful when exploring and understanding an unfamiliar spec I picked up on the web.

TLA+ modeling of a single replicaset transaction modeling: For some time I had been playing with transaction modeling, and most recently with replicaset modeling by way of a single log. While playing with these, I realized I could build something cool on top of these building blocks. I just finished building a snapshot-isolated transaction model that sits on top of a replicaset log. This is also a high-level approximation of MongoDB-style snapshot-isolated transactions on a single replicaset.

TLA+ modeling of MongoDB logless reconfiguration: This is a walkthrough of the TLA+ specs for the MongoDB logless reconfiguration protocol we've reviewed here.


Miscellaneous posts

The titles in this section are pretty self-explanatory, and this is mostly nontechnical writing. So dig into anything that attracts your interest.

Advice to the young

Why I blog

Know yourself

700

TLA+ conference 2024

SRDS Day 1

HPTS'24 Day 1, part 1

The checklist manifesto (Dr. Atul Gawande, 2009)



Previous years in review




Best of metadata in 2021


Best of metadata in 2020


Best of metadata in 2019


Best of metadata in 2018


Research, writing, and career advice


December 10, 2024

Index for Designing Data Intensive Applications (DDIA) book

The DDIA book is a great textbook, because it is not written like a textbook, but more like a guidebook.

Textbooks are generally bland and boring. Textbooks written by professors even more so, because those are often written to impress other professors and to flaunt academic flair. Few textbooks take teaching as the primary goal.

The DDIA book has clear writing, and it is pragmatic. It is as if your smart colleague is filling you in on the fundamentals as well as the intricacies (dirty secrets) of their field. It is genuine and authentic.

Kudos Kleppmann!


Here are my summaries of the DDIA book chapters. A new version of the book is coming soon, and I look forward to seeing the updated content.


Designing Data Intensive Applications (DDIA) Book, Chp 1. Intro and Chp2. Data Models and Query Languages

DDIA: Chp 3. Storage and Retrieval (Part 1)

DDIA: Chp 3. Storage and Retrieval (Part 2)

DDIA: Chp 4. Encoding and Evolution (Part 1)

DDIA: Chp 4. Encoding and Evolution (Part 2)

DDIA: Chp 5. Replication (Part 1)

DDIA: Chp 5. Replication (Part 2)

DDIA: Chp 6. Partitioning

DDIA: Chp 7. Transactions (Part 1)

DDIA: Chp 7. Transactions (Part 2): Serializability

DDIA: Chp 8. The Trouble with Distributed Systems

DDIA: Chp 9. Consistency and Consensus

DDIA: Chp 10. Batch Processing

DDIA: Chp 11. Stream Processing

DDIA: Chp 12. The Future of Data Systems: TK


C++ exception performance three years later

About three years ago we noticed serious performance problems in C++ exception unwinding. Due to contention on the unwinding path, these became more and more severe the more cores a system had, and unwinding could slow down by orders of magnitude. Due to the constraints of backwards compatibility this contention was not easy to eliminate, and P2544 discussed ways to fix the problem via language changes in C++.

But fortunately people found less invasive solutions. First, Florian Weimer changed glibc to provide a lock-free mechanism for finding the (static) unwind tables of a given shared object, which eliminates the most serious contention for "simple" C++ programs. For example, in a micro-benchmark that calls a function with some computations (100 calls to sqrt per function invocation), and which throws with a certain probability, we previously had very poor scalability with increasing core count. With his patch we now see the following performance development with gcc 14.2 on a dual-socket EPYC 7713 (runtime in ms):


threads        1    2    4    8   16   32   64  128
0% failure    29   29   29   29   29   29   29   42
0.1% failure  29   29   29   29   29   29   29   32
1% failure    29   30   30   30   30   30   32   34
10% failure   36   36   37   37   37   37   47   65

This is more or less perfect. 128 threads are a bit slower, but that is to be expected, as one EPYC only has 64 cores. With higher failure rates unwinding itself becomes slower, but that is still acceptable here. Thus most C++ programs are just fine.

For our use case that is not enough, though. We dynamically generate machine code at runtime, and we want to be able to pass exceptions through generated code. The _dl_find_object mechanism of glibc is not used for JITed code; instead, libgcc maintains its own lookup structure. Historically this was a simple list with a global lock, which of course had terrible performance. But through a series of patches we managed to change libgcc to use a lock-free b-tree for maintaining the dynamic unwinding frames. Using a similar experiment to the one above, but now with JIT-generated code (using LLVM 19), we get the following:


threads        1    2    4    8   16   32   64  128
0% failure    32   38   48   64   48   36   59   62
0.1% failure  32   32   32   32   32   48   62   68
1% failure    41   40   40   40   53   69   80   83
10% failure  123  113  103  116  128  127  131  214

The numbers have more noise than for statically generated code, but the overall observation is the same: unwinding now scales with the number of cores, and we can safely use C++ exceptions even on machines with large core counts.

So is everything perfect now? No. First, only gcc has a fully scalable frame lookup mechanism. clang has its own implementation, and as far as I know it still does not scale properly due to a global lock in DwarfFDECache. Note that at least on many Linux distributions clang uses libgcc by default, thus the problem is not immediately obvious there, but a pure llvm/clang build will have scalability problems. And second, unwinding through JITed code is quite a bit slower, which is unfortunate. But admittedly the problem is less severe than shown here; the benchmark with JITed code simply has one more stack frame to unwind due to the way static code and JITed code interact. And it makes sense to prioritize static unwinding over dynamic unwinding frames, as most people never JIT-generate code.

Overall we are now quite happy with the unwinding mechanism. The bottlenecks are gone, and performance is fine even with high core counts. It is still not appropriate for high failure rates, something like P0709 would be better for that, but we can live with the status quo.



DDIA: Chapter 11 - Stream Processing


Daily batch processes introduce significant latency, since input changes are reflected in the output only after a day. For fast-paced businesses, this is too slow. To reduce delays, stream processing runs more frequently (e.g., every second) or continuously, handling events as they happen.

In stream processing, a record is typically called an event—a small, immutable object containing details of an occurrence, often with a timestamp. Polling for new events becomes costly when striving for low-latency continuous processing. Frequent polling increases overhead as most requests return no new data. Instead, systems should notify consumers when new events are available. Messaging systems handle this by pushing events from producers to consumers.

Direct messaging systems require application code to handle message loss and assume producers and consumers are always online, limiting fault tolerance. Message brokers (or message queues) improve reliability by acting as intermediaries. Producers write messages to the broker, which clients (consumers) read. Examples include RabbitMQ, ActiveMQ, Azure Service Bus, and Google Cloud Pub/Sub.

Partitioned logs store messages as append-only logs, assigning offsets to messages within each partition. Log-based brokers like Apache Kafka, Amazon Kinesis Streams, and Twitter’s DistributedLog prioritize high throughput and message ordering. Google Cloud Pub/Sub offers a similar architecture but exposes a JMS-style API. Log-based brokers suit high-throughput scenarios with fast, ordered message processing. In contrast, JMS/AMQP brokers work better when messages are expensive to process, order isn't critical, and parallel processing is needed.
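As a toy illustration of the partitioned-log model (not any particular broker's API): each partition is an append-only sequence, a message's offset is simply its position in that sequence, and a consumer resumes from its last offset in each partition, which is what preserves per-partition ordering.

```go
package main

import "fmt"

// partition is an append-only sequence of messages; a message's offset is
// simply its index in the slice.
type partition struct {
	messages []string
}

func (p *partition) append(msg string) int64 {
	p.messages = append(p.messages, msg)
	return int64(len(p.messages) - 1) // offset of the new message
}

// consumer remembers the next offset to read per partition, so it can
// resume where it left off while preserving per-partition order.
type consumer struct {
	next map[int]int64
}

func (c *consumer) poll(partID int, p *partition) []string {
	from := c.next[partID]
	batch := p.messages[from:]
	c.next[partID] = int64(len(p.messages))
	return batch
}

func main() {
	p := &partition{}
	p.append("order created")
	p.append("order paid")

	c := &consumer{next: map[int]int64{}}
	fmt.Println(c.poll(0, p)) // [order created order paid]
	p.append("order shipped")
	fmt.Println(c.poll(0, p)) // [order shipped]
}
```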


Databases and Streams

Dual writes, where applications update multiple systems (e.g., a database, search index, and cache) concurrently or sequentially, are error-prone. Race conditions can occur, leading to inconsistent states (see Fig. 11-4). A better approach is to designate one system as the leader (e.g., the database) and make others, like the search index, its followers.

Traditionally, databases treat replication logs as internal details, not public APIs. However, Change Data Capture (CDC) extracts data changes from a database, often as a stream, enabling replication to other systems. For example, a database change log can update a search index in the same order as the changes occur, ensuring consistency (see Fig. 11-5). This makes derived data systems, like search indexes, consumers of the change stream.

CDC designates the source database as the leader, with derived systems as followers. Log-based message brokers like Kafka are ideal for delivering CDC events, as they maintain order. Large-scale CDC implementations include LinkedIn's Databus, Facebook's Wormhole, and Debezium for MySQL. Tools like Kafka Connect offer connectors for various databases.

To manage disk space, CDC uses snapshots in conjunction with logs. A snapshot corresponds to a specific log position, ensuring changes after the snapshot can be applied correctly. Some CDC tools automate this; others require manual handling.
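Here is a small sketch of the snapshot-plus-log pattern, with made-up types: a derived store (say, a search index) is seeded from a snapshot taken at a known log offset, then replays only the changes after that offset, in order.

```go
package main

import "fmt"

// Change is a single CDC event from the source database's change log.
type Change struct {
	Offset int64
	Key    string
	Value  string // empty value means the key was deleted
}

// Snapshot is a consistent copy of the data as of a specific log offset.
type Snapshot struct {
	AsOfOffset int64
	Data       map[string]string
}

// rebuildIndex seeds a derived store from the snapshot, then replays only
// the changes after the snapshot's offset, in log order.
func rebuildIndex(snap Snapshot, log []Change) map[string]string {
	index := map[string]string{}
	for k, v := range snap.Data {
		index[k] = v
	}
	for _, ch := range log {
		if ch.Offset <= snap.AsOfOffset {
			continue // already reflected in the snapshot
		}
		if ch.Value == "" {
			delete(index, ch.Key)
		} else {
			index[ch.Key] = ch.Value
		}
	}
	return index
}

func main() {
	snap := Snapshot{AsOfOffset: 2, Data: map[string]string{"user:1": "Ada"}}
	log := []Change{
		{Offset: 2, Key: "user:1", Value: "Ada"},   // covered by snapshot
		{Offset: 3, Key: "user:2", Value: "Grace"}, // applied
		{Offset: 4, Key: "user:1", Value: ""},      // delete applied
	}
	fmt.Println(rebuildIndex(snap, log)) // map[user:2:Grace]
}
```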

Event sourcing builds on immutability, recording all changes as an append-only log of events. This approach mirrors accounting, where transactions are never altered but corrected with new entries if mistakes occur. Immutable logs improve auditability, ease recovery from bugs, and preserve historical data for analytics. For instance, a customer adding and removing an item from a shopping cart generates two events. Though the cart's final state is empty, the log records the customer’s interest, which is valuable for analytics.

Event sourcing simplifies concurrency control. Instead of multi-object transactions, a single event encapsulates a user action, requiring only an atomic append to the log. Partitioning the log and application state similarly (e.g., by customer) enables single-threaded processing per partition, eliminating concurrency issues. For cross-partition events, additional coordination is needed.
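The shopping-cart example can be sketched in a few lines of Go (illustrative only): each user action is appended as an immutable event, the current cart state is derived by folding over the log, and the full history stays available for analytics.

```go
package main

import "fmt"

// CartEvent records a single user action; events are only ever appended.
type CartEvent struct {
	Kind string // "added" or "removed"
	Item string
}

// currentCart derives the cart's present state by folding over the event log.
func currentCart(events []CartEvent) map[string]int {
	cart := map[string]int{}
	for _, e := range events {
		switch e.Kind {
		case "added":
			cart[e.Item]++
		case "removed":
			if cart[e.Item] > 0 {
				cart[e.Item]--
				if cart[e.Item] == 0 {
					delete(cart, e.Item)
				}
			}
		}
	}
	return cart
}

func main() {
	// The customer added and then removed an item: the final cart is empty,
	// but the log still records the interest in the item.
	log := []CartEvent{
		{Kind: "added", Item: "laptop"},
		{Kind: "removed", Item: "laptop"},
	}
	fmt.Println(currentCart(log)) // map[]
}
```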


Processing streams

Processing streams supports use cases like monitoring (e.g., fraud detection, trading systems, manufacturing) and involves joins, which combine data across streams and databases. Three join types include stream-stream joins for matching related events within a time window, stream-table joins for enriching events using database changelogs, and table-table joins for producing materialized view updates. Fault tolerance in stream processing invole using techniques like microbatching, checkpointing, transactions, and idempotent writes to ensure reliability in long-running processes.