December 18, 2024
Stop Nesting Database Systems
The idea of embedding fast data processing engines into other systems to speed up analytics is not new.
In fact, DuckDB started out as SQL analytics on Pandas.
Meta had a similar vision for its Velox project, which takes a slightly different approach by building an embeddable execution engine distributed as a C++ library.
If you’re considering building your own analytics engine, embedding an existing engine is almost always a better choice. Don’t roll your own crypto analytics engine.
Recently, however, there has been a trend to embed database systems in other database systems for various reasons, which I find surprising and worrying. It started with pg_duckdb, and now there are omni-sqlite, pg_clickhouse, and, just this week, a PlanetScale to MotherDuck pipeline over pg_duckdb. With this pipeline, you end up nesting four systems: PlanetScale packs PostgreSQL, which packs the MotherDuck-built extension pg_duckdb, which packs DuckDB and MotherDuck.
Solutions like pg_duckdb and pg_clickhouse promise to solve your PostgreSQL analytics issues with HTAP-like capabilities. They allow you to run transactions in PostgreSQL and claim to seamlessly hand over analytical workloads to Clickhouse and DuckDB with a sleek interface directly from PostgreSQL. But only at first glance. To me, this approach is similar to opening a random closet and throwing all your clutter in there to hide it before guests arrive. Does it look clean? Sure! But does it solve your issues? Probably not.
To find out if these scientists were too preoccupied with whether they could, let’s take a look at how these extensions work and what they can and can’t do for you.
How the Nesting of Database Systems Works
We’ll use pg_duckdb as an example because it is the oldest and most popular of the bunch. However, the points discussed here are not specific to pg_duckdb; they apply to the concept of nesting database systems in general.
Among the available solutions, pg_duckdb has the greatest potential to speed up analytics within a transactional host like PostgreSQL. pg_duckdb has several features that enable elegant, nearly seamless integration.
- DuckDB was designed from the beginning to be embedded within other processes.
- DuckDB supports the PostgreSQL SQL dialect.
- DuckDB has an excellent query planner, joins, and analytical algorithms.
- Your queries run within the same transaction on transactionally consistent data within the bounds of PostgreSQL.
With the extension enabled, pg_duckdb controls query execution for all queries you send to PostgreSQL. It knows what it can handle itself and delegates the rest, mainly scanning the data on disk, to PostgreSQL. All join and aggregation processing then takes place in DuckDB before the result is handed back to PostgreSQL. The diagram below provides a rough overview and illustrates some optimizations that we will discuss below.
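To make the division of labor concrete, here is a minimal sketch of what this looks like from a client, assuming pg_duckdb is installed; the connection string, table, and the duckdb.force_execution setting are illustrative assumptions rather than a definitive recipe.

```python
import psycopg2

# Connect to a PostgreSQL instance that has the pg_duckdb extension installed.
# The database name, credentials, and table are placeholders.
conn = psycopg2.connect("dbname=shop user=postgres")
cur = conn.cursor()

# Hypothetical session switch: pg_duckdb documents a setting along these lines
# to route eligible queries through DuckDB instead of the PostgreSQL executor.
cur.execute("SET duckdb.force_execution = true;")

# The aggregation below would be planned and executed by DuckDB, but the rows
# are still produced by PostgreSQL's own table scans.
cur.execute("""
    SELECT customer_id, count(*) AS orders, sum(total) AS revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```

The point is that the SQL stays plain PostgreSQL; only the executor behind it changes.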
What They Can Do Better than the Outer Engine
Analytical query processing has certainly come a long way since PostgreSQL was designed. Consequently, using an analytical system provides a wealth of functionality not available in PostgreSQL. Unlike PostgreSQL, DuckDB’s optimizer was designed for complex analytical workloads, so it can find much more efficient plans. This allows DuckDB to fully parallelize queries and utilize system resources much more effectively. While PostgreSQL assumes memory is scarce, analytical engines can productively saturate powerful servers. This has the potential to reduce execution times significantly. You also gain access to features of the embedded engine that are not available in PostgreSQL. For example, you can query remote Parquet files and combine them with local subresults.
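As a hedged sketch of the kind of functionality the embedded engine unlocks, here is DuckDB's Python API joining a local table with a remote Parquet file; the URL and both schemas are placeholders.

```python
import duckdb

con = duckdb.connect()

# httpfs lets DuckDB read Parquet files over HTTP(S)/S3.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

con.execute("CREATE TABLE local_orders (order_id INT, customer_id INT, total DOUBLE)")
con.execute("INSERT INTO local_orders VALUES (1, 42, 99.5), (2, 7, 10.0)")

# Join a local (sub)result with a remote Parquet file in one query.
result = con.sql("""
    SELECT c.segment, sum(o.total) AS revenue
    FROM local_orders o
    JOIN read_parquet('https://example.com/customers.parquet') c
      ON o.customer_id = c.customer_id
    GROUP BY c.segment
""")
print(result)
```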
Where They Are On-Par
Of course, nesting another engine does not magically solve all performance issues. There are parts where the embedded engine can only be as fast as the outer system allows. In order to run queries on data managed by the host system, the embedded engine must retrieve the data from the host. Retrieving this data requires a regular scan, which PostgreSQL must perform. pg_duckdb employs additional optimizations to reduce the impact of this process. It pushes filters to PostgreSQL scans where possible and runs scans in parallel. However, it still depends on PostgreSQL to fetch rows from disk using its buffer manager. It has no access to columnar layout or advanced techniques such as scan pruning. Therefore, while the upper parts of the query plan can leverage DuckDB’s full performance, all scans are expected to run at PostgreSQL speeds.
Where They Hold You Back
Creating a Frankenstein system comes at a cost. For one thing, you increase the complexity of your stack: rather than running PostgreSQL in isolation, you now operate two engines that aren’t designed to work together. There are other limitations as well.
While you can perform many analytical tasks, some things fundamentally do not work. The most obvious issue is that you cannot mix and match functionality. For instance, you can’t run DuckDB plans that require PostgreSQL functions in the query plan. However, this is not only a surface-level issue of available functions. What if the systems have different sort orders or comparisons, e.g., through collations? Suddenly, the results could differ depending on which system executes the query, and you would not know.
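As a toy illustration of how collations alone can change results, here is a small Python example contrasting code-point ordering with locale-aware ordering; it assumes the de_DE.UTF-8 locale is installed on the machine.

```python
import locale

words = ["zebra", "Äpfel", "apple", "Zebra"]

# Code-point order: roughly what a collation-unaware comparison produces.
print(sorted(words))  # ['Zebra', 'apple', 'zebra', 'Äpfel']

# Locale-aware order (German collation) yields a different, locale-dependent
# order; this raises locale.Error if de_DE.UTF-8 is not installed.
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))
```

Two engines that disagree on the collation will disagree on ORDER BY, GROUP BY, and even join results for such strings.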
The last downside is the missing upsides in the current state of these solutions. While there are obvious theoretical opportunities for performance improvements, neither pg_duckdb nor pg_clickhouse can reliably outperform PostgreSQL on TPC-H in their current forms. 1 So you need to decide carefully which queries to hand over to them, and when sticking with PostgreSQL is the better choice.
Nesting Systems vs ETL
Although nested systems have their drawbacks, there are definitely use cases where they are preferable to typical ETL setups.
First, if you need to perform analytics, it has to happen somewhere. If you don’t want to adopt a full HTAP engine like CedarDB right away, nesting systems can be a good stopgap solution. For many PostgreSQL users, the alternative to such integrations is setting up and maintaining a full ETL pipeline for infrequent, analytical “heavy hitter” queries, such as monthly reporting. In that case, having the option to spin up an embedded engine at the click of a button and have it disappear without a trace is ideal. However, if you need to continuously run analytics, the benefits of a dedicated analytical or HTAP solution will quickly outweigh the cost.
This does not hold true for approaches like pg_clickhouse or the MotherDuck component for pg_duckdb, which act more like foreign data wrappers. Rather than being fully embedded within the host, they provide access to query results on external systems through a single, convenient interface. However, they don’t help with moving data to the analytical system or keeping it up to date. Therefore, you still need to set up a data pipeline from PostgreSQL to Clickhouse or MotherDuck. While these solutions are more convenient, they are not fundamentally different from traditional ETL stacks.
These extensions’ main promise is HTAP, a workload that ETL pipelines will never be able to cover. Running analytics within a transaction and persisting results with transactional guarantees is only possible if a single system controls both analytical reads and transactional writes. These workloads are much more common than one might think. 2 To run HTAP workloads, your analytical queries require a consistent view of your transactional data and the ability to persist results with the same guarantees. However, is an analytical engine nested in a transaction system worthy of being called “HTAP”?
Is This HTAP?
The MotherDuck post that inspired this semi-rant gives a “maybe yes” but I want to stick to a clear “no.” The setup can be useful in the above scenarios, but the only part that manages to even scratch the surface of HTAP is the last one: a local, embedded analytical engine within a transactional host.
And even a full embedding still has to delegate scans to the host system, meaning it cannot overcome the most crucial bottleneck for analytics: scans. The majority of time in analytics is spent on table scans, even in systems that heavily optimize their scans for analytical workloads. 3 Both MotherDuck and Clickhouse have such scans as well, but they only help on data they manage themselves, which does not come with the same transactional guarantees as your PostgreSQL query.
The only way to have your cake and eat it too is to use a system built for HTAP that can optimize the entire process, from scan to result, while still guaranteeing transactional integrity. Getting this right takes more than simply embedding an analytical engine.
Here is a short list of what we had to build at CedarDB to achieve full performance for a mix of analytics and transactions:
- A storage layout that works for both point accesses and analytical scans.
- Optimistically synchronized data structures to reduce lock contention between parallel write and read operations.
- Optimistic snapshot isolation for transactions to prevent long reads from blocking writers.
- A scheduler that prioritizes small writes without starving long reads.
These features are in addition to optimizations for the analytical and transactional parts in isolation, many of which we highlight in other blog posts. Building an HTAP system is definitely a much larger project than nesting two database systems, but I believe the results justify the effort.
If you want to experience the performance and convenience of a true HTAP solution for yourself, try out our Community Edition with an HTAP workload.
1. Both pg_clickhouse and pg_duckdb report their own comparisons against PostgreSQL. See here for pg_duckdb and here for pg_clickhouse. ↩︎
2. Even analytical warehouses, such as Redshift, experience a significant volume of mixed read and write workloads, many of which involve substantial updates. See the AWS Redshift study for more details. ↩︎
3. According to Snowflake, their queries spend an average of 50% of their time in scans in the published Snowset dataset. See Renen et al. ↩︎
December 16, 2024
Utilizing highly synchronized clocks in distributed databases
This master's thesis from Lund University, Sweden, explores how CockroachDB's transactional performance can be improved by using tightly synchronized clocks. The paper addresses two questions: how to integrate high-precision clock synchronization into CockroachDB and what the resulting impact on performance is. Given publicly available clock synchronization technologies like Amazon Time Sync Service and ClockBound, the researchers (Jacob and Fabian) argue that the traditional skepticism around the reliability of clocks in distributed systems is outdated.
CockroachDB vs Spanner approaches
CockroachDB uses loosely synchronized NTP clocks and achieves linearizability by using Hybrid Logical Clocks (HLC) and relying on a static maximum offset (max_offset = 500 ms) to account for clock skew. However, this approach has limitations, particularly in handling transactional conflicts within a predefined uncertainty interval. When a transaction reads a value with a timestamp falling within this interval, it triggers an uncertainty restart to ensure consistency. These restarts, categorized as necessary (caused by actual clock skew) or unnecessary (caused by latency), increase transaction latency. By using tight clock synchronization and reducing uncertainty intervals, it is possible to achieve performance gains.
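A much-simplified sketch of the uncertainty check, with the static bound hard-coded; the real logic also tracks observed timestamps per node, so treat this as illustration only.

```python
import time

MAX_OFFSET_NS = 500_000_000  # CockroachDB's static default bound: 500 ms

def needs_uncertainty_restart(read_ts_ns: int, value_ts_ns: int) -> bool:
    # A value stamped inside (read_ts, read_ts + max_offset] may actually have
    # been written before the read in real time, so the read must restart at a
    # higher timestamp to stay consistent.
    return read_ts_ns < value_ts_ns <= read_ts_ns + MAX_OFFSET_NS

now = time.time_ns()
print(needs_uncertainty_restart(now, now + 1_000_000))    # 1 ms ahead -> True (restart)
print(needs_uncertainty_restart(now, now + 600_000_000))  # beyond the bound -> False
```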
To investigate this further, the paper compares CockroachDB's approach with Google's Spanner, which uses the TrueTime API. TrueTime provides bounded timestamps that guarantee the actual time falls within an interval. This enables Spanner to ensure linearizability using commit-wait, which delays transaction commits until timestamp bounds guarantee consistency. That is, a transaction must wait until tmax of the transaction’s timestamp has passed before committing and releasing its locks. As tmax provides an upper bound for timestamps generated across all nodes, it is guaranteed that the changes made by the transaction will be visible across the entire cluster, achieving linearizability.
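A rough sketch of the commit-wait idea, with a made-up bounded-clock function standing in for TrueTime/ClockBound:

```python
import time

def bounded_now_ns():
    # Stand-in for a TrueTime/ClockBound-style reading: (earliest, latest)
    # bounds on the true time. The +/- 1 ms error here is made up.
    t = time.time_ns()
    return t - 1_000_000, t + 1_000_000

def commit_wait(commit_ts_ns: int) -> None:
    # Block until even the earliest possible true time has passed the commit
    # timestamp, so every node will observe the commit as being in the past.
    while True:
        earliest, _ = bounded_now_ns()
        if earliest > commit_ts_ns:
            return
        time.sleep((commit_ts_ns - earliest) / 1e9)

# Assign the commit timestamp at the latest bound (tmax), then wait it out
# before releasing locks and acknowledging the commit.
_, tmax = bounded_now_ns()
commit_wait(tmax)
print("commit is now visible everywhere")
```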
To adopt tightly synchronized clocks for CockroachDB, the paper considers two main approaches:
- Adopting a commit-wait model: This approach, inspired by Google Spanner, involves waiting for a certain period after a transaction acquires all necessary locks to ensure that its changes are visible across the entire cluster. However, the implementation complexity was deemed too significant for the current project, and this approach was not pursued.
- Dynamically adjusting uncertainty intervals: This approach focuses on dynamically reducing the size of uncertainty intervals by leveraging bounded timestamps provided by high-precision clock synchronization services, like AWS TimeSync. By dynamically adjusting these intervals based on the actual clock skew, the number of unnecessary transaction restarts can be significantly reduced, leading to improved performance.
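A minimal sketch of the second approach: size the uncertainty interval from the bounds the clock service actually reports instead of assuming a static 500 ms (the bound function below is a stand-in).

```python
import time

STATIC_MAX_OFFSET_NS = 500_000_000  # the pessimistic default: 500 ms

def clockbound_now_ns():
    # Stand-in for a ClockBound/TrueClock reading: (earliest, latest) bounds
    # on the true time. The +/- 0.5 ms error here is made up.
    t = time.time_ns()
    return t - 500_000, t + 500_000

def uncertainty_interval_ns() -> int:
    # Size the interval from the synchronization error the clock service
    # actually reports, instead of assuming the worst case.
    earliest, latest = clockbound_now_ns()
    return latest - earliest

print(uncertainty_interval_ns(), "vs", STATIC_MAX_OFFSET_NS)  # ~1 ms vs 500 ms
```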
TrueClock and ClockBound
The authors implemented the second approach by modifying CockroachDB to utilize a TrueTime-like API and ClockBound. They ran into practical challenges during implementation. Initial tests revealed that retrieving bounded timestamps from ClockBound introduced significant latency (50 microseconds) compared to standard system clock readings (60 nanoseconds). To address this, they ported the open-source ClockBound daemon to a Go library, called TrueClock, which allowed them to include it directly in CockroachDB, as it is also written in Go. This removed the need for a datagram request for each clock reading and reduced clock reading latency from 50 microseconds to 250 nanoseconds, a negligible overhead for database operations.
Evaluation
To test their hypothesis and evaluate the transactional performance gains, the researchers modified CockroachDB to replace the static max_offset with dynamic values calculated from the real-time ClockBound synchronization precision. They deployed a CockroachDB cluster consisting of three replicas across three availability zones in the eu-north-1 (Stockholm) region. As predicted, this change significantly reduced the uncertainty interval size, improving the database's ability to process transactions efficiently. The uncertainty intervals shrank by a factor of over 500 compared to the standard configuration, with maximum bounds of 1.4 milliseconds versus the default 500 milliseconds.
Experiments showed significant performance improvements, with read latency reduced by up to 47% and write latency by up to 43%. The authors attribute these gains partly to reduced latch contention. In CockroachDB, latches are used to serialize operations, even for read operations. Shorter uncertainty intervals allow latches to release sooner, reducing contention and the time writes must wait for reads to complete. Interestingly, although only reads are directly affected by uncertainty restarts, reducing their latency also indirectly benefited write performance due to lower contention.
Conclusion
This work emphasizes the importance of embracing high-precision clocks as they become increasingly accessible in production environments. By dynamically adapting to real-time synchronization precision instead of static, pessimistic assumptions about clock skew, the experiments showed improved performance even with read latches still in place.
This highlights the potential of integrating modern clock synchronization methods into distributed databases. The results suggest that tighter synchronization not only improves transactional throughput but also offers a feasible path to achieving stronger consistency models like linearizability without significant overhead. The guiding principle should be to use clock synchronization for performance, and not for correctness.
December 14, 2024
In search of a faster SQLite
December 13, 2024
What are important data systems problems, ignored by research?
In November, I had the pleasure of attending the Dutch-Belgian DataBase Day, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some papers that can serve as motivation and starting points for research in these areas.
One significant yet understudied problem raised by multiple panelists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are ubiquitous. For instance, Amazon Redshift recently reported that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings make up a substantial majority of real-world data. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, given the impact of strings on real-world query performance and storage consumption, it is surprising how little research has been dedicated to efficient database-specific string compression (there are some exceptions).
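To make the compression point concrete, here is a toy dictionary encoding in Python, one of the classic database string compression techniques; real systems layer far more sophisticated schemes on top, but the sketch shows the core idea of replacing variable-length strings with fixed-size codes.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    # Store every distinct string once and replace the column with small
    # integer codes; operators can then work on fixed-size codes instead of
    # variable-length strings.
    dictionary: list[str] = []
    index: dict[str, int] = {}
    codes: list[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

dictionary, codes = dictionary_encode(["DE", "SE", "DE", "DE", "NL"])
print(dictionary)  # ['DE', 'SE', 'NL']
print(codes)       # [0, 1, 0, 0, 2]
```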
Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks. While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.
While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as network connection handling (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though communication bottlenecks frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.
One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.
December 11, 2024
Can your database do this? Ep 3: Zero-downtime, type-safe migrations
Best of Metadata in 2024
I can't believe we wasted another good year. It is time to reflect back on the best posts on the Metadata blog in 2024. (I think you guys should tip me just because I didn't call this post "Metadata wrapped".)
Distributed systems posts
Transactional storage for geo-replicated systems (SOSP11): I like this paper because it asked the right questions and introduced parallel snapshot isolation. No individual part is novel (vector clocks, csets), but their composition and application to WAN web applications were novel. Walter showed how to limit WAN coordination while still supporting useful applications.
An Hourglass Architecture for Distributed Systems (SRDS24 Keynote): This work successfully bridges theoretical research and practical implementation in large-scale distributed systems in the Facebook/Meta control plane. The shared log abstraction proposes to separate the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the SMR layer to focus on higher-level concerns without dealing with the complexities of consensus protocols.
Linearizability: A Correctness Condition for Concurrent Objects (TOPLAS90): This is a foundational paper on linearizability. I gave this paper a critical read to point out things it did well versus some shortcomings.
Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple (PAPOC24): This paper brings together the work/ideas around 2-phase commit and consensus protocols. It is thought provoking to consider the tradeoffs between the two.
FlexiRaft: Flexible Quorums with Raft (CIDR23): The paper talks about how they applied Raft to MySQL replication and used flexible quorums in the process. This is not a technically deep paper, but it was interesting to see a practical application of the flexible quorums idea to Raft rather than Paxos. The most technically interesting part is the adaptation of flexible quorums to Raft, which needs to impose an extra requirement on quorums in order to guarantee Leader Completeness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term."
Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database (SIGMOD24): Amazon MemoryDB is a fully managed in-memory database service that leverages Redis's performance strengths while overcoming its durability limitations. It uses Redis as the in-memory data processing engine but offloads persistence to an AWS-internal transaction log service (internally known as the journal). This decoupled architecture provides in-memory performance with microsecond reads and single-digit millisecond writes, while ensuring across availability zone (AZ) durability, 99.99% availability, and strong consistency in the face of failures.
Fault-Tolerant Replication with Pull-Based Consensus in MongoDB (NSDI21): Raft provides fault-tolerant state-machine-replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication, but MongoDB uses a pull-based replication scheme, which caused challenges when integrating MongoDB's SMR with Raft. The paper focuses on examining and solving these challenges, and explaining the resulting MongoSMR protocol.
Tunable Consistency in MongoDB (VLDB19): This paper discusses the tunable consistency models in MongoDB and how MongoDB's speculative execution model and data rollback protocol enable a spectrum of consistency levels efficiently.
Databases posts
I participated in reading groups that covered two database books in 2024, and blogged about the chapters. I had 7 posts related to the Transaction Processing book by Gray and Reuter and 14 posts related to the Designing Data Intensive Applications book. Both books were great for getting a good understanding of databases.
- Index for the Transaction Processing Book
- Index for Designing Data Intensive Applications (DDIA) book
Scalable OLTP in the Cloud: What’s the BIG DEAL? (CIDR24): In this paper, Pat Helland argues that the answer lies in the joint responsibility of the database and the application. The BIG DEAL splits the scaling responsibilities between the database and the application. *Scalable DBs* don’t coordinate across disjoint TXs updating different keys. *Scalable apps* don’t concurrently update the same key. So, snapshot isolation is a BIG DEAL!
Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases (OSDI23): Chardonnay provides strictly serializable general read-write transactions via 2PC+2PL in-memory quickly for a single-datacenter deployment. It also provides lock-free (contention-free) serializable read-only transactions from snapshots (Chardonnay is multiversion in that sense) that are taken every epoch (10ms).
Looking back at Postgres(2019): This article covers Postgres's origin story, and provides a nice retrospective and context about development and features.
DBSP: Automatic Incremental View Maintenance for Rich Query Languages (VLDB23): DBSP is a simplified version of differential dataflow. It assumes linear synchronous time, and in return provides powerful compositional properties. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This allows independent optimization of query components, enabling efficient and modular query processing.
Understanding the Performance Implications of Storage-Disaggregated Databases (SIGMOD24): This paper has conducted a comprehensive study to investigate the performance implications of storage-disaggregated databases. The work addresses several critical performance questions that were obscured due to the closed-source nature of these systems.
AI posts
Auto-WLM: machine learning enhanced workload management in Amazon Redshift (SIGMOD23): Auto-WLM is a machine learning based automatic workload manager currently used in production in Amazon Redshift. This paper turned out to be a practical/applied data systems paper rather than a deep learning and/or advanced machine learning paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, possibly the first for a database system in production at scale.
The demise of coding is greatly exaggerated: This is my response to NVIDIA CEO Jensen Huang's remarks: "Over the course of the last 10 years, 15 years, almost everybody who sits on a stage like this would tell you that it is vital that your children learn computer science, and everybody should learn how to program. And in fact, it’s almost exactly the opposite."
Looming Liability Machines (LLMs): The use of LLMs for automatic root cause analysis (RCA) for cloud incidents spooked me viscerally. I am not suggesting a Butlerian Jihad against LLMs. But I am worried that we are too enticed by LLMs. OK, let's use them, but maybe we shouldn't open the fort doors to let them in. I have worries about automation surprise, systemic failures, and getting lazy, trading thinking for superficial answers.
TLA+ posts
Exploring the NaiadClock TLA+ model in TLA-Web: I have been impressed by the usability of TLA-Web from Will Schultz. Recently I have been using it for my TLA+ modeling of MongoDB catalog protocols internally, and found it very useful to explore and understand behavior. This got me thinking that TLA-Web would be really useful when exploring and understanding an unfamiliar spec I picked up on the web.
TLA+ modeling of a single replicaset transaction modeling: For some time I had been playing with transaction modeling and most recently with replicaset modeling by way of a single log. While playing with these, I realized I could build something cool on top of these building blocks. I just finished building snapshot-isolated transaction modeling that sits on top of a replicaset log. This is also a high-level approximation of MongoDB-style snapshot isolated transactions on a single replicaset.
TLA+ modeling of MongoDB logless reconfiguration: This is a walkthrough of the TLA+ specs for the MongoDB logless reconfiguration protocol we've reviewed here.
Miscellaneous posts
The titles in this section are pretty self-explanatory, and this is mostly nontechnical writing. So dig into anything that attracts your interest.
The checklist manifesto (Dr. Atul Gawande, 2009)
Previous years in review
Best of metadata in 2021
Best of metadata in 2020
Best of metadata in 2019
Best of metadata in 2018
Research, writing, and career advice
December 10, 2024
Index for Designing Data Intensive Applications (DDIA) book
The DDIA book is a great textbook, because it is not written as a textbook but more as a guidebook.
Textbooks are generally bland and boring. Textbooks that are written by professors are even more so, because those are often written to impress other professors and to flaunt academic flair. Few textbooks take teaching as the primary goal.
DDIA book has clear writing, and it is pragmatic. It is as if your smart colleague is filling you in about the fundamentals as well as intricacies (dirty secrets) of their field. It is genuine and authentic.
Kudos Kleppmann!
Here are my summaries of the DDIA book chapters. A new version of the book is coming soon, and I look forward to seeing the updated content.
DDIA: Chp 3. Storage and Retrieval (Part 1)
DDIA: Chp 3. Storage and Retrieval (Part 2)
DDIA: Chp 4. Encoding and Evolution (Part 1)
DDIA: Chp 4. Encoding and Evolution (Part 2)
DDIA: Chp 5. Replication (Part 1)
DDIA: Chp 5. Replication (Part 2)
DDIA: Chp 7. Transactions (Part 1)
DDIA: Chp 7. Transactions (Part 2): Serializability
DDIA: Chp 8. The Trouble with Distributed Systems
DDIA: Chp 9. Consistency and Consensus
DDIA: Chp 10. Batch Processing
DDIA: Chp 11. Stream Processing
DDIA: Chp 12. The Future of Data Systems: TK
C++ exception performance three years later
About three years ago we noticed serious performance problems in C++ exception unwinding. Due to contention on the unwinding path these became more and more severe the more cores a system had, and unwinding could slow down by orders of magnitude. Due to the constraints of backwards compatibility this contention was not easy to eliminate, and P2544 discussed ways to fix this problem via language changes in C++.
But fortunately people found less invasive solutions. First, Florian Weimer changed the glibc to provide a lock-free mechanism for finding the (static) unwind tables for a given shared object, which eliminates the most serious contention for "simple" C++ programs. For example, in a micro-benchmark that calls a function with some computations (100 calls to sqrt per function invocation) and which throws with a certain probability, we previously had very poor scalability with increasing core count. With his patch we now see with gcc 14.2 on a dual-socket EPYC 7713 the following performance development (runtime in ms):
| threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 0% failure | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 42 |
| 0.1% failure | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 32 |
| 1% failure | 29 | 30 | 30 | 30 | 30 | 30 | 32 | 34 |
| 10% failure | 36 | 36 | 37 | 37 | 37 | 37 | 47 | 65 |
This is more or less perfect. 128 threads are a bit slower, but that is to be expected, as each EPYC only has 64 cores. With higher failure rates unwinding itself becomes slower, but that is still acceptable here. Thus most C++ programs are just fine.
For our use case that is not enough, though. We dynamically generate machine code at runtime, and we want to be able to pass exceptions through generated code. The _dl_find_object mechanism of glibc is not used for JITed code; instead, libgcc maintains its own lookup structure. Historically this was a simple list with a global lock, which of course had terrible performance. But through a series of patches we managed to change libgcc to use a lock-free b-tree for maintaining the dynamic unwinding frames. Using a similar experiment to the one above, but now with JIT-generated code (using LLVM 19), we get the following:
| threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 0% failure | 32 | 38 | 48 | 64 | 48 | 36 | 59 | 62 |
| 0.1% failure | 32 | 32 | 32 | 32 | 32 | 48 | 62 | 68 |
| 1% failure | 41 | 40 | 40 | 40 | 53 | 69 | 80 | 83 |
| 10% failure | 123 | 113 | 103 | 116 | 128 | 127 | 131 | 214 |
The numbers have more noise than for statically generated code, but the overall observation is the same: unwinding now scales with the number of cores, and we can safely use C++ exceptions even on machines with large core counts.
So is everything perfect now? No. First, only gcc has a fully scalable frame lookup mechanism. clang has its own implementation, and as far as I know it still does not scale properly due to a global lock in DwarfFDECache. Note that at least on many Linux distributions clang uses libgcc by default, thus the problem is not immediately obvious there, but a pure llvm/clang build will have scalability problems. And second, unwinding through JITed code is quite a bit slower, which is unfortunate. But admittedly the problem is less severe than shown here; the benchmark with JITed code simply has one more stack frame to unwind due to the way static code and JITed code interact. And it makes sense to prioritize static unwinding over dynamic unwinding frames, as most people never JIT-generate code.
Overall we are now quite happy with the unwinding mechanism. The bottlenecks are gone, and performance is fine even with high core counts. It is still not appropriate for high failure rates, something like P709 would be better for that, but we can live with the status quo.
DDIA: Chapter 11 - Stream Processing
Daily batch processes introduce significant latency, since input changes are reflected in the output only after a day. For fast-paced businesses, this is too slow. To reduce delays, stream processing occurs more frequently (e.g., every second) or continuously, where events are handled as they happen.
In stream processing, a record is typically called an event—a small, immutable object containing details of an occurrence, often with a timestamp. Polling for new events becomes costly when striving for low-latency continuous processing. Frequent polling increases overhead as most requests return no new data. Instead, systems should notify consumers when new events are available. Messaging systems handle this by pushing events from producers to consumers.
Direct messaging systems require application code to handle message loss and assume producers and consumers are always online, limiting fault tolerance. Message brokers (or message queues) improve reliability by acting as intermediaries. Producers write messages to the broker, which clients (consumers) read. Examples include RabbitMQ, ActiveMQ, Azure Service Bus, and Google Cloud Pub/Sub.
Partitioned logs store messages as append-only logs, assigning offsets to messages within each partition. Log-based brokers like Apache Kafka, Amazon Kinesis Streams, and Twitter’s DistributedLog prioritize high throughput and message ordering. Google Cloud Pub/Sub offers a similar architecture but exposes a JMS-style API. Log-based brokers suit high-throughput scenarios with fast, ordered message processing. In contrast, JMS/AMQP brokers work better when messages are expensive to process, order isn't critical, and parallel processing is needed.
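A toy model of a log-based broker helps make the partition/offset vocabulary concrete; this is a sketch of the abstraction, not of how Kafka or Kinesis are implemented.

```python
class PartitionedLog:
    # Toy log-based broker: messages are appended to per-partition logs and
    # addressed by (partition, offset).

    def __init__(self, num_partitions: int):
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, message: str) -> tuple[int, int]:
        # The same key always maps to the same partition, so per-key ordering
        # is preserved.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition: int, offset: int) -> list[str]:
        # Consumers read sequentially from their last committed offset.
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=4)
p, _ = log.append("customer-42", "cart_item_added")
log.append("customer-42", "cart_item_removed")
print(log.read(p, 0))  # both events for this key, in append order
```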
Databases and Streams
Dual writes, where applications update multiple systems (e.g., a database, search index, and cache) concurrently or sequentially, are error-prone. Race conditions can occur, leading to inconsistent states (see Fig. 11-4). A better approach is to designate one system as the leader (e.g., the database) and make others, like the search index, its followers.
Traditionally, databases treat replication logs as internal details, not public APIs. However, Change Data Capture (CDC) extracts data changes from a database, often as a stream, enabling replication to other systems. For example, a database change log can update a search index in the same order as the changes occur, ensuring consistency (see Fig. 11-5). This makes derived data systems, like search indexes, consumers of the change stream.
CDC designates the source database as the leader, with derived systems as followers. Log-based message brokers like Kafka are ideal for delivering CDC events, as they maintain order. Large-scale CDC implementations include LinkedIn's Databus, Facebook's Wormhole, and Debezium for MySQL. Tools like Kafka Connect offer connectors for various databases.
To manage disk space, CDC uses snapshots in conjunction with logs. A snapshot corresponds to a specific log position, ensuring changes after the snapshot can be applied correctly. Some CDC tools automate this; others require manual handling.
Event sourcing builds on immutability, recording all changes as an append-only log of events. This approach mirrors accounting, where transactions are never altered but corrected with new entries if mistakes occur. Immutable logs improve auditability, ease recovery from bugs, and preserve historical data for analytics. For instance, a customer adding and removing an item from a shopping cart generates two events. Though the cart's final state is empty, the log records the customer’s interest, which is valuable for analytics.
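A minimal event-sourcing sketch of the shopping-cart example: the current state is a fold over the immutable event log, and the log retains information that the final state loses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str  # "item_added" or "item_removed"
    item: str

def cart_state(events: list[Event]) -> set[str]:
    # Derive the current cart contents by folding over the immutable log.
    cart: set[str] = set()
    for e in events:
        if e.kind == "item_added":
            cart.add(e.item)
        elif e.kind == "item_removed":
            cart.discard(e.item)
    return cart

log = [Event("item_added", "book"), Event("item_removed", "book")]
print(cart_state(log))  # set(): the final state is empty...
print(len(log))         # ...but the log still records the customer's interest
```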
Event sourcing simplifies concurrency control. Instead of multi-object transactions, a single event encapsulates a user action, requiring only an atomic append to the log. Partitioning the log and application state similarly (e.g., by customer) enables single-threaded processing per partition, eliminating concurrency issues. For cross-partition events, additional coordination is needed.
Processing streams
Processing streams supports use cases like monitoring (e.g., fraud detection, trading systems, manufacturing) and involves joins, which combine data across streams and databases. The three join types are stream-stream joins for matching related events within a time window, stream-table joins for enriching events using database changelogs, and table-table joins for producing materialized view updates. Fault tolerance in stream processing involves techniques like microbatching, checkpointing, transactions, and idempotent writes to ensure reliability in long-running processes.
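As a small illustration of a stream-table join, here is a sketch that maintains a local copy of a table from a CDC-style changelog and enriches incoming events against it; the field names are made up.

```python
# Local copy of the "users" table, maintained from a CDC-style changelog.
profiles: dict[int, dict] = {}

def on_profile_change(change: dict) -> None:
    # Apply each changelog event to keep the local table current.
    profiles[change["user_id"]] = change["row"]

def on_click(event: dict) -> dict:
    # Join the incoming event against the current table state.
    profile = profiles.get(event["user_id"], {})
    return {**event, "country": profile.get("country", "unknown")}

on_profile_change({"user_id": 1, "row": {"country": "SE"}})
print(on_click({"user_id": 1, "url": "/pricing"}))  # enriched with country
print(on_click({"user_id": 2, "url": "/docs"}))     # no profile yet -> "unknown"
```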