a curated list of database news from authoritative sources

November 23, 2024

Everything is a Transaction: Unifying Logical Concurrency Control and Physical Data Structure Maintenance in Database Management Systems

This paper from CIDR'21 introduces the Deferred Action Framework (DAF). The framework aims to unify logical concurrency control and physical data structure maintenance in multi-version concurrency control (MVCC) database systems, particularly for complex maintenance tasks like garbage collection and index cleanup.

In MVCC systems, transactions and data maintenance tasks (like garbage collection) often interfere, requiring complex coordination. This can lead to issues in system performance and code maintainability, as physical structures (e.g., B+ trees) aren't inherently designed to handle multi-versioning.

The paper proposes DAF to handle deferred actions—maintenance tasks that are queued to execute only when they no longer interfere with active transactions. DAF relies on timestamp-based ordering and executes tasks only when their actions won’t affect concurrent transactions. DAF guarantees to process actions deferred at some timestamp t only after all transactions started before t have exited. This provides *epoch protection* to transactions and maintenance tasks, without requiring a separate mechanism for refreshing and advancing epochs. Using a global epoch counter might have been simpler, but it would disallow the complex, multi-deferral-based ordering of actions needed for DAF’s broader applications like schema changes and index maintenance.

DAF is implemented in NoisePage (notice Pavlo's patented database naming scheme in action). The implementation chains deferrals to handle dependencies (e.g., an index cleanup must happen before a table deletion). DAF uses a single action queue with timestamp tags and processes tasks sequentially, ensuring no conflicts arise. DAF supports multi-threaded processing, with both cooperative and dedicated action processing threads for high concurrency.

As Figure 1 shows, DAF uses a single action queue with a dedicated execution thread. DAF's transaction management process follows a specific sequence. First, a transaction worker gets an "observable" timestamp by incrementing the global counter. It then tags its deferred actions with this timestamp and adds them to the queue. When the transaction begins, the worker increments the global counter again. The action thread processes queue items by comparing each item's timestamp against the oldest active transaction's timestamp. It executes an item when its timestamp is smaller or when no active transactions exist, blocks when this condition isn't met, and waits when the queue is empty.
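As a rough illustration of this queue discipline, here is a minimal Python sketch. The names (DeferredActionQueue, defer, oldest_active_txn) are hypothetical, not NoisePage's actual API; the sketch only mirrors the tag-enqueue-compare loop described above.

```python
import collections
import time

class DeferredActionQueue:
    """Illustrative sketch of DAF's single action queue (not NoisePage's API)."""

    def __init__(self):
        self.queue = collections.deque()   # (timestamp, action), enqueued in timestamp order

    def defer(self, timestamp, action):
        # A transaction worker tags the action with its observed timestamp and enqueues it.
        self.queue.append((timestamp, action))

    def run(self, oldest_active_txn):
        """Dedicated action thread: drain the queue head once it becomes safe.

        oldest_active_txn() returns the start timestamp of the oldest live
        transaction, or None when no transactions are active.
        """
        while True:
            if not self.queue:
                time.sleep(0.001)                  # queue empty: wait
                continue
            ts, action = self.queue[0]
            oldest = oldest_active_txn()
            if oldest is None or ts < oldest:      # no live transaction started before ts
                self.queue.popleft()
                action()                           # e.g., reclaim versions, unlink index entries
            else:
                time.sleep(0.001)                  # head not yet safe: back off
```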

For complex operations like table deletion under snapshot isolation, DAF uses chained deferrals. For example, when transaction T1 drops a table, it defers action T1-A1, which when executed defers the actual deletion T1-A2. This ensures T1-A2 executes after all concurrent operations (like T2's inserts) complete. This approach solves ordering issues in concurrent scenarios while maintaining memory safety.
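Continuing the sketch above, the chained deferral for dropping a table might look like this; unlink_indexes, release_blocks, and current_timestamp are placeholders, not real APIs.

```python
def drop_table(table, queue, current_timestamp):
    """Illustrative chained deferral for DROP TABLE (T1 defers A1, which defers A2)."""
    def a2():
        # T1-A2: the actual physical deletion, safe only after everything
        # concurrent with A1 (e.g., T2's inserts) has also drained.
        table.unlink_indexes()
        table.release_blocks()

    def a1():
        # T1-A1: runs once no transaction older than T1 remains; instead of
        # deleting, it re-defers the real work with a fresh timestamp.
        queue.defer(current_timestamp(), a2)

    queue.defer(current_timestamp(), a1)
```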

Experiments show DAF’s performance is on par with or exceeds specialized implementations, particularly in high-throughput scenarios like TPC-C benchmarks.

Beyond garbage collection, DAF can support latch-free block transformations and non-blocking schema changes. While simple schema changes (like renaming tables) can proceed without blocking transactions, complex changes that require table rewrites typically block all queries. Supporting true non-blocking schema changes requires managing multiple concurrent schema versions, which complicates transaction visibility and query planning due to the need for multiple plan cache entries.

November 22, 2024

Replica Preserve Commit Order and Measuring Lag

With multi-threaded replication (MTR), a replica can commit transactions in the same order as the source, or not. This is determined by the sysvar replica_preserve_commit_order (RPCO). As of MySQL v8.0.27 (released October 2021) it’s ON by default, but it was OFF by default for several years prior. In either case, it’s relatively new compared to 20+ years of single-threaded replication for which commit order was not an issue or option. But with MTR, it’s important to understand the effects of RPCO, especially with respect to the focus of this three-part series: replication lag.

November 21, 2024

DDIA: Chp 10. Batch Processing

Batch processing allows large-scale data transformations through bulk-synchronous processing. The simplicity of this approach made it possible to build reliable, scalable, and maintainable applications. If you recall, "reliable-scalable-maintainable" was what we set out to learn when we began the DDIA book.

The story of MapReduce starts when Google engineers realized there were a lot of repetitive tasks involved in computing over large data. These tasks often involved individually processing elements and then gathering and fusing their output. Interestingly, this bears a striking resemblance to the electromechanical IBM card-sorting machines of the 1940-50s. MapReduce also got some inspiration from the map and reduce operations in Lisp: (map square '(1 2 3 4)) gives us (1 4 9 16), and (reduce + '(1 4 9 16)) gives us 30. The key innovation of Google's MapReduce framework was its ability to simplify parallel processing by abstracting away complex network communication and failure handling. By separating network communication from application logic, MapReduce shielded developers from intricate distributed computing challenges (partial failures, distributed state management, etc.). Tasks could be automatically retried without disrupting the core application workflow.

The MapReduce framework's elegance lay in its simplicity: providing a straightforward mechanism for parallelizing computations through map and reduce operations. Though MapReduce had many limitations/inefficiencies and Google retired the approach by 2014, its impact on distributed computing still remains important. It is hard to deny that MapReduce paved the way to commodity distributed computing.

Before beginning to explain MapReduce, the book takes a detour through batch processing in Unix/Linux: modular commands whose output is piped as input to the next command in the sequence. That section is skippable. By the way, I think the original MapReduce paper from OSDI 2004 is worth a read itself. It is written clearly and is easy to follow.


MapReduce and Distributed Filesystems

Hadoop MapReduce leverages HDFS (Hadoop Distributed File System), an open-source reimplementation of the Google File System (GFS). Built on the shared-nothing principle, HDFS shifted data storage away from expensive custom hardware toward cheap, off-the-shelf nodes. HDFS also enabled indiscriminate data dumping. Unlike traditional massively parallel processing (MPP) databases requiring meticulous upfront data modeling, HDFS allowed raw data storage with interpretation deferred to consumption. This "schema-on-read" approach, dubbed the "sushi principle" (raw data is better), enabled flexible data transformation and multiple interpretations.

HDFS consists of a daemon process running on each machine to serve files stored on that machine, with a central NameNode tracking file block placements across machines. This simple design enables creation of a massive fault-tolerant filesystem by leveraging distributed storage on many machines and implementing block replication.

Ok, back to MapReduce. The MapReduce job model is centered around two callback functions: the mapper and reducer. Mappers process individual input records independently, extracting key-value pairs, while reducers aggregate and process values associated with specific keys. These two operations, when chained as multi-stage processing jobs, are sufficient to implement complex data transformations.
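To make the model concrete, here is the canonical word-count example against a toy, single-process runner (a Python sketch, not Hadoop's actual API): mappers emit (word, 1) pairs and the reducer sums the counts per key.

```python
from collections import defaultdict

def mapper(record):
    # Extract key-value pairs from one input record.
    for word in record.split():
        yield word, 1

def reducer(key, values):
    # Aggregate all values observed for one key.
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # "Shuffle": group all mapper outputs by key before handing them to the reducer.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = {}
    for key, values in sorted(groups.items()):
        for out_key, out_value in reducer(key, values):
            output[out_key] = out_value
    return output

print(run_mapreduce(["the quick brown fox", "the lazy dog"], mapper, reducer))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```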


MapReduce Job Execution

MapReduce job execution strategically assigns map tasks to machines storing input file replicas, minimizing network data transfer and optimizing computational resource utilization. First, the framework copies the necessary application code to the assigned machines. Map tasks then process input files record by record, generating key-value pair outputs. These outputs are partitioned across reduce tasks by hashing the keys, ensuring that all records with the same key are processed by the same reducer.

Data sorting occurs in stages due to massive dataset sizes. Mappers partition output by reducer, writing sorted files locally using techniques similar to SSTables. When mappers complete, reducers fetch these sorted partition files through a process called the "shuffle". Reducers merge the mapper files while maintaining sort order, apply the user-provided logic, and process records with identical keys sequentially.
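A small sketch of these shuffle mechanics, under the assumption that each mapper writes one sorted run per reducer: keys are assigned to reducers by hashing, and each reducer merge-sorts the runs it fetched so identical keys arrive consecutively (the function names here are illustrative).

```python
import heapq

NUM_REDUCERS = 4

def partition(key):
    # Each mapper routes a key to a reducer by hashing, so every record
    # with the same key lands at the same reducer.
    return hash(key) % NUM_REDUCERS

def reduce_merge(sorted_runs):
    """Merge the sorted runs a reducer fetched from all mappers,
    yielding (key, [values]) groups in sorted key order."""
    current_key, values = None, []
    for key, value in heapq.merge(*sorted_runs):   # stays sorted, equal keys adjacent
        if current_key is not None and key != current_key:
            yield current_key, values
            values = []
        current_key = key
        values.append(value)
    if current_key is not None:
        yield current_key, values

runs = [[("dog", 1), ("the", 1)], [("the", 1)]]    # two mappers' sorted runs for one reducer
print(list(reduce_merge(runs)))                    # [('dog', [1]), ('the', [1, 1])]
```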

MapReduce jobs are often chained together in workflows, where the output of one job becomes the input of the next. Since Hadoop MapReduce lacks native workflow support, jobs are linked by configuring output and input directory names. Large organizations often had complex workflows with hundreds of MapReduce jobs. To manage these data flows, several higher-level tools were developed, including Pig, Hive, Cascading, Crunch, and FlumeJava.


Beyond MapReduce

Back in the 2010s I was flabbergasted as to why people would use MapReduce, because it was so inefficient: so much data movement, so much need for synchronization, and such a limited language (just map and reduce) to express computation. Yes, MapReduce provided parallelism and scalability, but at what COST? Frank McSherry argued that much of that scalability came from "poor baselines and low expectations".

But people cared more about usability, and didn't care if the MapReduce job took an extra 3 hours here and 5 hours there. They were ready to wait anyway, since this is offline batch processing. MapReduce shielded them from distributed systems challenges and from thinking too hard about developing an efficient algorithm, and that was more important to them. MapReduce was too simple and coarse-grained to be efficient, but it paved the way to more efficient dataflow engines. Respect what came before.

New dataflow engines like Spark, Tez, and Flink overcame some of MapReduce's limitations. Unlike MapReduce's rigid map-reduce model, these engines enable more flexible operator connections and smarter data processing. They optimize computational resources by selectively sorting, intelligently scheduling tasks, reducing unnecessary data movement, and improving data locality. By explicitly modeling the data flow, these systems are able to make nuanced scheduling decisions to speed-up processing. The engines also introduce advanced join strategies and minimize computational overhead by supporting in-memory intermediate states, reducing I/O demands, and enabling faster task startup.


November 20, 2024

DBSP: Automatic Incremental View Maintenance for Rich Query Languages

Incremental computation represents a transformative (!) approach to data processing. Instead of recomputing everything when your input changes slightly, incremental computation aims to reuse the original output and efficiently update the results. Here, "efficiently" means performing work proportional only to the size of the input and output changes.

This paper introduces DBSP, a programming language inspired by signal processing (hence the name DB-SP). DBSP is simple, yet it offers extensive computational capabilities. With just four operators, it covers complex database queries, including the entire relational algebra, set and multiset computations, nested relations, aggregations, recursive queries, and streaming computations.


Basic DBSP operators 

The language is designed to capture computation on streams. Streams are represented as infinite vectors indexed by consecutive time. Each stream can represent values of any type and incorporates basic mathematical operations like addition, subtraction, and a zero element. DBSP introduces incremental computing operators such as lifting ($\uparrow$), which converts scalar functions to stream functions; delay ($z^{-1}$), which shifts stream values; differentiation (${D}$), which computes stream changes; and integration (${I}$), which reconstructs original streams from change streams. Integration and differentiation are inverses of each other, and this shows in their stream circuit representation.
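Concretely, following the paper's definitions, with streams indexed from $t = 0$ and $s[-1]$ taken to be zero: $z^{-1}(s)[t] = s[t-1]$, ${D}(s)[t] = s[t] - s[t-1]$, and ${I}(s)[t] = \sum_{i \leq t} s[i]$. In particular, ${D}({I}(s)) = {I}({D}(s)) = s$.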

   



DBSP is a simplified version of differential dataflow. I talked about differential dataflow in 2017 here. Differential dataflow represents time as a lattice, whereas DBSP represents it as a single array. In other words, in DBSP time is consecutive and each state has a unique predecessor. In differential dataflow, time is defined as a partial order to capture causality and concurrency. That is a better model for distributed systems, but it introduces a lot of complexity. DBSP makes a big simplification when it assumes linear synchronous time, and that probably opens up issues related to managing out-of-order inputs, incorporating data from multiple streams, and tracking progress.

But the simplification buys us powerful properties. The chain rule below emerges as a particularly powerful composition technique for incremental computation. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This is big. This allows independent optimization of query components, enabling efficient and modular query processing. 
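Concretely, the chain rule from the paper says that incrementalization distributes over composition: $(Q_1 \circ Q_2)^\Delta = Q_1^\Delta \circ Q_2^\Delta$.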



Application to database incremental view maintenance problem

The paper argues that traditional databases can be reframed as streaming databases by representing linearizable transactions as streams and database states as evolving snapshots. With this framing, views become dynamically/incrementally computed entities over this stream. And DBSP gives us simple and principled Incremental View Maintenance (IVM) over a database.

An IVM algorithm needs to compute the changes to the view given the changes to the database. Given a query Q, we can define its incremental version $Q^\Delta$ using stream integration and differentiation: $Q^\Delta = {D} \circ {\uparrow \!Q} \circ {I}$. The incremental version of a query is a stateful streaming operator that computes directly on changes and produces changes; in the paper this is depicted graphically as a stream circuit.

Instead of maintaining the output view entirely, DBSP proposes generating deltas as the output of the computation. The idea that both inputs and outputs to an IVM system are streams of changes is key to the symmetry of the solution and the fundamental reason that the chain rule exists.

But we have one more kink to iron out. DBSP operations assume that the computations are done over commutative groups, but databases are not commutative groups: they compute over sets. Fortunately, there is a known trick to solve this problem: represent each table as a Z-set. I talked about Z-sets (counting sets) briefly when describing Walter recently. In a Z-set each row has an additional weight, and the weight shows how many times the row belongs to the table. In Z-sets the weights can be both positive and negative. This is great, because we can leverage this to represent delta changes to tables: for example, a positive weight indicates a row is added/updated, and a negative weight indicates a row is removed.
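As a minimal sketch (Python dictionaries standing in for Z-sets, not the actual DBSP/Feldera API), a Z-set maps rows to integer weights, and a delta is just another Z-set with positive and negative weights:

```python
def zset_add(a, b):
    """Pointwise addition of two Z-sets; rows whose weights cancel disappear."""
    out = dict(a)
    for row, w in b.items():
        out[row] = out.get(row, 0) + w
        if out[row] == 0:
            del out[row]
    return out

def zset_negate(a):
    """Negation turns an insert delta into a delete delta (and vice versa)."""
    return {row: -w for row, w in a.items()}

table = {("alice", 30): 1, ("bob", 25): 1}
delta = {("bob", 25): -1, ("carol", 41): 1}   # delete bob, insert carol
print(zset_add(table, delta))                 # {('alice', 30): 1, ('carol', 41): 1}
```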

Moreover, we can map Z-sets back to database tables by treating each row with positive weight as present (with weight one) and each row with negative weight as deleted. This is also where the DISTINCT operation comes in handy. It "removes" duplicates from multisets, and it also eliminates elements with negative weights. The paper also shows how to lift the DISTINCT operator so that it is efficiently incrementalizable.
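Continuing the same sketch, DISTINCT can be modeled as:

```python
def zset_distinct(a):
    # Recover a plain set/table from a Z-set: keep each row with positive
    # weight exactly once, and drop rows with zero or negative weight.
    return {row: 1 for row, w in a.items() if w > 0}
```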

We can use Z-sets to implement all of SQL, essentially the entire relational algebra. Moreover, in this implementation each primitive operator has an efficient incremental version that does work proportional only to the size of the change.
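Putting the pieces together, here is the generic (unoptimized) construction $Q^\Delta = {D} \circ {\uparrow \!Q} \circ {I}$ as a Python sketch, reusing the Z-set helpers above: it consumes input deltas and produces output deltas. Note that this generic form literally re-applies Q to the integrated state; the point of the paper's per-operator rules is to rewrite such circuits so that the work becomes proportional to the deltas rather than the full state.

```python
def incrementalize(Q):
    """Generic Q^Δ over Z-sets: integrate the input, apply Q, differentiate the output."""
    state_in = {}            # I: running integral of the input deltas (the current table)
    prev_out = {}            # previous output, remembered so D can diff against it
    def q_delta(input_delta):
        nonlocal state_in, prev_out
        state_in = zset_add(state_in, input_delta)              # I
        new_out = Q(state_in)                                   # ↑Q
        out_delta = zset_add(new_out, zset_negate(prev_out))    # D
        prev_out = new_out
        return out_delta
    return q_delta

# Example: incrementally maintain a selection (filter) view over (name, age) rows.
adults = incrementalize(lambda t: {row: w for row, w in t.items() if row[1] >= 30})
print(adults({("alice", 30): 1, ("bob", 25): 1}))   # {('alice', 30): 1}
print(adults({("carol", 41): 1}))                   # {('carol', 41): 1}  -- only the change
```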



Incremental View Maintenance (IVM) algorithm

Now, let's put that chain rule compositionality to good use. Here is the algorithm.

Consider a relational query Q defining a view V. To create a circuit that maintains V incrementally, we apply the following mechanical steps:

  1. Translate Q into a circuit using the rules in Table 1.
  2. Apply distinct elimination rules until convergence.
  3. Lift the whole circuit, converting it to a circuit operating on streams.
  4. Incrementalize the whole circuit by "surrounding" it with I and D (as we described above).
  5. Apply the chain rule recursively on the query structure to obtain an incremental implementation.

This algorithm is deterministic and its running time is proportional to the number of operators in the query.
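For example, continuing the illustrative Python sketches above, applying steps 4 and 5 to a two-stage query (a filter followed by a projection) just means incrementalizing each stage and composing the results, exactly as the chain rule promises:

```python
# Reusing incrementalize(), zset_distinct(), and the Z-set helpers from the sketches above.

def project_name(t):
    # Projection collapses rows onto their first column, summing weights.
    out = {}
    for row, w in t.items():
        out[(row[0],)] = out.get((row[0],), 0) + w
    return zset_distinct(out)

q1_delta = incrementalize(lambda t: {row: w for row, w in t.items() if row[1] >= 30})  # filter
q2_delta = incrementalize(project_name)                                                # project
q_delta  = lambda d: q2_delta(q1_delta(d))                                             # chain rule

print(q_delta({("alice", 30): 1, ("bob", 25): 1}))  # {('alice',): 1}
print(q_delta({("alice", 41): 1}))                  # {}  -- 'alice' is already in the view
```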


Wrapping up

So what do we have so far? As long as we can express a new operator as a DBSP circuit, we can (1) define its incremental version, (2) apply the incrementalization algorithm to obtain an efficient incremental implementation, and (3) be confident that it composes with any other operators.

I skip the incremental recursive queries discussion from the paper. Here is a brief summary from the paper:  "We compile a recursive query into a circuit that employs incremental computation internally (using semi-naïve evaluation), to compute the fixed point. Here we combine these results to construct a circuit that evaluates a recursive query incrementally. The circuit receives a stream of updates to input relations, and for every update recomputes the fixed point. To do this incrementally, it preserves the stream of changes to recursive relations produced by the iterative fixed point computation, and adjusts this stream to account for the modified inputs. Thus, every element of the input stream yields a stream of adjustments to the fixed point computation, using nested streams."


The paper is a tour-de-force.  The researchers validate their approach through Lean theorem proving, developing an open-source Rust implementation with an MIT-licensed runtime. Their SQL-to-incremental-circuit compiler provides practical proof of the concept's viability.


The DBSP (now Feldera) Rust library provides APIs for basic algebraic data types such as groups, finite maps, Z-sets, and indexed Z-sets. A separate circuit construction API allows users to create DBSP circuits by placing operator nodes (corresponding to boxes in our diagrams) and connecting them with streams, which correspond to the arrows in our diagrams. They also built a SQL to DBSP compiler, which translates standard SQL queries into DBSP circuits. The compiler implements the algorithm above to generate a streaming version of any SQL query. (Since I linked to the DBSP/Feldera repo, for completeness here is a link to the Differential Dataflow repo.)

This 12-minute VLDB conference presentation video does a great job of explaining the basic concepts of DBSP, so I recommend watching it. The paper received the best paper award at VLDB'23.


This is cool work! As I mentioned in the differential dataflow comparison, DBSP makes a simplification by assuming consecutive/synchronous time, but as a result it gets a very clean yet powerful modular system, which seems to get good practical mileage in the real world. In 2001, I was a summer intern at IBM Research T.J. Watson in Westchester, NY. I was interning with Rob Strom to do incremental view maintenance (IVM) research on IBM's publish-subscribe system, Gryphon. We made some progress, but what I got out of that project was that this is a very hard problem. Glad to see science is doing its thing, incrementally, but surely.


November 19, 2024

Offset Considered Harmful or: The Surprising Complexity of Pagination in SQL


For most applications, the job of a database system is to ingest all your data and give you access to the data you care about. Unfortunately, sometimes this is a large amount of data, and a typical application should only carefully take small sips from the fire hose of data. Generally, controlling the amount of output data is hard, but SQL offers the special limit syntax, which can guarantee that the result is bounded, and you never turn on the fire hose’s full blast. Unfortunately, the story does not end here, since there are many caveats.

November 15, 2024

Active and influential NYC infrastructure people

These are some of the most influential (mostly due to experience or expertise) and active folks (I actually see them attend events) in the NYC infrastructure scene (that I have a personal connection to).

If you're running a dinner or are just looking to meet interesting people in NYC in software infrastructure, consider this list and feel free to mention "Phil said you are awesome".

I've normalized titles a little bit but I say every title in the most generous way. These folks are brilliant.

This list is intentionally randomized. Also not a complete list. I've surely forgotten (let alone not yet met) great folk.

November 13, 2024