a curated list of database news from authoritative sources

November 20, 2024

DBSP: Automatic Incremental View Maintenance for Rich Query Languages

Incremental computation represents a transformative (!) approach to data processing. Instead of recomputing everything when your input changes slightly, incremental computation aims to reuse the original output and efficiently update the results. Efficiently means performing work proportional only to input and output changes.

This paper introduces DBSP, a programming language inspired by signal processing (hence the name DB-SP). DBSP is  simple, yet it offers extensive computational capabilities. With just four operators, it covers complex database queries, including entire relational algebra, set and multiset computations, nested relations, aggregations, recursive queries, and streaming computations.


Basic DBSP operators 

The language is designed to capture computation on streams. Streams are represented as infinite vectors indexed by consecutive time. Each stream can represent values of any type and incorporates basic mathematical operations like addition, subtraction, and a zero element. DBSP introduces incremental computing operators such as lifting ($\uparrow$) that converts scalar functions to stream functions, delay ($z^{-1}$) that shifts stream values, differentiation (${D}$) that compute stream changes, and integration (${I}$) that reconstruct original streams from change streams. Integration and differentiation are inverses of each other, and this shows in their stream circuit representation. 

   



DBSP is a simplified version of differential dataflow. I talked about differential dataflow in 2017 here. Differential dataflow represented time as a lattice, DBSP represents it as a single array. In other words, in DBSP time is consecutive and each state requires a unique predecessor. In differential dataflow, time is defined as a partial order to capture causality and concurrency. That is a better model for distributed systems, but that introduces a lot of complexity. DBSP makes a big simplification when it assumes linear synchronous time, and probably opens up issues related to managing out-of-order inputs, incorporating data from multiple streams, and tracking progress. 

But the simplification buys us powerful properties. The chain rule below emerges as a particularly powerful composition technique for incremental computation. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This is big. This allows independent optimization of query components, enabling efficient and modular query processing. 



Application to database incremental view maintenance problem

The paper argues that traditional databases can be reframed as streaming databases by representing linearizable transactions as streams and database states as evolving snapshots. With this framing, views become dynamically/incrementally computed entities over this stream. And DBSP gives us simple and principled Incremental View Maintenance (IVM) over a database.

An IVM algorithm needs to compute the changes to the view given the changes to the database. Given a query Q, we can define its incremental version $ Q^\Delta$ using stream integration and differentiation: $Q^\Delta = {D} \circ {\uparrow \!Q} \circ {I}$. The incremental version of a query is a stateful streaming operator which computes directly on changes and produces changes. This is depicted graphically as:

Instead of maintaining the output view entirely, DBSP proposes generating deltas as the output of the computation. The idea that both inputs and outputs to an IVM system are streams of changes is key to the symmetry of the solution and the fundamental reason that the chain rule exists.

But we have one more kink to iron out. DBSP operations assume that the competitions are done over commutative groups but databases are not commutative groups: they compute over sets. Fortunately, there is a known trick to solve this problem: Represent each table and as a Z-set. I talked about Z-sets (counting sets) briefly when describing Walter recently. In a Z-set each row has an additional weight and the weight shows how many times the row belongs to the table. In Z-sets the weights can be both positive and negative. This is great, because we can leverage this to represent delta changes to tables for example a positive weight indicator says a row is added/updated and a negative weight indicates a row is removed.

Moreover, we can represent database tables with the mapping where each weight is one if positive, and deleted if negative. This is also where the DISTINCT operation comes handy. It "removes" duplicates from multisets, and it also eliminates elements with negative weights. The paper also talks about a way to implement/lift DISTINCT operator as efficiently incrementalizable. 

We can use Z-sets to implement all of SQL, essentially entire relational algebra. Moreover in this implementation, each primitive operator has an efficient incremental version which does only work proportional to the size of the change.



Incremental View Maintenance (IVM) algorithm

Now, let's put that chain rule compositionality to good use. Here is the algorithm.

Consider a relational query Q defining a view V. To create a circuit that maintains V incrementally, we apply the following mechanical steps:

  1. Translate Q into a circuit using the rules in Table 1.
  2. Apply distinct elimination rules until convergence.
  3. Lift the whole circuit converting it to a circuit operating on streams.
  4. Incrementalize the whole circuit "surrounding" it with I and D (as we described above).
  5. Apply the chain rule recursively on the query structure to obtain an incremental implementation.

This algorithm is deterministic and its running time is proportional to the number of operators in the query.


Wrapping up

So what do we have so far? As long as we can express a new operator as a DBSP circuit, we can (1) define its incremental version, (2) apply the incrementalization algorithm to obtain an efficient incremental implementation, and (3) be confident that it composes with any other operators.

I skip the incremental recursive queries discussion from the paper. Here is a brief summary from the paper:  "We compile a recursive query into a circuit that employs incremental computation internally (using semi-naïve evaluation), to compute the fixed point. Here we combine these results to construct a circuit that evaluates a recursive query incrementally. The circuit receives a stream of updates to input relations, and for every update recomputes the fixed point. To do this incrementally, it preserves the stream of changes to recursive relations produced by the iterative fixed point computation, and adjusts this stream to account for the modified inputs. Thus, every element of the input stream yields a stream of adjustments to the fixed point computation, using nested streams."


The paper is a tour-de-force.  The researchers validate their approach through Lean theorem proving, developing an open-source Rust implementation with an MIT-licensed runtime. Their SQL-to-incremental-circuit compiler provides practical proof of the concept's viability.


The DBSP (now Feldera) Rust library provides APIs for basic algebraic data types: such as groups, finite maps, Z-set, indexed Z-set. A separate circuit construction API allows users to create DBSP circuits by placing operator nodes (corresponding to boxes in our diagrams) and connecting them with streams, which correspond to the arrows in our diagrams. They also built a SQL to DBSP compiler, which translates standard SQL queries into DBSP circuits. The compiler implements the Algorithm above to generate a streaming version of any SQL query. (Since I linked to the DBSP/Feldera repo, for completeness here is a link to the Differential Dataflow repo.)

This 12 minute VLDB conference presentation video does a great job of explaining the basic concepts of DBSP, so I recommend watching this. The paper received best paper in VLDB'23.


This is cool work! As I mentioned in the differential dataflow comparison, DBSP makes simplification to consecutive/synchronous time, but as a result it gets a very clean yet powerful modular system, which seems to get good practical mileage in real world. In 2001, I was a summer intern at IBM Research T.J. Watson, Westchester NYC. I was interning with Rob Strom to add an incremental view maintenance (IVM) research on IBM's publish-subscribe system, Gryphon. We had some progress, but what I got out of that project was that this is a very hard problem. Glad to see science is doing its thing, incrementally, but surely. 


November 19, 2024

Offset Considered Harmful or: The Surprising Complexity of Pagination in SQL

Offset Considered Harmful or: The Surprising Complexity of Pagination in SQL

For most applications, the job of a database system is to ingest all your data and give you access to the data you care about. Unfortunately, sometimes this is a large amount of data, and a typical application should only carefully take small sips from the fire hose of data. Generally, controlling the amount of output data is hard, but SQL offers the special limit syntax, which can guarantee that the result is bounded, and you never turn on the fire hose’s full blast. Unfortunately, the story does not end here, since there are many caveats.

November 15, 2024

Active and influential NYC infrastructure people

These are some of the most influential (mostly due to experience or expertise) and active folks (I actually see them attend events) in the NYC infrastructure scene (that I have a personal connection to).

If you're running a dinner or are just looking to meet interesting people in NYC in software infrastructure, consider this list and feel free to mention "Phil said you are awesome".

I've normalized titles a little bit but I say every title in the most generous way. These folks are brilliant.

This list is intentionally randomized. Also not a complete list. I've surely forgotten (let alone not yet met) great folk.

November 13, 2024

November 12, 2024

Grouping and Aggregations on Vitess

I love my job. One of the best feelings is when I find an interesting paper and use it to solve a real problem. It feels like I found a cheat code. Instead of having to do a lot of hard thinking, I can just stand on the shoulders of really big people and take a shortcut. Here, I want to share a recent project that I could solve using a public paper.

November 09, 2024

Fixing some of the InnoDB scan perf regressions in a MySQL fork

I recently learned of Advanced MySQL, a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions for write heavy workloads that landed in InnoDB after MySQL 8.0.28.

In response to my results, the project lead filed a bug for performance regressions and then quickly came up with a diff. The bug in this case is for regressions that are most obvious during full table scans and the problems arrived in MySQL 8.0.29 and 8.0.30 -- see bug 111538 and this post. The bug is closed for upstream but the perf regressions remain so I am excited to see the community working to solve this problem.

tl;dr

  • Advanced MySQL with the fix removes much of the regression in scan performance
Builds

I tried 4 builds

  • my8028 - upstream MySQL 8.0.28
  • my8040 - upstream MySQL 8.0.40
  • my8040adv_pre - Advanced MySQL 8.0.40 without the fix (without d347cdb)
  • my8040adv_post - Advanced MySQL 8.0.40 with the fix (at d347cdb)
Hardware

The servers are

  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4). 
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
  • bee
    • Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, Ubuntu 22.04 and ext4 on NVMe

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with ...
  • dell32 - 8 tables, 10M rows per table and 24 threads
  • ax162-s - 8 tables, 10M rows per table and 40 threads
  • bee - 1 table, 30M rows and 1 thread
Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is the QPS from MySQL 8.0.28
  • my version is one of the other versions
Here I only share the results for the scan microbenchmark.

Results: dell32

Summary
  • Summary
    • QPS with the fix in Advanced MySQL is ~9% better than without the fix
    • QPS with the fix in Advanced MySQL is ~2% better than my8040.
    • I am not sure why my8040adv_pre did much worse than my8040
From the relative QPS results the QPS with my8040adv_pre was ~15% less than my8028. But my8040adv_post is only ~7% slower than my8028 so it removes half of the regression.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_pre
col-3 : my8040adv_post

col-1   col-2   col-3
0.91    0.85    0.93    scan

From vmstat and iostat metrics CPU overhead for my8040adv_pre was ~22% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~8% larger than my8028. This is great.

--- absolute
cpu/o cs/o r/o rKB/o wKB/o o/s dbms
0.093496 3.256 0 0 0.006 246 my8028
0.106105 4.065 0 0 0.006 225 my8040
0.113878 4.344 0 0 0.006 208 my8040adv_pre
0.101104 3.978 0 0 0.006 228 my8040adv_post
--- relative to first result
1.13 1.25 1 1 1.00 0.91 my8040
1.22 1.33 1 1 1.00 0.85 my8040adv_pre
1.08 1.22 1 1 1.00 0.93 my8040adv_post

Results: ax162-s

Summary
  • QPS is ~18% larger with the fix in Advanced MySQL
  • CPU overhead is ~15% smaller with the fix
From the relative QPS results the QPS with my8040adv_pre was the same as my8040 and both were ~17% slower than my8028. But my8040adv_post is only ~2% slower than my8028 which is excellent.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_pre
col-3 : my8040adv_post

col-1   col-2   col-3
0.83    0.83    0.98    scan

From vmstat and iostat metrics CPU overhead for my8040 and my8040adv_pre were ~20% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~3% larger than my8028. This is great.

--- absolute
cpu/o cs/o r/o rKB/o wKB/o o/s dbms
0.018767 0.552 0 0 0.052 872 my8028
0.022533 0.800 0 0 0.013 725 my8040
0.022499 0.808 0 0.001 0.034 727 my8040adv_pre
0.019305 0.731 0 0 0.03 851 my8040adv_post
--- relative to first result
1.20 1.45 1 1 0.25 0.83 my8040
1.20 1.46 1 inf 0.65 0.83 my8040adv_pre
1.03 1.32 1 1 0.58 0.98 my8040adv_post

Results: bee

Summary:
  • QPS is ~17% larger with the fix in Advanced MySQL
  • CPU overhead is ~15% smaller with the fix
I did not test my8040adv_pre on this server.

From the relative QPS results the QPS with my8040 is ~22% less than my8028. But QPS from my8040adv_post is only ~9% less than my8028. This is great.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_post

col-1   col-2
0.78    0.91    scan

From vmstat and iostat metrics CPU overhead for my8040 was ~28% larger than my8028. But with the fix the CPU overhead for my8040adv_post is only ~3% larger than my8028. This is great.

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.222553        2.534   0       0.001   0.035   55      my8028
0.285792        7.622   0       0       0.041   43      my8040
0.246404        6.475   0       0       0.036   50      my8040adv_post
--- relative to first result
1.28            3.01    1       0.00    1.17    0.78    my8040
1.11            2.56    1       0.00    1.03    0.91    my8040adv_post

RocksDB benchmarks: large server, leveled compaction

I recently shared benchmark results for RocksDB a few weeks ago for both leveled and universal compaction on a small server. This post has results from a large server with leveled compaction. 

tl;dr

  • there are a few regressions from bug 12038
  • QPS for overwrite is ~1.5X to ~2X better in 9.x than 6.0 (ignoring bug 12038)
  • otherwise QPS in 9.x is similar to 6.x

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3
Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • fillseq is worse from 6.0 to 8.0 but stable since then
  • overwrite has large improvements late in 6.0 and small improvements since then
  • fwdrangeww has small improvements in early 7.0 and is stable since then
  • revrangeww and readww are stable from 6.0 through 9.
Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • bug 12038 explains the drop in throughput for overwrite since 8.6.7
  • otherwise QPS in 9.x is similar to 6.0
Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

This chart shows the relative QPS for a given version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • the QPS drop for overwrite in 8.6.7 occurs because the db_bench client wasn't updated to use the new default value for compaction readahead size
  • QPS for overwrite is ~2X better in 9.x relative to 6.0
  • otherwise QPS in 9.x is similar to 6.0
todo








Efficient MySQL Performance In 10 Sentences

Don’t have time to read Efficient MySQL Performance? Here’s the book (10 chapters) in one-liners.

  1. Performance is query response time.
  2. Proper left-most indexing is required for performance.
  3. The less data, the better.
  4. Access patterns (part of the workload) help or hinder performance.
  5. Sharding is how to scale writes when single-node performance is truly reached.
  6. Server metrics reflect how the app workload causes MySQL to work.
  7. Replication lag is data loss.
  8. Locks are held until a transaction commits, so commit quickly.
  9. There are many other challenges that you might need to address—sorry.
  10. MySQL in the cloud is slower and more expensive, so performance is more important than ever.

November 07, 2024