A curated list of database news from authoritative sources

November 06, 2024

Application Architecture: Combining DynamoDB and Tinybird

Most applications tend to be built around a “transactional” core. Buy a thingamajig. Cancel a whoosiwatsie. Edit a whatchamacallit. You might be booking flights, posting bird pics on Insta, or patronizing the local Syrian restaurant for lunch (tabouleh, anyone?). While CRUD transactions are the foundation of applications, many are now starting to offer (or see user demand for) analytical experiences: travellers want to see price change history to find the best time to fly, content creators w…

RocksDB on a big server: LRU vs hyperclock

This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the block cache type (LRU vs hyperclock) and a few other configuration choices for a CPU-bound workload. A previous post with great results for the hyperclock block cache is here.

tl;dr

  • read QPS is up to ~3X better with auto_hyper_clock_cache vs LRU
  • read QPS is up to ~1.3X better with the per-level fanout set to 32 vs 8
  • read QPS drops by ~15% as the background write rate increases from 2 MB/s to 32 MB/s

Software

I used RocksDB 9.6, compiled with gcc 11.4.0.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and are repeated for 1, 10, 20 and 40 threads. 

I focus on the readwhilewriting benchmark where performance is reported for the reads (point queries) while there is a fixed rate for writes done in the background. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

To save time I did not run the fwdrangewhilewriting benchmark. Were I to repeat this work I would include it because the results from it would be interesting for a few of the configuration options I compared.

I did tests to understand the following (a sketch of the relevant db_bench flags follows this list):

  • LRU vs auto_hyper_clock_cache for the block cache implementation
    • LRU is the original implementation. The code was simple, which is nice. The implementation for LRU is sharded with a mutex per shard and that mutex can become a hot spot. The hyperclock implementation is much better at avoiding hot spots.
  • per level fanout (8 vs 32)
    • By per level fanout I mean the value of --max_bytes_for_level_multiplier which determines the target size difference between adjacent levels. By default I use 8, while 10 is also a common choice. Here I compare 8 vs 32. When the fanout is larger the LSM tree has fewer levels -- meaning there are fewer places to check for data which should reduce CPU overhead and increase QPS.
  • background write rate
    • I repeated tests with the background write rate (--benchmark_write_rate_limit) set to 2, 8 and 32 MB/s. With a higher write rate there is more chance for interference between reads and writes. The interference might be from mutex contention, compaction threads using more CPU, more L0 files to check or more data in levels L1 and larger.
  • target size for L0
    • By target size I mean the number of files in the L0 that trigger compaction. The db_bench option for this is --level0_file_num_compaction_trigger. When the value is larger there will be more L0 files on average that a query might have to check, which means more CPU overhead. Unfortunately, I configured RocksDB incorrectly so I don't have results to share. The issue is that when the L0 is configured to be larger, the L1 should be configured to be at least as large as the L0 (L1 target size should be >= sizeof(SST) * num(L0 files)). If not, then L0->L1 compaction will happen sooner than expected.
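
To make the comparisons concrete, here is a minimal sketch of the db_bench flags involved. The flag names are real db_bench options, but the values shown are illustrative and are not the exact command lines used for these tests:

    # Illustrative readwhilewriting invocation; flag values are examples only:
    #   --cache_type: lru_cache (the default) vs auto_hyper_clock_cache
    #   --max_bytes_for_level_multiplier: per-level fanout, 8 vs 32
    #   --benchmark_write_rate_limit: background write rate in bytes/s (2, 8 or 32 MB/s here)
    #   --level0_file_num_compaction_trigger: number of L0 files that triggers compaction
    ./db_bench --benchmarks=readwhilewriting \
        --threads=40 \
        --cache_type=auto_hyper_clock_cache \
        --max_bytes_for_level_multiplier=32 \
        --benchmark_write_rate_limit=$((8 * 1024 * 1024)) \
        --level0_file_num_compaction_trigger=8

    # If level0_file_num_compaction_trigger is raised, raise the L1 target
    # size to match: with 8 L0 files of ~64M each, --max_bytes_for_level_base
    # should be at least 8 * 64M = 512M.
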
All of the results are in this spreadsheet.

Results: LRU vs auto_hyper_clock_cache

These graphs have QPS from the readwhilewriting benchmark for the LRU and AHCC block cache implementations. LRU is the original version with a sharded hash table and a mutex per shard, while AHCC is the hyper clock cache (--cache_type=auto_hyper_clock_cache).

Summary:
  • QPS is much better with AHCC than LRU (~3.3X faster at 40 threads)
  • QPS with AHCC scales linearly with the thread count
  • QPS with LRU does not scale linearly and suffers from mutex contention
  • There are some odd effects in the results for 1 thread

With a 2 MB/s background write rate, AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
With an 8 MB/s background write rate, AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
With a 32 MB/s background write rate, AHCC is ~1.1X faster at 1 thread and ~2.9X faster at 40 threads relative to LRU.

Results: per level fanout

These graphs have QPS from the readwhilewriting benchmark to compare results with per-level fanout set to 8 and 32.

Summary:
  • QPS is often 1.1X to 1.3X larger with fanout=32 vs fanout=8

With an 8 MB/s background write rate and LRU, fanout=8 is faster at 1 thread, but fanout=32 is 1.1X to 1.3X faster at 10 to 40 threads.
With an 8 MB/s background write rate and AHCC, fanout=8 is faster at 1 thread, but fanout=32 is ~1.1X faster at 10 to 40 threads.

With a 32 MB/s background write rate and LRU, fanout=8 is ~2X faster at 1 thread, but fanout=32 is 1.1X to 1.2X faster at 10 to 40 threads.
With a 32 MB/s background write rate and AHCC, fanout=8 is ~2X faster at 1 thread, but fanout=32 is ~1.1X faster at 10 to 40 threads.

Results: background write rate

Summary:
  • With LRU
    • QPS drops by up to ~15% as the background write rate grows from 2 MB/s to 32 MB/s
    • QPS does not scale linearly and suffers from mutex contention
  • With AHCC
    • QPS drops by up to 13% as the background write rate grows from 2 MB/s to 32 MB/s
    • QPS scales linearly with the thread count
  • There are some odd effects in the results for 1 thread

Results with LRU show that per-thread QPS doesn't scale linearly.
Results with AHCC show that per-thread QPS scales linearly, ignoring the odd results for 1 thread.

November 05, 2024

Effective unemployment and social media

Being unemployed can be incredibly depressing. So much rejection. Everything seems to be out of your control. Everything except for one thing: what you produce.

You might know that repeatedly posting on social media that you are looking for work is ineffective. That it looks (or at least feels) worse each time you say so. But there is at least one major caveat to this.

Every single time you create something and share it publicly is a chance to also reiterate that you are looking for work. And people actually appreciate and value this!

Whether you write a blog post or build some project, you are seen as working on yourself and contributing to the community. Positive things! And it is no problem at all to mention, with each new post you write and each new project you publish, that you are also looking for work.

Moreover, the dynamics of the internet and social media basically require that you regularly produce something new: either a new version of some existing project, or entirely new projects (or blog posts).

What you did a week ago is old news on social media. What will you do next week?

This could itself feel depressing, except that it's probably a fairly healthy thing for you anyway! It is a motivation to keep your skills sharp as time goes on.

So while you're unemployed and able to muster the motivation, write about things that are interesting to you! Build projects that intrigue you. Leave a little note on every post and project that you are looking for work. And share every post and project on social media.

You'll expose yourself to opportunities and referrals. And even if no post or project "takes off" you will still be working on yourself and contributing back knowledge to the community.

Optimizing query planning in Vitess: a step-by-step approach

Introduction: In this blog post, we will discuss an example of a change to the Vitess query planner and how it enhances the optimization process. The new model focuses on making every step in the optimization pipeline a runnable plan. This approach offers several benefits, including simpler understanding and reasoning, ease of testing, and the ability to use arbitrary expressions in ordering, grouping, and aggregations. Vitess distributed query planner: VTGate is the proxy component of Vitess. …

November 04, 2024

RocksDB benchmarks: small server, universal compaction

I shared benchmark results for RocksDB a few weeks ago using leveled compaction and a small server. Here I have results for universal compaction and the same small server.

tl;dr
  • in general there are some improvements and some small regressions, with one exception (see bug 12038)
  • for a cached database
    • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
    • Performance has been stable since 7.x
  • for an IO-bound database with buffered IO
    • bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
    • QPS is otherwise stable 
  • for an IO-bound database with O_DIRECT
    • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0
    • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2

Hardware

The small server is named SER7: a Beelink SER7 7840HS (see here) with an AMD Ryzen 7 7840HS CPU, 8 cores, AMD SMT disabled, running Ubuntu 22.04. Storage is one NVMe device with ext4 and data=writeback.

The storage device has max_hw_sectors_kb and max_sectors_kb set to 128. This is relevant for bug 12038, which will be fixed soon in a 9.7 patch release.
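
As a quick sketch of how to check these values via sysfs (the device name nvme0n1 is illustrative; substitute your own):

    cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
    cat /sys/block/nvme0n1/queue/max_sectors_kb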

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1, 9.6.2, 9.7.2, 9.7.4 and 9.8.1

Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

    # Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
    bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible

Workloads

There are three workloads, all of which use one client (thread):

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are here for byrx, for iobuf and for iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS, which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base; when it is less than 1.0 then $base is faster (a performance regression). For example, a relative QPS of 0.85 means the version tested gets 85% of the throughput of the base version.

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here.

The charts show the QPS for a given version of RocksDB relative to 6.0.2. There are two charts with the same data; the y-axis on the second doesn't start at 0, to improve readability.

Summary:
  • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
  • Performance has been stable since 7.x

Results: iobuf

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

The charts show the QPS for a given version of RocksDB relative to 6.0.2. There are two charts with the same data; the y-axis on the second doesn't start at 0, to improve readability.

Summary:
  • bug 12038 explains the regression for overwrite (fixed soon in 9.7)
  • QPS for fillseq has been stable
  • QPS for revrangeww, fwdrangeww and readww is stable. I am not sure about the variance in the 9.6 and 9.7 releases. The cause might be that universal (tiered) compaction is more prone to variance. I will revisit that when I run tests again in a few months.

Results: iodir

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

The charts show the QPS for a given version of RocksDB relative to 6.0.2. There are two charts with the same data; the y-axis on the second doesn't start at 0, to improve readability.

Summary:
  • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
  • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2

October 31, 2024

Checking linearizability in Go

You want to check for strict consistency (linearizability) for your project but you don't want to have to deal with the JVM. Porcupine, used by a number of real-world systems like etcd and TiDB, has you covered!

Importantly, neither the Jepsen tools nor Porcupine can prove linearizability. They can only help you build confidence that you aren't obviously violating linearizability.

The Porcupine README is pretty good but doesn't give complete working code, so I'm going to walk through checking linearizability of a distributed register. And then we'll tweak things a bit by checking linearizability for a distributed key-value store.

But rather than actually implementing a distributed register and a distributed key-value store, to keep this post concise we're just going to imagine that they exist, and we'll come up with some example histories we might see.

Code for this post can be found on GitHub.

Boilerplate

Create a new directory and run go mod init lintest. Let's add the imports we need and a helper function for generating a visualization of a history, in main.go:

package main

import (
    "log"
    "os"

    "github.com/anishathalye/porcupine"
)

func visualizeTempFile(model porcupine.Model, info porcupine.LinearizationInfo) {
    file, err := os.CreateTemp("", "*.html")
    if err != nil {
        panic("failed to create temp file")
    }
    err = porcupine.Visualize(model, info, file)
    if err != nil {
        panic("visualization failed")
    }
    log.Printf("wrote visualization to %s", file.Name())
}

A distributed register

A distributed register is like a distributed key-value store but there's only a single key.

We need to tell Porcupine what the inputs and outputs for this system are. And we'll later describe for it how an idealized version of this system should behave as it receives each input; what output the idealized version should produce.

Each time we send a command to the distributed register it will include an operation (to get or to set the register). And if it is a set command it will include a value.

type registerInput struct {
    operation string // "get" or "set"
    value int
}

The register is an integer register.

Now we will define a model for Porcupine which, again, is the idealized version of this system.

func main() {
    registerModel := porcupine.Model{
        Init: func() any {
            return 0
        },
        Step: func(stateAny, inputAny, outputAny any) (bool, any) {
            input := inputAny.(registerInput)
            output := outputAny.(int)
            state := stateAny.(int)
            if input.operation == "set" {
                return true, input.value
            } else if input.operation == "get" {
                readCorrectValue := output == state
                return readCorrectValue, state
            }

            panic("Unexpected operation")
        },
    }

The step function accepts anything because it has to be able to model any sort of system, with its different inputs, outputs, and current state. So we have to handle casting from the any type to what we know are the inputs, outputs, and state. And finally we actually do the state change and return the new state, as well as whether the given output matches what we know it should be.

An invalid history

Now we've only defined the idealized version of this system. Let's pretend we have some real-world implementation of this. We might have two clients and they might issue concurrent get and set requests.

Every time we stimulate the system we will generate a new history that we can validate with Porcupine against our model to see if the history is linearizable.

Let's imagine these two clients concurrently set the register to some value. Both sets succeed. Then both clients read the register. And they get different values. Here's what that history would look like modeled for Porcupine.

    ops := []porcupine.Operation{
        // Client 3 sets the register to 100. The request starts at t0 and ends at t2.
        {3, registerInput{"set", 100}, 0, 100 /* end state at t2 is 100 */, 2},
        // Client 5 sets the register to 200. The request starts at t3 and ends at t4.
        {5, registerInput{"set", 200}, 3, 200 /* end state at t4 is 200 */, 4},
        // Client 3 reads the register. The request starts at t5 and ends at t6.
        {3, registerInput{"get", 0 /* doesn't matter */ }, 5, 200, 6},
        // Client 5 reads the register. The request starts at t7 and ends at t8. Reads a stale value!
        {5, registerInput{"get", 0 /* doesn't matter */}, 7, 100, 8},
    }
    res, info := porcupine.CheckOperationsVerbose(registerModel, ops, 0)
    visualizeTempFile(registerModel, info)

    if res != porcupine.Ok {
        panic("expected operations to be linearizable")
    }
}

If we build and run this code:

$ go mod tidy
go: finding module for package github.com/anishathalye/porcupine
go: found github.com/anishathalye/porcupine in github.com/anishathalye/porcupine v0.1.6
$ go build
$ ./lintest
2024/10/31 19:54:08 wrote visualization to /var/folders/cb/v27m749d0sj89h9ydfq0f0940000gn/T/463308000.html
panic: expected operations to be linearizable

goroutine 1 [running]:
main.main()
        /Users/phil/tmp/lintest/main.go:59 +0x394

Porcupine caught the stale value. Open that HTML file to see the visualization.

A valid history

Let's say we fix the bug so now there's no stale read. The new history would look like this:

    ops := []porcupine.Operation{
        // Client 3 sets the register to 100. The request starts at t0 and ends at t2.
        {3, registerInput{"set", 100}, 0, 100 /* end state at t2 is 100 */, 2},
        // Client 5 sets the register to 200. The request starts at t3 and ends at t4.
        {5, registerInput{"set", 200}, 3, 200 /* end state at t4 is 200 */, 4},
        // Client 3 reads the register. The request starts at t5 and ends at t6.
        {3, registerInput{"get", 0 /* doesn't matter */ }, 5, 200, 6},
        // Client 5 reads the register. The request starts at t7 and ends at t8.
        {5, registerInput{"get", 0 /* doesn't matter */}, 7, 200, 8},
    }

Rebuild, rerun lintest (it should exit successfully now), and open the visualization.

Great! Now let's make things a little more complicated by modeling a distributed key-value store rather than a distributed register.

Distributed key-value

The inputs of this system will be slightly more complex. They will take a key along with the operation and value.

type kvInput struct {
    operation string // "get" or "set"
    key string
    value int
}

And we model the distributed key-value store with the state and output at each step being a map[string]int.

    kvModel := porcupine.Model{
        Init: func() any {
            return map[string]int{}
        },
        Step: func(stateAny, inputAny, outputAny any) (bool, any) {
            input := inputAny.(kvInput)
            output := outputAny.(map[string]int)
            state := stateAny.(map[string]int)
            if input.operation == "set" {
                newState := map[string]int{}
                for k, v := range state {
                    newState[k] = v
                }
                newState[input.key] = input.value
                return true, newState
            } else if input.operation == "get" {
                readCorrectValue := output[input.key] == state[input.key]
                return readCorrectValue, state
            }

            panic("Unexpected operation")
        },
    }

And now the history gets slightly more complex because we are now working with some specific key. But we'll otherwise use the same history as before.

    ops := []porcupine.Operation{
        // Client 3 set key `a` to 100. The request starts at t0 and ends at t2.
        {3, kvInput{"set", "a", 100}, 0, map[string]int{"a": 100}, 2},
        // Client 5 set key `a` to 200. The request starts at t3 and ends at t4.
        {5, kvInput{"set", "a", 200}, 3, map[string]int{"a": 200}, 4},
        // Client 3 read key `a`. The request starts at t5 and ends at t6.
        {3, kvInput{"get", "a", 0 /* doesn't matter */ }, 5, map[string]int{"a": 200}, 6},
        // Client 5 read key `a`. The request starts at t7 and ends at t8.
        {5, kvInput{"get", "a", 0 /* doesn't matter */}, 7, map[string]int{"a": 200}, 8},
    }

Build and run. Open the visualization.

And there we go!

What's next

These are just a few simple examples that are not hooked up to a real system. But it still seemed useful to show how you model a couple of simple systems and check a history with Porcupine.

Another aspect of Porcupine I did not cover is partitioning the state space. The docs say:

Implementing the partition functions can greatly improve performance. If you're implementing the partition function, the model Init and Step functions can be per-partition. For example, if your specification is for a key-value store and you partition by key, then the per-partition state representation can just be a single value rather than a map.
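
For example, here is a minimal sketch of a by-key partition for the kv model above. It reuses the kvInput type from earlier and the Partition field of porcupine.Model that the docs refer to, which splits a history into sub-histories that are checked independently; with it, the per-partition state can be a single int rather than a map:

    kvModel := porcupine.Model{
        // Split the history by key so each key's operations are checked
        // independently against a single-register model.
        Partition: func(history []porcupine.Operation) [][]porcupine.Operation {
            byKey := map[string][]porcupine.Operation{}
            for _, op := range history {
                key := op.Input.(kvInput).key
                byKey[key] = append(byKey[key], op)
            }
            partitions := [][]porcupine.Operation{}
            for _, ops := range byKey {
                partitions = append(partitions, ops)
            }
            return partitions
        },
        // Per-partition state is now just one value, like the register model.
        Init: func() any {
            return 0
        },
        Step: func(stateAny, inputAny, outputAny any) (bool, any) {
            input := inputAny.(kvInput)
            output := outputAny.(int)
            state := stateAny.(int)
            if input.operation == "set" {
                return true, input.value
            } else if input.operation == "get" {
                return output == state, state
            }
            panic("Unexpected operation")
        },
    }

Note that with this partitioned model the operation outputs in the history would be plain ints rather than maps.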

Perhaps that, and hooking this up to some "real" system, would be a good next step.