A curated list of database news from authoritative sources

November 05, 2024

Effective unemployment and social media

Being unemployed can be incredibly depressing. So much rejection. Everything seems to be out of your control. Everything except for one thing: what you produce.

You might know that repeatedly posting on social media that you are looking for work is ineffective. That it looks (or at least feels) worse each time you say so. But there is at least one major caveat to this.

Every single time you create something and share it publicly is a chance to also reiterate that you are looking for work. And people actually appreciate and value this!

Whether you write a blog post or build some project, you are seen as working on yourself and contributing to the community. Positive things! And it is no problem at all to mention with each new post you write and each new project you publish that you are also looking for work.

Moreover, the dynamics of the internet and social media basically require that you regularly produce something new: either a new version of some existing project, or entirely new projects (or blog posts).

What you did a week ago is old news on social media. What will you do next week?

This could itself feel depressing, except that it's probably a fairly healthy thing for you anyway! It is motivation to keep your skills sharp as time goes on.

So while you're unemployed and able to muster the motivation, write about things that are interesting to you! Build projects that intrigue you. Leave a little note on every post and project that you are looking for work. And share every post and project on social media.

You'll expose yourself to opportunities and referrals. And even if no post or project "takes off" you will still be working on yourself and contributing back knowledge to the community.

Optimizing query planning in Vitess: a step-by-step approach

In this blog post, we will discuss an example of a change to the Vitess query planner and how it enhances the optimization process. The new model focuses on making every step in the optimization pipeline a runnable plan. This approach offers several benefits, including simpler understanding and reasoning, ease of testing, and the ability to use arbitrary expressions in ordering, grouping, and aggregations. VTGate is the proxy component of Vitess.

November 04, 2024

RocksDB benchmarks: small server, universal compaction

I shared benchmark results for RocksDB a few weeks ago using leveled compaction and a small server. Here I have results for universal compaction and the same small server.

tl;dr
  • in general there are some improvements and some small regressions, with one exception (see bug 12038)
  • for a cached database
    • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
    • Performance has been stable since 7.x
  • for an IO-bound database with buffered IO
    •  bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
    • QPS is otherwise stable 
  • for an IO-bound database with O_DIRECT
    • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0
    • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2
Hardware

The small server is named SER7 and is a Beelink SER7 7840HS (see here) with 8 cores (AMD SMT disabled), a Ryzen 7 7840HS CPU, and Ubuntu 22.04. Storage is ext4 with data=writeback and 1 NVMe device.

The storage device has max_hw_sectors_kb and max_sectors_kb set to 128. This is relevant for bug 12038, which will be fixed real soon in a 9.7 patch release.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1, 9.6.2, 9.7.2, 9.7.4 and 9.8.1
Benchmark

All tests used the default value for compaction_readahead_size. For all versions tested I used the default values for the block cache (LRU) and format_version.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 1 thread for the small server and 8 threads for the medium server. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is:

    # Small server, SER7: use 1 thread, 20M KV pairs for cached, 400M for IO-bound
    bash x3.sh 1 no 1800 c8r32 20000000 400000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use one client (thread):

  • byrx - the database is cached by RocksDB
  • iobuf - the database is larger than memory and RocksDB uses buffered IO
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT

A spreadsheet with all results is here and performance summaries with more details are here for byrx, for iobuf and for iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.
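
For example, with made-up numbers: if 6.0.2 gets 50,000 QPS on a test and 9.7.2 gets 45,000 QPS on the same test, then the relative QPS for 9.7.2 is 45000 / 50000 = 0.90, a ~10% regression.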

Results: byrx

The byrx tests use a cached database. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB versus 6.0.2. There are two charts with the same data; the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
  • Performance has been stable since 7.x
Results: iobuf

The iobuf tests use a database larger than memory with buffered IO. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB versus 6.0.2. There are two charts with the same data; the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • bug 12038 explains the regression for overwrite (fixed soon in 9.7)
  • QPS for fillseq has been stable
  • QPS for revrangeww, fwdrangeww and readww is stable. I am not sure about the variance in the 9.6 and 9.7 releases. The cause might be that universal (tiered) compaction is more prone to variance. I will revisit that when I run tests again in a few months.
Results: iodir

The iodir tests use a database larger than memory with O_DIRECT. The performance summary is here.

The charts show the relative QPS for a given version of RocksDB versus 6.0.2. There are two charts with the same data; the y-axis on the second doesn't start at 0 to improve readability.

Summary:
  • QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0. My vague memory is that the issue is new CPU overhead from better error checking.
  • QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2


October 31, 2024

Checking linearizability in Go

You want to check for strict consistency (linearizability) for your project but you don't want to have to deal with the JVM. Porcupine, used by a number of real-world systems like etcd and TiDB, has you covered!

Importantly, neither Jepsen projects nor Porcupine can prove linearizability. They can only help you build confidence that you aren't obviously violating linearizability.

The Porcupine README is pretty good but doesn't give complete working code, so I'm going to walk through checking linearizability of a distributed register. And then we'll tweak things a bit by checking linearizability for a distributed key-value store.

But rather than implementing a distributed register and a distributed key-value store, to keep this post concise we're just going to imagine that they exist and come up with some example histories we might see.

Code for this post can be found on GitHub.

Boilerplate

Create a new directory and run go mod init lintest. Let's add the imports we need and a helper function for generating a visualization of a history, in main.go:

package main

import "os"
import "log"
import "github.com/anishathalye/porcupine"

func visualizeTempFile(model porcupine.Model, info porcupine.LinearizationInfo) {
    file, err := os.CreateTemp("", "*.html")
    if err != nil {
        panic("failed to create temp file")
    }
    err = porcupine.Visualize(model, info, file)
    if err != nil {
        panic("visualization failed")
    }
    log.Printf("wrote visualization to %s", file.Name())
}

A distributed register

A distributed register is like a distributed key-value store but there's only a single key.

We need to tell Porcupine what the inputs and outputs for this system are. And we'll later describe for it how an idealized version of this system should behave as it receives each input; what output the idealized version should produce.

Each time we send a command to the distributed register it will include an operation (to get or to set the register). And if it is a set command it will include a value.

type registerInput struct {
    operation string // "get" or "set"
    value int
}

The register is an integer register.

Now we will define a model for Porcupine which, again, is the idealized version of this system.

func main() {
    registerModel := porcupine.Model{
        Init: func() any {
            return 0
        },
        Step: func(stateAny, inputAny, outputAny any) (bool, any) {
            input := inputAny.(registerInput)
            output := outputAny.(int)
            state := stateAny.(int)
            if input.operation == "set" {
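                // A set always succeeds; the new state is the value that was written.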
                return true, input.value
            } else if input.operation == "get" {
                readCorrectValue := output == state
                return readCorrectValue, state
            }

            panic("Unexpected operation")
        },
    }

The step function accepts anything because it has to be able to model any sort of system, with its different inputs and outputs and current state. So we have to cast from the any type to what we know the inputs, outputs, and state are. And finally we actually do the state change and return the new state, as well as whether the given output matches what we know it should be.

An invalid history

Now we've only defined the idealized version of this system. Let's pretend we have some real-world implementation of this. We might have two clients and they might issue concurrent get and set requests.

Every time we stimulate the system we will generate a new history that we can validate with Porcupine against our model to see if the history is linearizable.

Let's imagine these two clients concurrently set the register to some value. Both sets succeed. Then both clients read the register. And they get different values. Here's what that history would look like modeled for Porcupine.

    ops := []porcupine.Operation{
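        // Each porcupine.Operation is {ClientId, Input, Call (start time), Output, Return (end time)}.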
        // Client 3 sets the register to 100. The request starts at t0 and ends at t2.
        {3, registerInput{"set", 100}, 0, 100 /* end state at t2 is 100 */, 2},
        // Client 5 sets the register to 200. The request starts at t3 and ends at t4.
        {5, registerInput{"set", 200}, 3, 200 /* end state at t4 is 200 */, 4},
        // Client 3 reads the register. The request starts at t5 and ends at t6.
        {3, registerInput{"get", 0 /* doesn't matter */ }, 5, 200, 6},
        // Client 5 reads the register. The request starts at t7 and ends at t8. Reads a stale value!
        {5, registerInput{"get", 0 /* doesn't matter */}, 7, 100, 8},
    }
    res, info := porcupine.CheckOperationsVerbose(registerModel, ops, 0)
    visualizeTempFile(registerModel, info)

    if res != porcupine.Ok {
        panic("expected operations to be linearizable")
    }
}

If we build and run this code:

$ go mod tidy
go: finding module for package github.com/anishathalye/porcupine
go: found github.com/anishathalye/porcupine in github.com/anishathalye/porcupine v0.1.6
$ go build
$ ./lintest
2024/10/31 19:54:08 wrote visualization to /var/folders/cb/v27m749d0sj89h9ydfq0f0940000gn/T/463308000.html
panic: expected operations to be linearizable

goroutine 1 [running]:
main.main()
        /Users/phil/tmp/lintest/main.go:59 +0x394

Porcupine caught the stale value. Open that HTML file to see the visualization.

A valid history

Let's say we fix the bug so now there's no stale read. The new history would look like this:

    ops := []porcupine.Operation{
        // Client 3 sets the register to 100. The request starts at t0 and ends at t2.
        {3, registerInput{"set", 100}, 0, 100 /* end state at t2 is 100 */, 2},
        // Client 5 sets the register to 200. The request starts at t3 and ends at t4.
        {5, registerInput{"set", 200}, 3, 200 /* end state at t4 is 200 */, 4},
        // Client 3 reads the register. The request starts at t5 and ends at t6.
        {3, registerInput{"get", 0 /* doesn't matter */ }, 5, 200, 6},
        // Client 5 reads the register. The request starts at t7 and ends at t8.
        {5, registerInput{"get", 0 /* doesn't matter */}, 7, 200, 8},
    }

Rebuild, rerun lintest (it should exit successfully now), and open the visualization.

Great! Now let's make things a little more complicated by modeling a distributed key-value store rather than a distributed register.

Distributed key-value

The inputs of this system will be slightly more complex. They will take a key along with the operation and value.

type kvInput struct {
    operation string // "get" or "set"
    key string
    value int
}

And we model the distributed key-value store with the state and output at each step being a map[string]int.

    kvModel := porcupine.Model{
        Init: func() any {
            return map[string]int{}
        },
        Step: func(stateAny, inputAny, outputAny any) (bool, any) {
            input := inputAny.(kvInput)
            output := outputAny.(map[string]int)
            state := stateAny.(map[string]int)
            if input.operation == "set" {
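                // Copy the state map instead of mutating it: the checker can
                // step from the same state more than once, so each step must
                // return a fresh state.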
                newState := map[string]int{}
                for k, v := range state {
                    newState[k] = v
                }
                newState[input.key] = input.value
                return true, newState
            } else if input.operation == "get" {
                readCorrectValue := output[input.key] == state[input.key]
                return readCorrectValue, state
            }

            panic("Unexpected operation")
        },
    }

And now the history gets slightly more complex because we are working with a specific key. But we'll otherwise use the same history as before.

    ops := []porcupine.Operation{
        // Client 3 set key `a` to 100. The request starts at t0 and ends at t2.
        {3, kvInput{"set", "a", 100}, 0, map[string]int{"a": 100}, 2},
        // Client 5 set key `a` to 200. The request starts at t3 and ends at t4.
        {5, kvInput{"set", "a", 200}, 3, map[string]int{"a": 200}, 4},
        // Client 3 read key `a`. The request starts at t5 and ends at t6.
        {3, kvInput{"get", "a", 0 /* doesn't matter */ }, 5, map[string]int{"a": 200}, 6},
        // Client 5 read key `a`. The request starts at t7 and ends at t8.
        {5, kvInput{"get", "a", 0 /* doesn't matter */}, 7, map[string]int{"a": 200}, 8},
    }

Build and run. Open the visualization.

And there we go!

What's next

These are just a few simple examples that are not hooked up to a real system. But it still seemed useful to show how you model a couple of simple systems and check a history with Porcupine.

Another aspect of Porcupine I did not cover is partitioning the state space. The docs say:

Implementing the partition functions can greatly improve performance. If you're implementing the partition function, the model Init and Step functions can be per-partition. For example, if your specification is for a key-value store and you partition by key, then the per-partition state representation can just be a single value rather than a map.
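
To make that concrete, here's a minimal sketch (not tested against a real history) of partitioning the key-value model by key. It uses the Partition field on porcupine.Model, and it assumes each get's output is recorded as a plain int rather than a map:

    kvModelPartitioned := porcupine.Model{
        // Split the full history into independent per-key histories.
        Partition: func(history []porcupine.Operation) [][]porcupine.Operation {
            byKey := map[string][]porcupine.Operation{}
            for _, op := range history {
                key := op.Input.(kvInput).key
                byKey[key] = append(byKey[key], op)
            }
            partitions := [][]porcupine.Operation{}
            for _, ops := range byKey {
                partitions = append(partitions, ops)
            }
            return partitions
        },
        // Per-partition state is now a single int, just like the register model.
        Init: func() any {
            return 0
        },
        Step: func(stateAny, inputAny, outputAny any) (bool, any) {
            input := inputAny.(kvInput)
            state := stateAny.(int)
            if input.operation == "set" {
                return true, input.value
            } else if input.operation == "get" {
                return outputAny.(int) == state, state
            }

            panic("Unexpected operation")
        },
    }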

Perhaps that, and hooking this up to some "real" system, would be a good next step.

October 30, 2024

Query DynamoDB tables with SQL

Want to aggregate, filter, or join DynamoDB tables with SQL? Here's how to do it, and why you should (and shouldn't) query DynamoDB tables with SQL.

How to Correctly Sum Up Numbers

One of the first examples of loops in probably every programming language course is taking a list of numbers and calculating their sum. And in every programming language, it’s trivial to write up a solution:

#include <span>

int sumUpNumbers(std::span<const int> numbers) {
    int sum = 0;
    for (int i : numbers)
        sum += i;
    return sum;
}

If you compile this with a modern optimizing compiler, you will even get very efficient vectorized code that sums millions of integers per second. Nice!

October 29, 2024

Announcing Vitess 21

We're delighted to announce the release of Vitess 21 along with version 2.14.0 of the Vitess Kubernetes Operator. Version 21 focuses on enhancing query compatibility, improving cluster management, and expanding VReplication capabilities, with experimental support for atomic distributed transactions and recursive CTEs. Key features include reference table materialization, multi-metric throttler support, and enhanced Online DDL functionality. Backup and restore processes benefit from a new mysqlshell engine, while vexplain now offers detailed execution traces and schema analysis.