A curated list of database news from authoritative sources

November 19, 2024

Offset Considered Harmful or: The Surprising Complexity of Pagination in SQL

For most applications, the job of a database system is to ingest all your data and give you access to the data you care about. Unfortunately, sometimes this is a large amount of data, and a typical application should only carefully take small sips from the fire hose of data. Generally, controlling the amount of output data is hard, but SQL offers the special limit syntax, which can guarantee that the result is bounded, and you never turn on the fire hose’s full blast. Unfortunately, the story does not end here, since there are many caveats.
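
To make one of those caveats concrete, here is a minimal sketch (my illustration, not from the post, using a hypothetical articles table): with OFFSET the database must read and discard every skipped row, and rows can shift between pages under concurrent writes, while keyset pagination seeks directly to the next page via an index.

  -- Offset pagination: the server still reads and discards the first
  -- 10000 rows, so later pages get slower and slower.
  SELECT id, title FROM articles ORDER BY id LIMIT 10 OFFSET 10000;

  -- Keyset pagination: remember the last id from the previous page
  -- (10010 here) and seek straight to it, so the cost per page stays flat.
  SELECT id, title FROM articles WHERE id > 10010 ORDER BY id LIMIT 10;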

November 15, 2024

Active and influential NYC infrastructure people

These are some of the most influential (mostly due to experience or expertise) and active folks (I actually see them attend events) in the NYC infrastructure scene (that I have a personal connection to).

If you're running a dinner or are just looking to meet interesting people in NYC in software infrastructure, consider this list and feel free to mention "Phil said you are awesome".

I've normalized titles a little bit but I say every title in the most generous way. These folks are brilliant.

This list is intentionally randomized. Also not a complete list. I've surely forgotten (let alone not yet met) great folk.

November 12, 2024

Grouping and Aggregations on Vitess

I love my job. One of the best feelings is when I find an interesting paper and use it to solve a real problem. It feels like I found a cheat code. Instead of having to do a lot of hard thinking, I can just stand on the shoulders of really big people and take a shortcut. Here, I want to share a recent project that I could solve using a public paper.

November 09, 2024

Fixing some of the InnoDB scan perf regressions in a MySQL fork

I recently learned of Advanced MySQL, a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions for write-heavy workloads that landed in InnoDB after MySQL 8.0.28.

In response to my results, the project lead filed a bug for performance regressions and then quickly came up with a diff. The bug in this case is for regressions that are most obvious during full table scans; the problems arrived in MySQL 8.0.29 and 8.0.30 -- see bug 111538 and this post. The bug is closed for upstream but the perf regressions remain, so I am excited to see the community working to solve this problem.

tl;dr

  • Advanced MySQL with the fix removes much of the regression in scan performance

Builds

I tried 4 builds:

  • my8028 - upstream MySQL 8.0.28
  • my8040 - upstream MySQL 8.0.40
  • my8040adv_pre - Advanced MySQL 8.0.40 without the fix (without d347cdb)
  • my8040adv_post - Advanced MySQL 8.0.40 with the fix (at d347cdb)

Hardware

The servers are:

  • dell32
    • Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32 cores, and 2 M.2 SSDs (each 2TB, SW RAID 0, ext4)
  • ax162-s
    • AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
  • bee
    • Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, Ubuntu 22.04 and ext4 on NVMe

Benchmark

I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. The database is cached by InnoDB.

The benchmark is run with ...
  • dell32 - 8 tables, 10M rows per table and 24 threads
  • ax162-s - 8 tables, 10M rows per table and 40 threads
  • bee - 1 table, 30M rows and 1 thread

Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.
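
To make "1 type of SQL statement" concrete, here is a minimal sketch (my illustration; the exact statements vary by microbenchmark) using sysbench's standard sbtest tables with columns (id, k, c, pad):

  -- a point-query microbenchmark issues single-row lookups such as:
  SELECT c FROM sbtest1 WHERE id = 42;

  -- while the scan microbenchmark must read every row, e.g. something like:
  SELECT COUNT(*) FROM sbtest1;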

Results: overview

All of the results use relative QPS (rQPS) where:
  • rQPS is: (QPS for my version / QPS for base version)
  • base version is MySQL 8.0.28
  • my version is one of the other versions

Here I only share the results for the scan microbenchmark.
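
For example, in the dell32 results below, the o/s column shows 246 for my8028 and 225 for my8040, so the rQPS for my8040 is 225 / 246 ≈ 0.91 -- the value reported for scan.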

Results: dell32

Summary:
  • QPS with the fix in Advanced MySQL is ~9% better than without the fix
  • QPS with the fix in Advanced MySQL is ~2% better than my8040
  • I am not sure why my8040adv_pre did much worse than my8040
From the relative QPS results, the QPS with my8040adv_pre was ~15% less than my8028. But my8040adv_post is only ~7% slower than my8028, so the fix removes about half of the regression.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_pre
col-3 : my8040adv_post

col-1   col-2   col-3
0.91    0.85    0.93    scan

From vmstat and iostat metrics, CPU overhead for my8040adv_pre was ~22% larger than for my8028. But with the fix, the CPU overhead for my8040adv_post is only ~8% larger than for my8028. This is great.

--- absolute
cpu/o cs/o r/o rKB/o wKB/o o/s dbms
0.093496 3.256 0 0 0.006 246 my8028
0.106105 4.065 0 0 0.006 225 my8040
0.113878 4.344 0 0 0.006 208 my8040adv_pre
0.101104 3.978 0 0 0.006 228 my8040adv_post
--- relative to first result
1.13 1.25 1 1 1.00 0.91 my8040
1.22 1.33 1 1 1.00 0.85 my8040adv_pre
1.08 1.22 1 1 1.00 0.93 my8040adv_post

Results: ax162-s

Summary:
  • QPS is ~18% larger with the fix in Advanced MySQL
  • CPU overhead is ~15% smaller with the fix
From the relative QPS results, the QPS with my8040adv_pre was the same as my8040 and both were ~17% slower than my8028. But my8040adv_post is only ~2% slower than my8028, which is excellent.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_pre
col-3 : my8040adv_post

col-1   col-2   col-3
0.83    0.83    0.98    scan

From vmstat and iostat metrics, CPU overhead for my8040 and my8040adv_pre was ~20% larger than for my8028. But with the fix, the CPU overhead for my8040adv_post is only ~3% larger than for my8028. This is great.

--- absolute
cpu/o cs/o r/o rKB/o wKB/o o/s dbms
0.018767 0.552 0 0 0.052 872 my8028
0.022533 0.800 0 0 0.013 725 my8040
0.022499 0.808 0 0.001 0.034 727 my8040adv_pre
0.019305 0.731 0 0 0.03 851 my8040adv_post
--- relative to first result
1.20 1.45 1 1 0.25 0.83 my8040
1.20 1.46 1 inf 0.65 0.83 my8040adv_pre
1.03 1.32 1 1 0.58 0.98 my8040adv_post

Results: bee

Summary:
  • QPS is ~17% larger with the fix in Advanced MySQL
  • CPU overhead is ~15% smaller with the fix
I did not test my8040adv_pre on this server.

From the relative QPS results, the QPS with my8040 is ~22% less than my8028. But the QPS from my8040adv_post is only ~9% less than my8028. This is great.

Relative to: my8028
col-1 : my8040
col-2 : my8040adv_post

col-1   col-2
0.78    0.91    scan

From vmstat and iostat metrics, CPU overhead for my8040 was ~28% larger than for my8028. But with the fix, the CPU overhead for my8040adv_post is only ~11% larger than for my8028. This is great.

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.222553        2.534   0       0.001   0.035   55      my8028
0.285792        7.622   0       0       0.041   43      my8040
0.246404        6.475   0       0       0.036   50      my8040adv_post
--- relative to first result
1.28            3.01    1       0.00    1.17    0.78    my8040
1.11            2.56    1       0.00    1.03    0.91    my8040adv_post

RocksDB benchmarks: large server, leveled compaction

A few weeks ago I shared benchmark results for RocksDB with both leveled and universal compaction on a small server. This post has results from a large server with leveled compaction.

tl;dr

  • there are a few regressions from bug 12038
  • QPS for overwrite is ~1.5X to ~2X better in 9.x than 6.0 (ignoring bug 12038)
  • otherwise QPS in 9.x is similar to 6.x

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3

Benchmark

All tests used the default value for compaction_readahead_size and the default block cache implementation (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible

Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here.

This chart shows the QPS for each version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • fillseq is worse from 6.0 to 8.0 but stable since then
  • overwrite has large improvements late in 6.0 and small improvements since then
  • fwdrangeww has small improvements in early 7.0 and is stable since then
  • revrangeww and readww are stable from 6.0 through 9.x

Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here.

This chart shows the QPS for each version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • bug 12038 explains the drop in throughput for overwrite since 8.6.7
  • otherwise QPS in 9.x is similar to 6.0

Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here.

This chart shows the QPS for each version relative to RocksDB 6.0.2. The y-axis doesn't start at 0 in the second chart to improve readability for some lines.

Summary:
  • the QPS drop for overwrite in 8.6.7 occurs because the db_bench client wasn't updated to use the new default value for compaction readahead size
  • QPS for overwrite is ~2X better in 9.x relative to 6.0
  • otherwise QPS in 9.x is similar to 6.0

Efficient MySQL Performance In 10 Sentences

Don’t have time to read Efficient MySQL Performance? Here’s the book (10 chapters) in one-liners.

  1. Performance is query response time.
  2. Proper left-most indexing is required for performance.
  3. The less data, the better.
  4. Access patterns (part of the workload) help or hinder performance.
  5. Sharding is how to scale writes when single-node performance is truly reached.
  6. Server metrics reflect how the app workload causes MySQL to work.
  7. Replication lag is data loss.
  8. Locks are held until a transaction commits, so commit quickly.
  9. There are many other challenges that you might need to address—sorry.
  10. MySQL in the cloud is slower and more expensive, so performance is more important than ever.

November 06, 2024

Application Architecture: Combining DynamoDB and Tinybird

Most applications tend to be built around a “transactional” core. Buy a thingamajig. Cancel a whoosiwatsie. Edit a whatchamacallit. You might be booking flights, posting bird pics on Insta, or patronizing the local Syrian restaurant for lunch (tabouleh, anyone?). While CRUD transactions are the foundation of applications, many are now starting to offer (or see user demand for) analytical experiences: travellers want to see price change history to find the best time to fly, content creators w…

RocksDB on a big server: LRU vs hyperclock

This post has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the block cache type (LRU vs hyperclock) and a few other configuration choices for a CPU-bound workload. A previous post with great results for the hyperclock block cache is here.

tl;dr

  • read QPS is up to ~3X better with auto_hyper_clock_cache vs LRU
  • read QPS is up to ~1.3X better with the per-level fanout set to 32 vs 8
  • read QPS drops by ~15% as the background write rate increases from 2 to 32 M/s

Software

I used RocksDB 9.6, compiled with gcc 11.4.0.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and are repeated for 1, 10, 20 and 40 threads. 

I focus on the readwhilewriting benchmark where performance is reported for the reads (point queries) while there is a fixed rate for writes done in the background. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

To save time I did not run the fwdrangewhilewriting benchmark. Were I to repeat this work, I would include it because its results would be interesting for a few of the configuration options I compared.

I did tests to understand the following:

  • LRU vs auto_hyper_clock_cache for the block cache implementation
    • LRU is the original implementation. The code was simple, which is nice. The implementation for LRU is sharded with a mutex per shard and that mutex can become a hot spot. The hyperclock implementation is much better at avoiding hot spots.
  • per level fanout (8 vs 32)
    • By per level fanout I mean the value of --max_bytes_for_level_multiplier which determines the target size difference between adjacent levels. By default I use 8, while 10 is also a common choice. Here I compare 8 vs 32. When the fanout is larger the LSM tree has fewer levels -- meaning there are fewer places to check for data which should reduce CPU overhead and increase QPS. For example, with a 1G target for L1, fanout=8 gives level targets of 1G, 8G, 64G and 512G, while fanout=32 gives 1G, 32G and 1024G -- one less level for the same amount of data.
  • background write rate
    • I repeated tests with the background write rate (--benchmark_write_rate_limit) set to 2, 8 and 32 MB/s. With a higher write rate there is more chance for interference between reads and writes. The interference might be from mutex contention, compaction threads using more CPU, more L0 files to check or more data in levels L1 and larger.
  • target size for L0
    • By target size I mean the number of files in the L0 that trigger compaction. The db_bench option for this is --level0_file_num_compaction_trigger. When the value is larger there will be more L0 files on average that a query might have to check and that means there is more CPU overhead. Unfortunately, I configured RocksDB incorrectly so I don't have results to share. The issue is that when the L0 is configured to be larger, the L1 should be configured to be at least as large as the L0 (L1 target size should be >= sizeof(SST) * num(L0 files)). If not, then L0->L1 compaction will happen sooner than expected. A worked example follows this list.
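
As a worked example of that sizing rule (hypothetical numbers): with 64M SSTs and --level0_file_num_compaction_trigger=20, the L1 target size should be at least 64M * 20 = 1280M; with a smaller L1 target, L0->L1 compaction is triggered before 20 files accumulate in the L0.
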
All of the results are in this spreadsheet.

Results: LRU vs auto_hyper_clock_cache

These graphs have QPS from the readwhilewriting benchmark for the LRU and AHCC block cache implementations where LRU is the original version with a sharded hash table and a mutex per shard while AHCC is the hyper clock cache (--cache_type=auto_hyper_clock_cache).

Summary:
  • QPS is much better with AHCC than LRU (~3.3X faster at 40 threads)
  • QPS with AHCC scales linearly with the thread count
  • QPS with LRU does not scale linearly and suffers from mutex contention
  • There are some odd effects in the results for 1 thread
  • With a 2M/s background write rate, AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
  • With an 8M/s background write rate, AHCC is ~1.1X faster at 1 thread and ~3.3X faster at 40 threads relative to LRU.
  • With a 32M/s background write rate, AHCC is ~1.1X faster at 1 thread and ~2.9X faster at 40 threads relative to LRU.

Results: per level fanout

These graphs have QPS from the readwhilewriting benchmark to compare results with per-level fanout set to 8 and 32.

Summary:
  • QPS is often 1.1X to 1.3X larger with fanout=32 vs fanout=8

  • With an 8M/s background write rate and LRU, fanout=8 is faster at 1 thread but fanout=32 is 1.1X to 1.3X faster at 10 to 40 threads.
  • With an 8M/s background write rate and AHCC, fanout=8 is faster at 1 thread but fanout=32 is ~1.1X faster at 10 to 40 threads.
  • With a 32M/s background write rate and LRU, fanout=8 is ~2X faster at 1 thread but fanout=32 is 1.1X to 1.2X faster at 10 to 40 threads.
  • With a 32M/s background write rate and AHCC, fanout=8 is ~2X faster at 1 thread but fanout=32 is ~1.1X faster at 10 to 40 threads.

Results: background write rate

Summary:
  • With LRU
    • QPS drops by up to ~15% as the background write rate grows from 2M/s to 32M/s
    • QPS does not scale linearly and suffers from mutex contention
  • With AHCC
    • QPS drops by up to 13% as the background write rate grows from 2M/s to 32M/s
    • QPS scales linearly with the thread count
  • There are some odd effects in the results for 1 thread
Results with LRU show that per-thread QPS doesn't scale linearly. Results with AHCC show that per-thread QPS scales linearly, ignoring the odd results for 1 thread.