a curated list of database news from authoritative sources

November 30, 2024

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a large server to show the speedup from the hyperclock block cache implementation for different concurrency levels with RocksDB 9.6. Here I share results from the same server and different (old and new) RocksDB releases.

Results are amazing on a large (48 cores) server with 40 client threads

  • ~2X more QPS for range queries with hyperclock
  • ~3X more QPS for point queries with hyperclock

Software

I used RocksDB versions 6.0.2, 6.29.5, 7.0.4, 7.6.0, 7.7.8, 8.5.4, 8.6.7, 9.0.1, 9.1.2, 9.3.2, 9.5.2, 9.7.4 and 9.9.0. Everything was compiled with gcc 11.4.0.

The --cache_type argument selected the block cache implementation:

  • lru_cache was used for versions 7.6 and earlier. Because some of the oldest releases don't support --cache_type I also used --undef_params=...,cache_type
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auth_hyper_clock_cache was used for versions 8.5+

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and the benchmark is run for 40 threads.

I focus on the read-heavy benchmark steps:

  • revrangeww (reverse range while writing) - this does short reverse range scans
  • fwdrangeww (forward range while writing) - this does short forward range scans
  • readww (read while writing) - this does point queries

For each of these there is a fixed rate for writes done in the background and performance is reported for the reads. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.

Results

All results are in this spreadsheet and the performance summary is here.

The graph below shows relative QPS which is: (QPS for my version / QPS for RocksDB 6.0.2) and the results are amazing:

  • ~2X more QPS for range queries with hyperclock
  • ~3X more QPS for point queries with hyperclock

The average values for vmstat metrics provide more detail on why hyperclock is so good for performance. The context switch rate drops dramatically when it is enabled because there is much less mutex contention. The user CPU utilization increases by ~1.6X because more useful work can get done when there is less mutex contention.

legend
* cs - context switches per second per vmstat
* us - user CPU utilization per vmstat
* sy - system CPU utilization per vmstat
* id - idle CPU utilization per vmstat
* wa - wait CPU utilization per vmstat
* version - RocksDB version

cs      us      sy      us+sy   id      wa      version
1495325 50.3    14.0    64.3    18.5    0.1     7.6.0
2360    82.7    14.0    96.7    16.6    0.1     9.9.0







November 28, 2024

1 million page views

I was delighted to notice this morning that this site has recently passed 1M page views. And since Murat wrote about his 1M page view accomplishment at the time, I felt compelled to now too.

I started regularly blogging in 2018. For some reason I decided to write a blog post every month. And while I have definitely skipped a month or two here or there, on average I've written 2 posts per month.

Tooling

Since at least 2018 this site has been built with a static site generator. I might have used a 3rd-party generator at one point, but for as long as I can remember most of this site has been built with a little Python script I wrote.

I used to get so pissed when static site generators would pointlessly change their APIs and I'd have to make pointless changes. I have not had to make any significant changes to my build code in many years.

I hosted the site itself on GitHub Pages for many years. But I wanted more flexibility with subdomains (ultimately not something I liked) and the ability to view server-side logs (ultimately not something I ever do).

I think this site is hosted on an OVH machine now. But at this point it is inertia keeping me there. If you have no strong feelings otherwise, GitHub Pages is perfect.

I used to use Google Analytics but then they shut down the old version. The new version was incredibly confusing to use. I could not find some very basic information. So I moved to Fathom which has been great.

I used to track all subscribers in a Google Form and bcc them but this became untenable eventually after 1000 subscribers due to GMail rate limits. I currently use MailerLite for subscriptions and sending email about new posts. But this is an absolutely terrible service. They proxy all links behind a domain that adblockers hate and they also visually shorten the URL so you can't copy the text of the URL.

I just want a service that has a hosted form for collecting subscribers and a <textarea> that lets me dump raw HTML and send that as an email to my subscribers. No branding, no watermarks, no link proxying. This apparently doesn't exist. I am too lazy to figure out Amazon SES so I stick with MailerLite for now.

Evolution

In the beginning I talked about little interpreters in JavaScript, about programming languages, about Scheme. I was into functional programming. Over time I moved into little emulators and bytecode VMs. And for the last four years I became obsessed with databases and distributed systems.

I have almost always written about little projects to teach myself a concept. Writing a bytecode VM in Rust, emulating a subset of x86 in Go, implementing Raft in Go, implementing MVCC isolation levels in Go, and so on.

So many times when I tried to learn a concept I would find blog posts with only partial code. The post would link to a GitHub repo that, by the time I got to the post, had evolved significantly beyond what was described in the post. The repo code had by then become too complex for me to follow. So I was motivated to write minimal implementations and walk through the code in its entirety.

Even today there is not a single post on implementing TCP/IP from scratch that walks through entirely working code. (Please, someone write this.)

I have also had a blast writing survey posts such as how various databases execute expressions, analyzing non-V8 JavaScript implementations, how various programming language implementations parse code, and how various database systems build on top of key-value databases.

The last two posts have even each been cited in a research paper (here and here).

Editing

In terms of quality, my single greatest trick is to read the post out loud. Multiple times. Notice parts that are awkward or unclear and rewrite them.

My second greatest trick is to ask friends for review. Some posts like an intuition for distributed consensus and a write-ahead log is not a universal part of durability would simply not have been correct or credible without my fantastic reviewers. And I'm proud to have played that part a few times in turn.

We also have a fantastic #writing-and-drafts channel on the Software Internals Discord where folks (myself occasionally included) come for post review.

Context

I've lost count of the total number of times that these posts have been on the front page of Hacker News or that a tweet announcing a post has reached triple digits likes. I think I've had 9 posts on the front of HN this year. I do know that my single best year for HN was 12 months between 2022-2023 where 20 of my posts or projects were on the front page.

Every time a post does well there's a part of me that worries that I've peaked. But the way to deal with this has been to ignore that little voice and to just keep learning new things. I haven't stopped finding things confusing yet, and confusion is a phenomenal muse.

And also to, like, go out and meet friends for dinner, run meetups, run book clubs, chat with you fascinating internet strangers, play volleyball, and so on.

It's always been about cultivating healthy obsessions.

Benediction

In parting, I'll remind you:

November 27, 2024

Hey Claude, help me analyze Bluesky data.

This is the full, unedited transcript of our conversation with Claude, whose context-awareness is provided by a v0 Tinybird MCP Server.

November 25, 2024

RocksDB benchmarks: large server, universal compaction

This post has results from a large server with universal compaction from the same server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance more likely with universal.

tl;dr

  • there are some small regressions for cached workloads (see byrx below)
  • there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
  • modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't to test similar code across all versions

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3
Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww - like revrangeww except do point queries
  • overwrite - do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrxiobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.

Results: byrx

The byrx tests use a cached database. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
  • QPS for other tests has been stable, with some variance, since late 6.x
Results: iobuf

The iodir tests use an IO-bound database with buffered. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww improved in late 6.0 and have been stable since then
Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww have been stable but there is some variance








November 24, 2024

Zero Disk Architecture

State is pain. The next generation of infrastructure tools will be built on diskless paradigm. In this short post I will explain what is Diskless / Zero Disk Architecture

November 23, 2024

Blood draw

[trigger warning: blood]

I had my first blood draw in 13 years yesterday. The lengthy gap is not random. My last blood draw had gone horribly wrong.


The last time

That previous visit had been for a fasting blood draw. Until then, I'd never had issues with blood draw before. Nurses always complimented me on my veins. One of them said that I had "veins like a garden hose, they are impossible to miss." So, I felt zero fear/anxiety for the procedure. But I think I made the mistake of looking at the syringe. My blood flows fast, and blood draw goes twice as quickly for me than for my wife. I saw my bright red blood enthusiastically gushing the syringe. That was the last thing I remember. 

Then came nothing. It was the void, and after an indeterminate time, came the reboot. 

I literally experienced my brain booting up like an old Intel 488 computer booting up DOS. First the BIOS kicked in, it checked for available memory, scanned disk to locate the operating system, and started loading it. Then it felt like I was coming back up from void pitch black back up to the surface. When I regained full consciousness, I found myself pounding the blood draw chair with my fist. I then started hearing people. I stopped. By then several nurses had rushed into the room, and my wife looked terrified.

She later described what happened: my eyes had rolled back in their sockets, and I started thrashing violently! It lasted less then a minute, but it was so alarming that my wife thought I was dying. She began shouting and collapsed, and the nurses had to pick her up floor the floor. 

After researching the incident, I believe what I experienced was vasovagal syncope, specifically convulsive syncope. I found this old German research video, and I think this is what it must have looked like. This is what happens when the blood pressure drops suddenly and significantly reduces blood flow to your brain, 


Yesterday's return

Fast forward 13 years, I finally had returned to my primary physician for a well visit, which prescribed me the fasting blood work. I was procrastinating on getting it done, but as my wife also needed blood work, we scheduled back-to-back appointments through the Quest app for 9:50 AM on Friday.

I felt surprisingly calm. No anxiety the night before or that morning. Maybe because of my computer science training, I'm good at compartmentalizing. Why worry before the event? This is a 9:50am problem.

In the waiting room, I googled how to prevent fainting during blood draw. I had already learned about not watching my blood gushing into the tube from the previous episode. The other recommendations included using a reclined chair, breathing deeply, and crossing and tensing leg muscles to maintain blood flow to the brain.

I informed the phlebotomist that I fainted before, and we arranged for a reclined chair. I offered my left arm to the phlebotomist, and my wife held my right hand. Honestly, she seemed more nervous than I was, as I remained detached from the situation. I don't think I have a needle phobia; it's specifically seeing blood while fasting that triggers me. Ironically, I find it more distressing to watch my wife get blood drawn than to undergo it myself.

Ok, the plastic wraps on my left arm, and I make a fist. The needle goes in with a slight pinch. No problemo! But uh, oh. The phlebotomist gives up after 10 seconds, saying she missed the vein. She decided to try my right arm instead, and also said she will use a butterfly this time, because my veins did not manifest well. Wait what? I thought I had garden hose for veins. Checking later, I noticed my veins were barely visible, but they became quite prominent by evening time. Apparently, I was having a bad vein morning. 

Ok, they reposition around me. Now my wife clutched to my left hand with both her hands and she is feeling panicked. I again remain calm, feeling more curious than worried. The needle goes in, this time it hurts more, so I know it hit the vein. Good, I check myself, and I feel OK. Three tubes later, we are done. My wife suggested I stay seated, but I felt fine and stood up immediately.

That's it. Fortunately it is an anticlimactic ending to a story, but that probably wasn't worth your last few minutes. I don't think there is much point to this post, I just thought this was interesting. Oh, also I still feel sad thinking about the waiting room at the lab. It was mostly elderly patients, some having a hard time walking. This made me uncomfortably aware that I'll likely be visiting these labs more frequently in my later years, when I'm weak and less steady on my feet. This is definitely not my happy place.

On the one hand, I am very optimistic for the next 30 years, looking forward to how distributed systems, machine learning, space exploration, bio-exploration, sub-atomic exploration advances. On the other hand, I do not look forward to getting any older. I have middle-aged!

Everything is a Transaction: Unifying Logical Concurrency Control and Physical Data Structure Maintenance in Database Management Systems

This paper from CIDR'21 introduces the Deferred Action Framework (DAF). This framework aims to unify transaction control and data structure maintenance under multi-version concurrency control (MVCC) database systems, particularly for complex maintenance tasks like garbage collection and index cleanup.

In MVCC systems, transactions and data maintenance tasks (like garbage collection) often interfere, requiring complex coordination. This can lead to issues in system performance and code maintainability, as physical structures (e.g., B+ trees) aren't inherently designed to handle multi-versioning.

The paper proposes DAF to handle deferred actions—maintenance tasks that are queued to execute only when they no longer interfere with active transactions. DAF relies on timestamp-based ordering and executes tasks only when their actions won’t affect concurrent transactions. DAF guarantees to process actions deferred at some timestamp t only after all transactions started before t have exited. This provides *epoch protection* to transactions and maintenance tasks, without requiring a separate mechanism for refreshing and advancing epochs. Using a global epoch counter might have been simpler, but it would disallow the complex, multi-deferral-based ordering of actions needed for DAF’s broader applications like schema changes and index maintenance.

DAF is implemented in NoisePage (notice Pavlo's patented database naming scheme in action). The implementation chains deferrals to handle dependencies (e.g., an index cleanup must happen before a table deletion). DAF uses a single-action queue with timestamp tags, and processes tasks sequentially ensuring no conflicts arise. DAF supports multi-threaded processing, with both cooperative and dedicated action processing threads for high concurrency.

As Figure 1 shows, DAF uses a single action queue with a dedicated execution thread. DAF's transaction management process follows a specific sequence. First, a transaction worker gets an "observable" timestamp by incrementing the global counter. It then tags its deferred actions with this timestamp and adds them to the queue. When the transaction begins, the worker increments the global counter again. The action thread processes queue items by comparing queue item timestamps against the oldest active transaction timestamp. It executes items when their timestamp is smaller or when no active transactions exist, blocks when this condition isn't met, and waits when the queue is empty.

For complex operations like table deletion under snapshot isolation, DAF uses chained deferrals. For example, when transaction T1 drops a table, it defers action T1-A1, which when executed defers the actual deletion T1-A2. This ensures T1-A2 executes after all concurrent operations (like T2's inserts) complete. This approach solves ordering issues in concurrent scenarios while maintaining memory safety.

Experiments show DAF’s performance is on par with or exceeds specialized implementations, particularly in high-throughput scenarios like TPC-C benchmarks.

Beyond garbage collection, DAF can support latch-free block transformations and non-blocking schema changes. While simple schema changes (like renaming tables) can proceed without blocking transactions, complex changes that require table rewrites typically block all queries. Supporting true non-blocking schema changes requires managing multiple concurrent schema versions, which complicates transaction visibility and query planning due to the need for multiple plan cache entries.

November 22, 2024

Replica Preserve Commit Order and Measuring Lag

With multi-threaded replication (MTR), a replica can commit transactions in the same order as the source, or not. This is determined by sysvar replica_preserve_commit_order (RPCO). As of MySQL v8.0.27 (released October 2021) it’s ON by default, but it was OFF by default for several years prior. In either case, it’s relatively new compared to 20+ years of single-threaded replication for which commit order was not an issue or option. But with MTR, it’s important to understand the affects of RPCO, especially with respect to the focus of this three-part series: replication lag.