a curated list of database news from authoritative sources

November 28, 2024

1 million page views

I was delighted to notice this morning that this site has recently passed 1M page views. And since Murat wrote about his 1M page view accomplishment at the time, I felt compelled to write about mine now too.

I started regularly blogging in 2018. For some reason I decided to write a blog post every month. And while I have definitely skipped a month or two here or there, on average I've written 2 posts per month.

Tooling

Since at least 2018 this site has been built with a static site generator. I might have used a 3rd-party generator at one point, but for as long as I can remember most of this site has been built with a little Python script I wrote.

I used to get so pissed when static site generators would pointlessly change their APIs and I'd have to make equally pointless changes on my end. I have not had to make any significant changes to my build code in many years.

I hosted the site itself on GitHub Pages for many years. But I wanted more flexibility with subdomains (ultimately not something I liked) and the ability to view server-side logs (ultimately not something I ever do).

I think this site is hosted on an OVH machine now. But at this point it is inertia keeping me there. If you have no strong feelings otherwise, GitHub Pages is perfect.

I used to use Google Analytics but then they shut down the old version. The new version was incredibly confusing to use. I could not find some very basic information. So I moved to Fathom which has been great.

I used to track all subscribers in a Google Form and bcc them, but this eventually became untenable after 1,000 subscribers due to Gmail rate limits. I currently use MailerLite for subscriptions and for sending email about new posts. But it is an absolutely terrible service. They proxy all links behind a domain that adblockers hate, and they also visually shorten the URL so you can't copy the text of the URL.

I just want a service that has a hosted form for collecting subscribers and a <textarea> that lets me dump raw HTML and send that as an email to my subscribers. No branding, no watermarks, no link proxying. This apparently doesn't exist. I am too lazy to figure out Amazon SES so I stick with MailerLite for now.

Evolution

In the beginning I talked about little interpreters in JavaScript, about programming languages, about Scheme. I was into functional programming. Over time I moved into little emulators and bytecode VMs. And for the last four years I became obsessed with databases and distributed systems.

I have almost always written about little projects to teach myself a concept. Writing a bytecode VM in Rust, emulating a subset of x86 in Go, implementing Raft in Go, implementing MVCC isolation levels in Go, and so on.

So many times when I tried to learn a concept I would find blog posts with only partial code. The post would link to a GitHub repo that, by the time I got to the post, had evolved significantly beyond what was described in the post. The repo code had by then become too complex for me to follow. So I was motivated to write minimal implementations and walk through the code in its entirety.

Even today there is not a single post on implementing TCP/IP from scratch that walks through entirely working code. (Please, someone write this.)

I have also had a blast writing survey posts such as how various databases execute expressions, analyzing non-V8 JavaScript implementations, how various programming language implementations parse code, and how various database systems build on top of key-value databases.

The last two posts have even each been cited in a research paper (here and here).

Editing

In terms of quality, my single greatest trick is to read the post out loud. Multiple times. Notice parts that are awkward or unclear and rewrite them.

My second greatest trick is to ask friends for review. Some posts like an intuition for distributed consensus and a write-ahead log is not a universal part of durability would simply not have been correct or credible without my fantastic reviewers. And I'm proud to have played that part a few times in turn.

We also have a fantastic #writing-and-drafts channel on the Software Internals Discord where folks (myself occasionally included) come for post review.

Context

I've lost count of the total number of times that these posts have been on the front page of Hacker News or that a tweet announcing a post has reached triple-digit likes. I think I've had 9 posts on the front of HN this year. I do know that my single best stretch for HN was the 12 months between 2022 and 2023 when 20 of my posts or projects were on the front page.

Every time a post does well there's a part of me that worries that I've peaked. But the way to deal with this has been to ignore that little voice and to just keep learning new things. I haven't stopped finding things confusing yet, and confusion is a phenomenal muse.

And also to, like, go out and meet friends for dinner, run meetups, run book clubs, chat with you fascinating internet strangers, play volleyball, and so on.

It's always been about cultivating healthy obsessions.

Benediction

In parting, I'll remind you:

November 27, 2024

Hey Claude, help me analyze Bluesky data.

This is the full, unedited transcript of our conversation with Claude, whose context-awareness is provided by a v0 Tinybird MCP Server.

November 25, 2024

RocksDB benchmarks: large server, universal compaction

This post has results from a large server with universal compaction, from the same server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance is more likely with universal compaction.

tl;dr

  • there are some small regressions for cached workloads (see byrx below)
  • there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
  • modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't, so that I test similar code across all versions

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3
Benchmark

All tests used the default value for compaction_readahead_size and the block cache (LRU).

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible
Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.
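The relative-QPS calculation is simple but worth pinning down; here is a minimal sketch (the QPS values below are made up for illustration, not measured results):

```python
# Relative QPS: (QPS for $me) / (QPS for $base), where $base is RocksDB 6.0.2.
# Values > 1.0 mean $me is faster; values < 1.0 mean a perf regression.

def relative_qps(me_qps: float, base_qps: float) -> float:
    return me_qps / base_qps

# Hypothetical numbers, purely for illustration.
base = {"fillseq": 100000, "overwrite": 50000}   # RocksDB 6.0.2
me = {"fillseq": 80000, "overwrite": 60000}      # some newer version

for test in base:
    r = relative_qps(me[test], base[test])
    flag = "regression" if r < 1.0 else "ok"
    print(f"{test}: {r:.2f} ({flag})")
```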

Results: byrx

The byrx tests use a cached database. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
  • QPS for other tests has been stable, with some variance, since late 6.x
Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww improved in late 6.0 and have been stable since then
Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww have been stable but there is some variance


November 24, 2024

Zero Disk Architecture

State is pain. The next generation of infrastructure tools will be built on a diskless paradigm. In this short post I will explain what Diskless / Zero Disk Architecture is.

November 23, 2024

Blood draw

[trigger warning: blood]

I had my first blood draw in 13 years yesterday. The lengthy gap is not random. My last blood draw had gone horribly wrong.


The last time

That previous visit had been for a fasting blood draw. Until then, I'd never had issues with blood draws. Nurses always complimented me on my veins. One of them said that I had "veins like a garden hose, they are impossible to miss." So I felt zero fear or anxiety about the procedure. But I think I made the mistake of looking at the syringe. My blood flows fast, and a blood draw goes twice as quickly for me as for my wife. I saw my bright red blood enthusiastically gushing into the syringe. That was the last thing I remember.

Then came nothing. It was the void, and after an indeterminate time, came the reboot. 

I literally experienced my brain booting up like an old Intel 486 computer booting up DOS. First the BIOS kicked in: it checked for available memory, scanned the disk to locate the operating system, and started loading it. Then it felt like I was coming up from the pitch-black void back to the surface. When I regained full consciousness, I found myself pounding the blood draw chair with my fist. I then started hearing people. I stopped. By then several nurses had rushed into the room, and my wife looked terrified.

She later described what happened: my eyes had rolled back in their sockets, and I started thrashing violently! It lasted less than a minute, but it was so alarming that my wife thought I was dying. She began shouting and collapsed, and the nurses had to pick her up from the floor.

After researching the incident, I believe what I experienced was vasovagal syncope, specifically convulsive syncope. I found this old German research video, and I think this is what it must have looked like. This is what happens when blood pressure drops suddenly and significantly reduces blood flow to the brain.


Yesterday's return

Fast forward 13 years: I had finally returned to my primary physician for a wellness visit, where the fasting blood work was prescribed. I had been procrastinating on getting it done, but as my wife also needed blood work, we scheduled back-to-back appointments through the Quest app for 9:50 AM on Friday.

I felt surprisingly calm. No anxiety the night before or that morning. Maybe because of my computer science training, I'm good at compartmentalizing. Why worry before the event? This is a 9:50am problem.

In the waiting room, I googled how to prevent fainting during a blood draw. From the previous episode I had already learned not to watch my blood gushing into the tube. The other recommendations included using a reclined chair, breathing deeply, and crossing and tensing leg muscles to maintain blood flow to the brain.

I informed the phlebotomist that I had fainted before, and we arranged for a reclined chair. I offered my left arm to the phlebotomist, and my wife held my right hand. Honestly, she seemed more nervous than I was, as I remained detached from the situation. I don't think I have a needle phobia; it's specifically seeing blood while fasting that triggers me. Ironically, I find it more distressing to watch my wife get blood drawn than to undergo it myself.

Ok, the plastic wrap goes on my left arm, and I make a fist. The needle goes in with a slight pinch. No problemo! But uh oh. The phlebotomist gives up after 10 seconds, saying she missed the vein. She decides to try my right arm instead, and says she will use a butterfly this time, because my veins are not showing well. Wait, what? I thought I had garden hoses for veins. Checking later, I noticed my veins were barely visible that morning but became quite prominent by evening. Apparently, I was having a bad vein morning.

Ok, they reposition around me. Now my wife clutches my left hand with both her hands, feeling panicked. I again remain calm, more curious than worried. The needle goes in; this time it hurts more, so I know it hit the vein. Good. I check myself, and I feel OK. Three tubes later, we are done. My wife suggests I stay seated, but I feel fine and stand up immediately.

That's it. Fortunately it is an anticlimactic ending to the story, though it probably wasn't worth your last few minutes. I don't think there is much point to this post; I just thought it was interesting. Oh, I also still feel sad thinking about the waiting room at the lab. It was mostly elderly patients, some having a hard time walking. This made me uncomfortably aware that I'll likely be visiting these labs more frequently in my later years, when I'm weaker and less steady on my feet. This is definitely not my happy place.

On the one hand, I am very optimistic about the next 30 years, looking forward to seeing how distributed systems, machine learning, space exploration, bio-exploration, and sub-atomic exploration advance. On the other hand, I do not look forward to getting any older. I am middle-aged!

Everything is a Transaction: Unifying Logical Concurrency Control and Physical Data Structure Maintenance in Database Management Systems

This paper from CIDR'21 introduces the Deferred Action Framework (DAF). This framework aims to unify transaction control and data structure maintenance under multi-version concurrency control (MVCC) database systems, particularly for complex maintenance tasks like garbage collection and index cleanup.

In MVCC systems, transactions and data maintenance tasks (like garbage collection) often interfere, requiring complex coordination. This can lead to issues in system performance and code maintainability, as physical structures (e.g., B+ trees) aren't inherently designed to handle multi-versioning.

The paper proposes DAF to handle deferred actions—maintenance tasks that are queued to execute only when they no longer interfere with active transactions. DAF relies on timestamp-based ordering and executes tasks only when their actions won’t affect concurrent transactions. DAF guarantees to process actions deferred at some timestamp t only after all transactions started before t have exited. This provides *epoch protection* to transactions and maintenance tasks, without requiring a separate mechanism for refreshing and advancing epochs. Using a global epoch counter might have been simpler, but it would disallow the complex, multi-deferral-based ordering of actions needed for DAF’s broader applications like schema changes and index maintenance.

DAF is implemented in NoisePage (notice Pavlo's patented database naming scheme in action). The implementation chains deferrals to handle dependencies (e.g., an index cleanup must happen before a table deletion). DAF uses a single-action queue with timestamp tags, and processes tasks sequentially ensuring no conflicts arise. DAF supports multi-threaded processing, with both cooperative and dedicated action processing threads for high concurrency.

As Figure 1 shows, DAF uses a single action queue with a dedicated execution thread. DAF's transaction management process follows a specific sequence. First, a transaction worker gets an "observable" timestamp by incrementing the global counter. It then tags its deferred actions with this timestamp and adds them to the queue. When the transaction begins, the worker increments the global counter again. The action thread processes queue items by comparing queue item timestamps against the oldest active transaction timestamp. It executes items when their timestamp is smaller or when no active transactions exist, blocks when this condition isn't met, and waits when the queue is empty.

For complex operations like table deletion under snapshot isolation, DAF uses chained deferrals. For example, when transaction T1 drops a table, it defers action T1-A1, which when executed defers the actual deletion T1-A2. This ensures T1-A2 executes after all concurrent operations (like T2's inserts) complete. This approach solves ordering issues in concurrent scenarios while maintaining memory safety.
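To make the mechanism concrete, here is a toy sketch of a DAF-style deferral queue. This is my own simplification, not NoisePage's actual implementation: actions are tagged with an "observable" timestamp at deferral time and run only once every transaction that began before that timestamp has exited, and a chained deferral is just an action that defers another action when it executes:

```python
import heapq
from typing import Callable

class DeferredActionFramework:
    """Toy sketch of DAF: one timestamp-ordered action queue."""

    def __init__(self):
        self.counter = 0      # global timestamp counter
        self.active = set()   # start timestamps of live transactions
        self.queue = []       # min-heap of (timestamp, seq, action)
        self.seq = 0

    def defer(self, action: Callable[[], None]):
        self.counter += 1     # tag with an "observable" timestamp
        heapq.heappush(self.queue, (self.counter, self.seq, action))
        self.seq += 1

    def begin(self) -> int:
        self.counter += 1
        self.active.add(self.counter)
        return self.counter

    def commit(self, ts: int):
        self.active.discard(ts)

    def process(self):
        # Run every action deferred before the oldest active transaction began.
        oldest = min(self.active) if self.active else float("inf")
        while self.queue and self.queue[0][0] < oldest:
            _, _, action = heapq.heappop(self.queue)
            action()          # may call defer() again: a chained deferral

# T1 drops a table: A1 is deferred, and when A1 runs it defers the actual
# deletion A2, which must then wait for any transaction (T2) still running.
log = []
daf = DeferredActionFramework()
daf.defer(lambda: daf.defer(lambda: log.append("delete table")))  # T1-A1
t2 = daf.begin()   # T2 starts; it may still be reading the table
daf.process()      # A1 runs and defers A2; A2 must wait for T2
assert log == []
daf.commit(t2)     # T2 exits
daf.process()      # now A2 runs safely
print(log)         # ['delete table']
```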

Experiments show DAF’s performance is on par with or exceeds specialized implementations, particularly in high-throughput scenarios like TPC-C benchmarks.

Beyond garbage collection, DAF can support latch-free block transformations and non-blocking schema changes. While simple schema changes (like renaming tables) can proceed without blocking transactions, complex changes that require table rewrites typically block all queries. Supporting true non-blocking schema changes requires managing multiple concurrent schema versions, which complicates transaction visibility and query planning due to the need for multiple plan cache entries.

November 22, 2024

Replica Preserve Commit Order and Measuring Lag

With multi-threaded replication (MTR), a replica can commit transactions in the same order as the source, or not. This is determined by sysvar replica_preserve_commit_order (RPCO). As of MySQL v8.0.27 (released October 2021) it's ON by default, but it was OFF by default for several years prior. In either case, it's relatively new compared to 20+ years of single-threaded replication for which commit order was not an issue or option. But with MTR, it's important to understand the effects of RPCO, especially with respect to the focus of this three-part series: replication lag.
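A toy Python model (not MySQL code; the class and names are mine) of what RPCO=ON enforces: applier threads may execute transactions in parallel, but each one must wait its turn to commit so that commit order matches source order:

```python
import threading

class CommitOrderGate:
    """Toy model of replica_preserve_commit_order=ON: parallel appliers,
    serialized commits in source order."""

    def __init__(self):
        self.next_to_commit = 1
        self.cond = threading.Condition()

    def commit(self, source_order: int, do_commit):
        with self.cond:
            while self.next_to_commit != source_order:
                self.cond.wait()        # earlier transactions commit first
            do_commit()
            self.next_to_commit += 1
            self.cond.notify_all()

gate = CommitOrderGate()
committed = []

def apply_transaction(n):
    # ... the transaction's changes would be applied here, in parallel ...
    gate.commit(n, lambda: committed.append(n))

# Appliers may finish out of order (3 starts first), but commits stay ordered.
threads = [threading.Thread(target=apply_transaction, args=(n,)) for n in (3, 1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(committed)   # always [1, 2, 3]
```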

November 21, 2024

DDIA: Chp 10. Batch Processing

Batch processing allows large-scale data transformations through bulk-synchronous processing. The simplicity of this approach allowed building reliable, scalable, maintainable applications with it. If you recall, "reliable-scalable-maintainable" was what we set out to learn when we began the DDIA book.

The story of MapReduce starts when Google engineers realized there were a lot of repetitive tasks involved when computing over large data. These tasks often involved individually processing elements and then gathering and fusing their output. Interestingly, this bears a striking resemblance to electromechanical IBM card-sorting machines from the 1940s-50s. MapReduce also got some inspiration from the map and reduce operations in Lisp: (map square '(1 2 3 4)) gives us (1 4 9 16), and (reduce + '(1 4 9 16)) gives us 30. The key innovation of Google's MapReduce framework was its ability to simplify parallel processing by abstracting away complex network communication and failure handling. By separating network communication from application logic, MapReduce shielded developers from intricate distributed computing challenges (partial failures, distributed state management, etc). Tasks could be automatically retried without disrupting the core application workflow.
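For comparison, the same two Lisp operations expressed in Python:

```python
from functools import reduce
from operator import add

# map: apply a function to each element independently.
squared = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squared)               # [1, 4, 9, 16]

# reduce: fuse the mapped outputs into a single result.
print(reduce(add, squared))  # 30
```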

The MapReduce framework's elegance lay in its simplicity: providing a straightforward mechanism for parallelizing computations through map and reduce operations. Though MapReduce had many limitations/inefficiencies and Google retired the approach by 2014, its impact on distributed computing still remains important. It is hard to deny that MapReduce paved the way to commodity distributed computing.

Before beginning to explain MapReduce, the book takes a detour through batch processing in Unix/Linux: modular commands where output are piped as input to the next command in sequence. That section is skippable. By the way, I think the original MapReduce paper from OSDI 2004 is worth a read itself. It is written clearly and is easy to follow. 


MapReduce and Distributed Filesystems

Hadoop MapReduce leverages HDFS (Hadoop Distributed File System), which is an open-source reimplementation of Google's File System (GFS). Built on the shared-nothing principle, HDFS radically transformed traditional data storage from custom expensive hardware towards cheap and off-the-shelf nodes. HDFS also enabled indiscriminate data dumping. Unlike traditional massively parallel processing (MPP) databases requiring meticulous upfront data modeling, HDFS allowed raw data storage with interpretation deferred to consumption. This "schema-on-read" approach, dubbed the "sushi principle" (raw data is better), enabled flexible data transformation and multiple interpretations.

HDFS consists of a daemon process running on each machine to serve files stored on that machine, with a central NameNode tracking file block placements across machines. This simple design enables creation of a massive fault-tolerant filesystem by leveraging distributed storage on many machines and implementing block replication.

Ok, back to MapReduce. The MapReduce job model is centered around two callback functions: the mapper and reducer. Mappers process individual input records independently, extracting key-value pairs, while reducers aggregate and process values associated with specific keys. These two operations, when chained as multi-stage processing jobs, are sufficient to implement complex data transformations.
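As an illustration of the job model (a toy sketch of my own, not the Hadoop API), here is the canonical word-count job: a mapper emitting key-value pairs, a hash-partitioned shuffle grouping values by key, and a reducer aggregating each key's values:

```python
from collections import defaultdict

def mapper(record):
    """Process one input record independently, emitting (word, 1) pairs."""
    for word in record.split():
        yield word, 1

def reducer(key, values):
    """Aggregate all values associated with one key."""
    return key, sum(values)

def run_job(records, num_reducers=2):
    # Shuffle: route each key to a reducer partition by key hash, so the
    # same key is always processed by the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in mapper(record):
            partitions[hash(key) % num_reducers][key].append(value)
    # Reduce: each partition is processed independently (in parallel on
    # a real cluster).
    out = {}
    for part in partitions:
        for key, values in part.items():
            k, v = reducer(key, values)
            out[k] = v
    return out

counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"])   # 3 2
```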


MapReduce Job Execution

MapReduce job execution strategically assigns map tasks to machines storing input file replicas, minimizing network data transfer and optimizing computational resource utilization. First, the framework copies necessary application code to assigned machines. Map tasks then process input files record by record, generating key-value pair outputs. Reduce tasks are partitioned using key hash algorithms to ensure consistent key processing.

Data sorting occurs in stages due to massive dataset sizes. Mappers partition output by reducer, writing sorted files locally using techniques similar to SSTables. When mappers complete, reducers fetch these sorted partition files through a process called "shuffle". Reducers merge mapper files applying user provided logic, maintaining sort order, and process records with identical keys sequentially.

MapReduce jobs are often chained together in workflows, where the output of one job becomes the input of the next. Since Hadoop MapReduce lacks native workflow support, jobs are linked by configuring output and input directory names. Large organizations did have complex workflows with hundreds of MapReduce jobs. To manage these data flows, several higher-level tools had been developed, including: Pig, Hive, Cascading, Crunch, FlumeJava.


Beyond MapReduce

Back in the 2010s I was flabbergasted as to why people would use MapReduce, because it was so inefficient: so much data movement, so much need for synchronization, and such a limited language (just map and reduce) to express computation. Yes, MapReduce provided parallelism and scalability, but at what COST? Frank McSherry argued that much of that scalability came from "poor baselines and low expectations".

But people cared more about usability, and didn't care if the MapReduce job took an extra 3 hours here and 5 hours there. They were ready to wait anyway, since this is offline batch processing. MapReduce shielded them from distributed systems challenges and from thinking too hard to develop an efficient algorithm, and that was more important to them. MapReduce was too simple and coarse-grained to be efficient, but it paved the way to more efficient dataflow engines. Respect what came before.

New dataflow engines like Spark, Tez, and Flink overcame some of MapReduce's limitations. Unlike MapReduce's rigid map-reduce model, these engines enable more flexible operator connections and smarter data processing. They optimize computational resources by selectively sorting, intelligently scheduling tasks, reducing unnecessary data movement, and improving data locality. By explicitly modeling the data flow, these systems are able to make nuanced scheduling decisions to speed-up processing. The engines also introduce advanced join strategies and minimize computational overhead by supporting in-memory intermediate states, reducing I/O demands, and enabling faster task startup.