a curated list of database news from authoritative sources

December 03, 2024

The History of the Decline and Fall of In-Memory Database Systems

In the early 2010s, the drop in memory prices combined with an overall increase in the reliability of computer hardware fueled a minor revolution in the world of database systems. Traditionally, slow but durable magnetic disk storage was the source of truth for a database system. Only when data needed to be analyzed or updated would it be briefly cached in memory by the buffer manager. And as memory got bigger and faster, the access latency of magnetic disks quickly became a bottleneck for many systems.

December 01, 2024

Threads Won't Take You South of Market

In June 2023, when Threads announced their plans to federate with other Fediverse instances, there was a good deal of debate around whether smaller instances should allow federation or block it pre-emptively. As one of the admins of woof.group, I wrote about some of the potential risks and rewards of federating with Threads. We decided to wait and see.

In my queer and leather circles, Facebook and Instagram have been generally understood as hostile environments for over a decade. In 2014, their “Real Name” policy made life particularly difficult for trans people, drag queens, sex workers, and people who, for various reasons, needed to keep their real name disconnected from their queer life. My friends have been repeatedly suspended from both platforms for showing too much skin, or using the peach emoji. Meta’s moderation has been aggressive, opaque, and wildly inconsistent: sometimes full nudity is fine; other times a kiss or swimsuit is beyond the line. In some circles, maintaining a series of backup accounts in advance of one’s ban became de rigueur.

I’d hoped that federation between Threads and the broader Fediverse might allow a more nuanced spectrum of moderation norms. Threads might opt for a more conservative environment locally, but through federation, allow their users to interact with friends on instances with more liberal norms. Conversely, most of my real-life friends are still on Meta services—I’d love to see their posts and chat with them again. Threads could communicate with Gay Fedi (using the term in the broadest sense), and de-rank or hide content they don’t like on a per-post or per-account basis.

This world seems technically feasible. Meta reports 275 million Monthly Active Users (MAUs) for Threads, and over three billion across its other services. The Fediverse has something like one million MAUs across various instances. Absorbing that traffic is not a large jump in processing or storage; nor would it seem to require a large increase in moderation staff. Threads has already committed to doing the requisite engineering, user experience, and legal work to allow federation across a broad range of instances. Meta is swimming in cash.

All this seems a moot point. A year and a half later, Threads is barely half federated. It publishes Threads posts to the world, but only if you dig into the settings and check the “Fediverse Sharing” box. Threads users can see replies to their posts, but can’t talk back. They can’t mention others, see mentions from other people, or follow anyone outside Threads. This may work for syndication, but is essentially unusable for conversation.

Despite the fact that Threads users can’t follow or see mentions from people on other instances, Threads has already opted to block a slew of instances where gay & leather people congregate. Threads blocks hypno.social, rubber.social, 4bear.com, nsfw.lgbt, kinkyelephant.com, kinktroet.social, barkclub.xyz, mastobate.social, and kinky.business. They also block the (now-defunct) instances bear.community, gaybdsm.group, and gearheads.social. They block more general queer-friendly instances like bark.lgbt, super-gay.co, gay.camera, and gaygeek.social. They block sex-positive instances like nsfwphotography.social, nsfw.social, and net4sw.com. All these instances are blocked for having “violated our Community Standards or Terms of Use”. Others, like fisting.social, mastodon.hypnoguys.com, abdl.link, qaf.men, and social.rubber.family, are blocked for having “no publicly accessible feed”. I don’t know what this means: hypnoguys.social, for instance, has the usual Mastodon publicly accessible local feed.

It’s not like these instances are hotbeds of spam, hate speech, or harassment: woof.group federates heavily with most of the servers I mentioned above, and we rarely have problems with their moderation. Most have reasonable and enforced media policies requiring sensitive-media flags for genitals, heavy play, and so on. Those policies are, generally speaking, looser than Threads’ (woof.group, for instance, allows butts!), but there are plenty of accounts and posts on these instances which would be anodyne under Threads’ rules.

I am shocked that woof.group is not on Threads’ blocklist yet. We have similar users who post similar things. Our content policies are broadly similar—several of the instances Threads blocks actually adopted woof.group’s specific policy language. I doubt it’s our size: Threads blocks several instances with fewer than ten MAUs, and woof.group has over seven hundred.

I’ve been out of the valley for nearly a decade, and I don’t have insight into Meta’s policies or decision-making. I’m sure Threads has their reasons. Whatever they are, Threads, like all of Meta’s services, feels distinctly uncomfortable with sex, and sexual expression is a vibrant aspect of gay culture.

This is part of why I started woof.group: we deserve spaces moderated with our subculture in mind. But I also hoped that by designing a moderation policy which compromised with normative sensibilities, we might retain connections to a broader set of publics. This particular leather bar need not be an invite-only clubhouse; it can be a part of a walkable neighborhood. For nearly five years we’ve kept that balance, retaining open federation with almost all of the Fediverse. I get the sense that Threads intends to wall its users off from our world altogether—to make “bad gays” invisible. If Threads were a taxi service, it wouldn’t take you South of Market.

November 30, 2024

RocksDB on a big server: LRU vs hyperclock, v2

This post shows that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a large server to show the speedup from the hyperclock block cache implementation at different concurrency levels with RocksDB 9.6. Here I share results from the same server across different (old and new) RocksDB releases.

Results are amazing on a large (48-core) server with 40 client threads:

  • ~2X more QPS for range queries with hyperclock
  • ~3X more QPS for point queries with hyperclock

Software

I used RocksDB versions 6.0.2, 6.29.5, 7.0.4, 7.6.0, 7.7.8, 8.5.4, 8.6.7, 9.0.1, 9.1.2, 9.3.2, 9.5.2, 9.7.4 and 9.9.0. Everything was compiled with gcc 11.4.0.

The --cache_type argument selected the block cache implementation (see the sketch after this list):

  • lru_cache was used for versions 7.6 and earlier. Because some of the oldest releases don't support --cache_type, I also used --undef_params=...,cache_type
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.6 and later
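A minimal sketch of how this selection might look on the command line. The --cache_type values are the real db_bench options; the benchmark name and thread count here are illustrative, not the exact settings behind these results.

  # 7.6 and earlier: the original LRU block cache
  ./db_bench --benchmarks=readwhilewriting --threads=40 --cache_type=lru_cache
  # 7.7 through 8.5: the first hyperclock implementation
  ./db_bench --benchmarks=readwhilewriting --threads=40 --cache_type=hyper_clock_cache
  # 8.6 and later: hyperclock with automatic table sizing
  ./db_bench --benchmarks=readwhilewriting --threads=40 --cache_type=auto_hyper_clock_cache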

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use a CPU-bound workload with a database that is cached by RocksDB, and the benchmark is run with 40 threads.

I focus on the read-heavy benchmark steps:

  • revrangeww (reverse range while writing) - this does short reverse range scans
  • fwdrangeww (forward range while writing) - this does short forward range scans
  • readww (read while writing) - this does point queries

For each of these there is a fixed rate for writes done in the background and performance is reported for the reads. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.
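To make that concrete, here is a hedged sketch of one read-while-writing step. The flags are real db_bench options, but the values are illustrative rather than the settings used for these results.

  # 40 reader threads run at full speed while the background writer is
  # rate-limited (--benchmark_write_rate_limit is in bytes per second).
  ./db_bench --benchmarks=readwhilewriting --threads=40 --duration=1800 \
      --benchmark_write_rate_limit=$((2 * 1024 * 1024))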

Results

All results are in this spreadsheet and the performance summary is here.

The graph below shows relative QPS, which is (QPS for my version / QPS for RocksDB 6.0.2), and the results are amazing:

  • ~2X more QPS for range queries with hyperclock
  • ~3X more QPS for point queries with hyperclock

The average values for vmstat metrics provide more detail on why hyperclock is so good for performance. The context switch rate drops dramatically when it is enabled because there is much less mutex contention. The user CPU utilization increases by ~1.6X because more useful work can get done when there is less mutex contention.

Legend:
  • cs - context switches per second per vmstat
  • us - user CPU utilization per vmstat
  • sy - system CPU utilization per vmstat
  • id - idle CPU utilization per vmstat
  • wa - wait CPU utilization per vmstat
  • version - RocksDB version

cs      us      sy      us+sy   id      wa      version
1495325 50.3    14.0    64.3    18.5    0.1     7.6.0
2360    82.7    14.0    96.7    16.6    0.1     9.9.0
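For reference, a minimal sketch of how per-step vmstat averages like those above can be collected; the one-second interval and 60-sample count are illustrative.

  # Average the vmstat columns discussed above over 60 one-second samples.
  # Column positions assume the usual procps layout (cs=$12, us=$13, sy=$14,
  # id=$15, wa=$16); the first two lines are headers and are skipped.
  vmstat 1 60 | awk 'NR > 2 { n++; cs+=$12; us+=$13; sy+=$14; id+=$15; wa+=$16 }
      END { printf "cs %.0f us %.1f sy %.1f id %.1f wa %.1f\n", cs/n, us/n, sy/n, id/n, wa/n }'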

November 28, 2024

1 million page views

I was delighted to notice this morning that this site has recently passed 1M page views. And since Murat wrote about his 1M page view accomplishment at the time, I felt compelled to now too.

I started regularly blogging in 2018. For some reason I decided to write a blog post every month. And while I have definitely skipped a month or two here or there, on average I've written 2 posts per month.

Tooling

Since at least 2018 this site has been built with a static site generator. I might have used a 3rd-party generator at one point, but for as long as I can remember most of this site has been built with a little Python script I wrote.

I used to get so pissed when static site generators would pointlessly change their APIs and I'd have to make pointless changes. I have not had to make any significant changes to my build code in many years.

I hosted the site itself on GitHub Pages for many years. But I wanted more flexibility with subdomains (ultimately not something I liked) and the ability to view server-side logs (ultimately not something I ever do).

I think this site is hosted on an OVH machine now. But at this point it is inertia keeping me there. If you have no strong feelings otherwise, GitHub Pages is perfect.

I used to use Google Analytics but then they shut down the old version. The new version was incredibly confusing to use. I could not find some very basic information. So I moved to Fathom which has been great.

I used to track all subscribers in a Google Form and bcc them, but this eventually became untenable after 1,000 subscribers due to Gmail rate limits. I currently use MailerLite for subscriptions and sending email about new posts. But it is an absolutely terrible service. They proxy all links behind a domain that adblockers hate, and they also visually shorten the URL so you can't copy the text of the URL.

I just want a service that has a hosted form for collecting subscribers and a <textarea> that lets me dump raw HTML and send that as an email to my subscribers. No branding, no watermarks, no link proxying. This apparently doesn't exist. I am too lazy to figure out Amazon SES so I stick with MailerLite for now.

Evolution

In the beginning I talked about little interpreters in JavaScript, about programming languages, about Scheme. I was into functional programming. Over time I moved into little emulators and bytecode VMs. And for the last four years I have been obsessed with databases and distributed systems.

I have almost always written about little projects to teach myself a concept. Writing a bytecode VM in Rust, emulating a subset of x86 in Go, implementing Raft in Go, implementing MVCC isolation levels in Go, and so on.

So many times when I tried to learn a concept I would find blog posts with only partial code. The post would link to a GitHub repo that, by the time I got to the post, had evolved significantly beyond what was described in the post. The repo code had by then become too complex for me to follow. So I was motivated to write minimal implementations and walk through the code in its entirety.

Even today there is not a single post on implementing TCP/IP from scratch that walks through entirely working code. (Please, someone write this.)

I have also had a blast writing survey posts such as how various databases execute expressions, analyzing non-V8 JavaScript implementations, how various programming language implementations parse code, and how various database systems build on top of key-value databases.

The last two posts have even each been cited in a research paper (here and here).

Editing

In terms of quality, my single greatest trick is to read the post out loud. Multiple times. Notice parts that are awkward or unclear and rewrite them.

My second greatest trick is to ask friends for review. Some posts, like “an intuition for distributed consensus” and “a write-ahead log is not a universal part of durability”, would simply not have been correct or credible without my fantastic reviewers. And I'm proud to have played that part a few times in turn.

We also have a fantastic #writing-and-drafts channel on the Software Internals Discord where folks (myself occasionally included) come for post review.

Context

I've lost count of the total number of times that these posts have been on the front page of Hacker News or that a tweet announcing a post has reached triple-digit likes. I think I've had 9 posts on the front of HN this year. I do know that my single best year for HN was a 12-month stretch between 2022 and 2023 when 20 of my posts or projects were on the front page.

Every time a post does well there's a part of me that worries that I've peaked. But the way to deal with this has been to ignore that little voice and to just keep learning new things. I haven't stopped finding things confusing yet, and confusion is a phenomenal muse.

And also to, like, go out and meet friends for dinner, run meetups, run book clubs, chat with you fascinating internet strangers, play volleyball, and so on.

It's always been about cultivating healthy obsessions.

Benediction

In parting, I'll remind you:

November 27, 2024

Hey Claude, help me analyze Bluesky data.

This is the full, unedited transcript of our conversation with Claude, whose context-awareness is provided by a v0 Tinybird MCP Server.

November 25, 2024

RocksDB benchmarks: large server, universal compaction

This post has results for universal compaction from the same large server for which I recently shared leveled compaction results. The results are boring (no large regressions) but a bit more exciting than the ones for leveled compaction because there is more variance. A somewhat educated guess is that variance is more likely with universal compaction.

tl;dr

  • there are some small regressions for cached workloads (see byrx below)
  • there are some small to medium improvements for IO-bound workloads (see iodir and iobuf)
  • modern RocksDB would look better were I to use the Hyper Clock block cache, but here I don't, so that similar code is tested across all versions

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Builds

I compiled db_bench from source on all servers. I used versions:
  • 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
  • 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
  • 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
  • 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3

Benchmark

All tests used the default value for compaction_readahead_size and the LRU block cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: bash x3.sh 40 no 1800 c48r128 100000000 2000000000 byrx iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible

Workloads

There are three workloads, all of which use 40 threads (see the sketch after this list):

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)
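A hedged sketch of how the buffered vs O_DIRECT variants might differ on the command line; --use_direct_reads and --use_direct_io_for_flush_and_compaction are real db_bench flags, while the other values here are illustrative.

  # iobuf: larger-than-memory database, buffered IO
  ./db_bench --benchmarks=readwhilewriting --threads=40 --num=2000000000 \
      --use_direct_reads=false --use_direct_io_for_flush_and_compaction=false
  # iodir: the same database, but reads and compaction use O_DIRECT
  ./db_bench --benchmarks=readwhilewriting --threads=40 --num=2000000000 \
      --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true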

A spreadsheet with all results is here and performance summaries with more details are here for byrx, iobuf, and iodir.

Relative QPS

The numbers in the spreadsheet and on the y-axis in the charts that follow are the relative QPS which is (QPS for $me) / (QPS for $base). When the value is greater than 1.0 then $me is faster than $base. When it is less than 1.0 then $base is faster (perf regression!).

The base version is RocksDB 6.0.2.
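As a worked example of that arithmetic, here is a sketch that computes relative QPS from a hypothetical results.tsv containing version<TAB>qps rows, assuming the 6.0.2 base row comes first.

  # Relative QPS = (QPS for $me) / (QPS for $base); the base is RocksDB 6.0.2.
  awk -F'\t' '$1 == "6.0.2" { base = $2 } { printf "%s\t%.3f\n", $1, $2 / base }' results.tsv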

Results: byrx

The byrx tests use a cached database. The performance summary is here.

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has new CPU overhead in 7.0 from code added for correctness checks and QPS has been stable since then
  • QPS for other tests has been stable, with some variance, since late 6.x

Results: iobuf

The iobuf tests use an IO-bound database with buffered IO. The performance summary is here.

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww improved in late 6.0 and have been stable since then

Results: iodir

The iodir tests use an IO-bound database with O_DIRECT. The performance summary is here.

The chart shows the relative QPS for a given version relative to RocksDB 6.0.2. There are two charts and the second narrows the range for the y-axis to make it easier to see regressions.

Summary:
  • fillseq has been stable since 7.6
  • readww has always been stable
  • overwrite improved in 7.6 and has been stable since then
  • fwdrangeww and revrangeww have been stable but there is some variance