a curated list of database news from authoritative sources

July 10, 2024

Dealing with large tables

Large databases often have a small number of very large tables that make scaling difficult. How can you scale these tables while keeping your database performant? This article covers three techniques.

July 07, 2024

A reawakening of systems programming meetups

This year has seen a resurgence in really high quality systems programming meetups. Munich Database Meetup, Berlin Systems Group, SF Distributed Systems Meetup, NYC Systems, Bengaluru Systems, to name a few.

This post summarizes a bit of disappointing recent tech meetup history, the new trend of excellent systems programming meetups, and ends with some encouragement and guidance for running your own systems programming events.

I will be a little critical in this post but I want to preface by saying: organizing meetups is really tough! It takes a lot of work and I have a huge amount of respect for meetup organizers even when their meetup style did not resonate with me.

Although much of this post talks about NYC Systems, the reason I think this post is worth writing is because so many other meetups in a similar vein popped up. I hope to encourage these other meetups and to encourage folks in other major metros (London, for example) to start similar meetups.

Meetups

I used to attend a bunch of meetups before the pandemic. But I quickly got disillusioned. Almost every meetup was, to varying degrees, startups pitching their product. The last straw for me was sitting through a talk at a JavaScript meetup, given by a devrel employee of a startup, that was literally a tutorial for their product.

There were also some pretty intelligent meetups like the New York Haskell Users Group and the New York Emacs Meetup. But since I wasn't an expert in either domain, and the attendees almost all appeared to be, I didn't particularly enjoy going.

There were a couple of meetups that felt inclusive of attendees at various skill levels yet still went into interesting depth. Specifically, the New York Linux User Group and Papers We Love NYC.

These meetups were exceptional because they were language- and framework-agnostic; they would start broad to give you background, but then go deep into a topic. Maybe you only understood 50% of what was covered, but you got exposed to something new from an expert in that domain.

Unfortunately, the pandemic happened and these two excellent meetups basically have not come back.

A couple of students in Munich

The pandemic ended and I tried a couple of meetups I thought might be better quality: Rust and Go. But they weren't much better than I remembered. People would give a high-level talk and brush past all the interesting concepts.

I had been thinking of doing an in-person talk series since 2022.

But I was busy with TigerBeetle until December of 2023 when I was messaged on LinkedIn by Georg Kreuzmayr, a graduate student at Technical University of Munich (TUM).

Georg and his friends, fellow graduate students at TUM, started a database club: TUMuchData. We got to talking about opportunities for collaboration and I started feeling a bit embarrassed that a graduate student had more guts than I had to get back onto the meetup organizer wagon.

A week later, with assurance from Justin Jaffray that at least he would show up with me if no one else did, I started the NYC Systems Coffee Club to bring together folks in NYC interested in any topic of systems programming (e.g. compilers, databases, web browser internals, distributed systems, formal methods, etc.), in a completely informal setting, for coffee at 9am in a public space in midtown Manhattan.

I set up that linked web page and started collecting subscribers to the club via Google Form. Once a month I'd send an email out to the list asking for RSVPs to this month's coffee club. The first 20 to respond would get a calendar invite.

And about the same time I started asking around on Twitter/LinkedIn if someone would be interested in co-organizing a new systems programming meetup in NYC. Angelo Saraceno immediately took me up on the idea and we met up.

NYC Systems

We agreed on the premise: this would be a language- and framework-agnostic meetup focused on engineering challenges, not product pitches. It would still be 100% corporate marketing, but corporate marketing of the engineering team, not the product.

NYC Systems was born!

We'd find speakers who could start broad and dive deep into some interesting aspect of databases, programming languages, distributed systems, and so on. A brief product pitch might be necessary to establish context, but the focus of the talk would be some interesting recent technical challenge and how the team dealt with it.

We'd schedule talks only every other month to ease our own burden in organizing and finding great speakers.

Once Angelo and I had decided to go forward, the next two challenges were finding speakers and finding a venue. Thanks to Twitter and LinkedIn, finding speakers turned out to be the easy part.

It was harder to find a venue. It was surprisingly challenging to find a company in NYC that shared our vision: that the value of being associated with a meetup like this comes from the quality of speakers and audience you can attract by not allowing transparent product pitches.

Almost every company with space in Manhattan that we spoke with required that they have their own speaker each night. That seemed like a bad idea.

I think it was especially challenging to find a company willing to relax about branding requirements like this because we were a new meetup.

It was pretty frustrating not to find a sympathetic company with space in Manhattan. The only reason we didn't give up was that Angelo was so adamant that this kind of meetup actually happen. It's always best to start something new with someone else for this exact reason. You can keep each other going.

In the end we went with the one company that did not insist on their own speaker or their own branding: Trail of Bits, a Brooklyn-based company whose CEO immediately got in touch to say they wanted to host us.

How it works

To keep things easy, I set up a web page on my personal site with information about the meetup. (Eventually we moved this to nycsystems.xyz.) I set up a Google Form to collect emails for a mailing list. And we started posting about the group on Twitter and LinkedIn.

We published the event calendar in advance (an HTML table on the website) and announced each event's speakers a week in advance of the event. I'd send another Google Form to the mailing list taking RSVPs for the night. The first 60 people to respond got a Google Calendar invite.

It's a bit of work, sure, but I'd do anything to avoid Meetup.com.

It is interesting to see every new systems programming meetup also not pick Meetup.com. The only one that went with it, the Munich Database Meetup, is a revival of an existing group, the Munich NoSQL Meetup, and presumably they didn't want to give up their subscribers. Most of the others use lu.ma.

The mailing list is now 400+ people. Each event RSVP gets a waitlist of 20-30 people. And although 60 people say yes initially, by the time of the event we typically get about 50 people in attendance.

At each event, Trail of Bits provided screens, chairs, food, and drink. Angelo had recording equipment so he took over audio/video capture (and later editing and publishing).

After each event we'd publish talk videos to our @NYCSystems YouTube channel.

Network effects

In March 2024, the TUMuchData folks joined Alex Petrov's Munich NoSQL Meetup to form the Munich Database Meetup. In May, Kaivalya Apte and Manish Gill started the Berlin Systems Group, inspired by Alex and the Munich Database Meetup.

In May 2024, two PhD students in the San Francisco Bay Area, Shadaj Laddad and Conor Power, started the SF Distributed Systems meetup.

And in July 2024, Shraddha Agrawal, Anirudh Rowjee and friends kicked off the first Bengaluru Systems Meetup.

Suggestions

First off, don't pay for anything yourself. Find a company that will host. At the same time, don't feel the need to give in too much to the company's demands. I'd be happy to help you think through how to talk about the event with companies. It is mutually beneficial: they get to give a 5-minute hiring/product pitch without needing to do extensive branding or give a 30-minute product tutorial.

Second, keep a bit of pressure on speakers not to do an overview talk and not to do a product pitch. Suggest that they tell the story of some interesting recent bug or feature. What happened? Why was it hard? What did you learn?

Focusing on these types of talks will help you get a really interesting audience.

I have been continuously surprised and impressed by the folks who show up for NYC Systems. It's a mix of technical founders and pretty experienced developers in the systems space, graduate students, and developers of all sorts.

I am certain we can only get these kinds of folks to show up because we avoid product pitch-type talks.

Third, finding speakers is still hard! The best approach so far has been to individually message folks in industry and academia who hang out on Twitter. Sending out a public call is easy but doesn't often pan out. So keep an eye on interesting companies in the area.

Another avenue I've been thinking about is messaging VC connections to ask them if they know any engineers/technical founders/CTOs in the area who could give an interesting technical talk.

Fourth, speak with other organizers! I finally met Alex Petrov in person last month and we had a great time talking about the challenges and joys of organizing really high quality meetups.

I'm always happy to chat, DMs are open.

July 03, 2024

Multi-tenant analytics for SaaS applications

Customer-facing analytics becomes more challenging in multi-tenant environments. Learn strategies for building multi-tenant analytics in a secure and scalable way.

July 01, 2024

A write-ahead log is not a universal part of durability

A database does not need a write-ahead log (WAL) to achieve durability. A database can write its long-term data structure durably to disk before returning to a client. Granted, this is a bad idea! And granted, a WAL is critical for durability by design in most databases. But I think it's helpful to understand WALs by understanding what you could do without them.

So let's look at what terrible design we can make for a durable database that has no write-ahead log, to motivate the idea of, and build an intuition for, a write-ahead log.

Thank you to Alex Miller for reviewing a version of this post.

But first, what is durability?

Durability

Durability happens in the context of a request a client makes to a data system (either an embedded system like SQLite or RocksDB or a standalone system like Postgres). Durability is a spectrum of guarantees the server provides when a client requests to write some data: that either the request succeeds and the data is safely written to disk, or the request fails and the client must retry or decide to do something else.

It can be difficult to set an absolute definition for durability since different databases have different concepts of what can go wrong with disks (also called a "storage fault model"), or they have no concept at all.

Let's start from the beginning.

An in-memory database

An in-memory database has no durability at all. Here is pseudo-code for an in-memory database service.

db = btree()

def handle_write(req):
  db.update(req.key, req.value)
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

Throughout this post, for the sake of code brevity, imagine that the environment is concurrent and that data races around shared mutable values like db are protected somehow.

Writing to disk

If we want to achieve the most basic level of durability, we can write this database to a file.

f = open("kv.db", "r+b") # Open for both reading and writing
db = btree.init_from_disk(f)

def handle_write(req):
  db.update(req.key, req.value)
  db.write_to_disk(f)
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

btree.write_to_disk will call pwrite(2) under the hood. And we'll assume it does copy-on-write for only changed pages. So imagine we have a large database represented by a btree that takes up 10GiB on disk. With the btree algorithm, if we write a single entry to the btree, often only a single (typically 4KiB) page will get written rather than all pages (holding all values) in the tree. At the same time, in the worst case, the entire tree (all 10GiB of data) may need to get rewritten.
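As a rough sketch (not how any particular btree implements it), page-granular flushing might look something like this, assuming 4KiB pages and a hypothetical dirty-page map tracked by the btree:

import os

PAGE_SIZE = 4096 # 4KiB pages, as assumed above

def flush_dirty_pages(f, dirty_pages):
  # dirty_pages is a hypothetical {page_number: bytes} map of pages
  # modified since the last flush. With copy-on-write, only these
  # pages get rewritten, not the entire 10GiB tree.
  for page_no, data in sorted(dirty_pages.items()):
    assert len(data) == PAGE_SIZE
    os.pwrite(f.fileno(), data, page_no * PAGE_SIZE)
  dirty_pages.clear()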

But none of this code is crash-safe. If the virtual or physical machine this code is running on reboots, the data we wrote to the file may not actually be on disk.

fsync

File data is buffered by the operating system by default. By general consensus, writing data without flushing the operating system buffer is not considered durable. Every so often a new database will show up on Hacker News claiming to beat all other databases on insert speed until a commenter points out the new database doesn't actually flush data to disk.

In other words, the commonly accepted requirement for durability is that not only do you write data to a file on disk but you fsync(2) the file you wrote. This forces the operating system to flush to disk any data it has buffered.

f = open("kv.db", "r+b")
db = btree.init_from_disk(f)

def handle_write(req):
  db.update(req.key, req.value)
  db.write_to_disk(f)
  f.fsync() # Force a flush
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

Furthermore, you must not ignore fsync failure. How you deal with fsync failure is up to you, but exiting immediately with a message that the user should restore from a backup is sometimes considered acceptable.
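For example, here is a minimal sketch of that exit-on-failure strategy (the helper name is mine, not from any particular database):

import os
import sys

def fsync_or_die(f):
  # After a failed fsync the kernel may have already dropped the dirty
  # pages, so retrying can silently lose data. Exiting and telling the
  # user to restore from backup is the conservative option described above.
  try:
    os.fsync(f.fileno())
  except OSError as e:
    sys.stderr.write("fsync failed (%s); restore from backup\n" % e)
    sys.exit(1)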

Databases don't like to fsync because it's slow. Many major databases offer modes where they do not fsync data files before returning a success to a client. Postgres offers this unsafe mode, though it does not default to it and warns against it. MongoDB offers this unsafe mode but does not default to it.

An earlier version of this post said that MongoDB would unsafely flush on an interval. Daniel Gomez Ferro from MongoDB messaged me that while the docs are confusing, the default write concern "majority" does actually imply "j: true" which means data is synchronized (i.e. fsync-ed) before returning a success to a client.

Almost every database trades safety for performance in some regard. For example, few databases other than SQLite and Cockroach default to Serializable Isolation, even though it is commonly agreed that basically no isolation level below Serializable (which all other databases default to) can be reasoned about. Other databases offer Serializable Isolation, they just don't default to it. Because it can be slow.

Group commit

But let's get back to fsync. One way to amortize the cost of fsync is to delay requests so that you write data from each of them and then fsync the data from all requests. This is sometimes called group commit.

For example, we could update the database in-memory but have a background thread serialize to disk and call fsync only every 5ms.

f = open("kv.db")
db = btree.init_from_disk(f)

group_commit_sems = []

@background_worker()
def group_commit():
  while True:
    if clock() % 5ms == 0:
      db.write_to_disk(f)
      f.fsync() # Durably flush for the group
      for sem in group_commit_sems:
        sem.signal()
      group_commit_sems.clear() # Don't signal the same requests again next tick

def handle_write(req):
  db.update(req.key, req.value)
  sem = semaphore()
  group_commit_sems.push(sem)
  sem.wait()
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

It is critical that handle_write waits to return a success until the write is durable via fsync.

So to reiterate, the key idea for durability of a client request is that you have some version of the client message stored on disk durably with fsync before returning a success to a client.

From now on in this post, when you see "durable" or "durability", it means that the data has been written and fsync-ed to disk.

Optimizing durable writes

A key insight is that it's silly to serialize the entire permanent structure of the database to disk every time a user writes.

We could just write the user's message itself to an append-only log. And then only periodically write the entire btree to disk. So long as we have fsync-ed the append-only log file, we can safely return to the user even if the btree itself has not yet been written to disk.

The additional logic this requires is that on startup we must read the btree from disk and then replay the log on top of the btree.

f = open("kv.db", "r+b")
db = btree.init_from_disk(f)

log_f = open("kv.log", "r+b")
l = log.init_from_disk(log_f)
# On startup, replay any log entries newer than what the on-disk btree has seen.
for entry in l.read_logs_from(db.last_log_index):
  db.update(entry.key, entry.value)

group_commit_sems = []

@background_worker()
def group_commit():
  while True:
    if clock() % 5ms == 0:
      log_accumulator = log_page()
      for (log, _) in group_commit_sems:
        log_accumulator.add(log)

      log_f.write(log_accumulator.page()) # Write out all log entries at once
      log_f.fsync() # Durably flush WAL data
      for (_, sem) in group_commit_sems:
        sem.signal()
      group_commit_sems.clear() # These requests are durable; don't signal them again

    if clock() % 1m == 0:
      db.write_to_disk(f)
      f.fsync() # Durably flush db data

def handle_write(req):
  db.update(req.key, req.value)
  sem = semaphore()
  log = req
  group_commit_sems.push((log, sem))
  sem.wait() # This time waiting for only the log to be written and flushed, not the btree.
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

This is a write-ahead log!

Consider a few scenarios. One request writes the smallest key ever seen. And one request within the same millisecond writes the largest key ever seen. Writing these to disk on the btree means modifying at least two pages spread out in space on disk.

But if we only have to durably write these two messages to a log, they can likely both be included in the same log page. ("Likely" so long as key and values are small enough that multiple can fit into the same page.)

That is, it's cheaper to write only these small messages representing the client request to disk. And we save the structured btree persistence for a less frequent durable write.
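Here is a hedged sketch of what the log_page accumulator from the code above might look like, assuming 4KiB pages and a simple length-prefixed entry format (the format is illustrative, not any particular database's):

import struct

PAGE_SIZE = 4096

class log_page:
  def __init__(self):
    self.buf = bytearray()

  def add(self, key, value):
    # Length-prefix each entry: 4-byte key length, 4-byte value length.
    entry = struct.pack(">II", len(key), len(value)) + key + value
    if len(self.buf) + len(entry) > PAGE_SIZE:
      return False # Page is full; the caller flushes and starts a new one.
    self.buf += entry
    return True

  def page(self):
    # Pad to a full page so writes are always page-aligned.
    return bytes(self.buf.ljust(PAGE_SIZE, b"\x00"))

page = log_page()
page.add(b"smallest-key-ever", b"a")
page.add(b"largest-key-ever", b"z") # Both fit in one 4KiB page: one write, one fsync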

Filesystem and disk bugs

Sometimes filesystems will write data to the wrong place. Sometimes disks corrupt data. A solution to both of these is to checksum the data on write, store the checksum on disk, and confirm the checksum on read. This, combined with a background process called scrubbing that validates unread data, can help you learn quickly when your data has been corrupted and you must recover from backup.
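A minimal sketch of both sides of this idea, using CRC32 stored in the last 4 bytes of each page (real systems often use stronger or hardware-accelerated checksums like CRC32C):

import os
import zlib

PAGE_SIZE = 4096

def write_page(fd, page_no, payload):
  # Reserve the last 4 bytes of each page for a CRC32 of the payload.
  payload = payload.ljust(PAGE_SIZE - 4, b"\x00")
  checksum = zlib.crc32(payload).to_bytes(4, "big")
  os.pwrite(fd, payload + checksum, page_no * PAGE_SIZE)

def read_page(fd, page_no):
  # Verify the checksum on every read; a mismatch means the filesystem
  # or disk gave us back something other than what we wrote.
  page = os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)
  payload, stored = page[:-4], int.from_bytes(page[-4:], "big")
  if zlib.crc32(payload) != stored:
    raise IOError("checksum mismatch on page %d: corruption" % page_no)
  return payload

A scrubber is then just a background loop calling read_page over every page, including pages no client has asked for recently.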

MongoDB's default storage engine WiredTiger does checksum data by default.

But some databases famous for integrity do not. Postgres does no data checksumming by default:

By default, data pages are not protected by checksums, but this can optionally be enabled for a cluster. When enabled, each data page includes a checksum that is updated when the page is written and verified each time the page is read. Only data pages are protected by checksums; internal data structures and temporary files are not.

SQLite likewise does no checksumming by default. Checksumming is an optional extension:

The checksum VFS extension is a VFS shim that adds an 8-byte checksum to the end of every page in an SQLite database. The checksum is added as each page is written and verified as each page is read. The checksum is intended to help detect database corruption caused by random bit-flips in the mass storage device.

But even this isn't perfect. Disks and nodes can fail completely. At that point you can only improve durability by introducing redundancy across disks (and/or nodes), for example, via distributed consensus.

Other reasons you need a WAL?

Some databases (like SQLite) require a write-ahead log to implement aspects of ACID transactions. But this need not be a requirement for ACID transactions if you do MVCC (SQLite does not). See my previous post on implementing MVCC for details.

Logical replication (also called change data capture (CDC)) is another interesting feature that requires a write-ahead log. The idea is that the log already preserves the exact order of the changes that affect the database's "state machine". So we could copy these changes out of the system by tracking the write-ahead log, preserving change order, and applying the changes to a foreign system.
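In the pseudocode style of this post, a sketch of the idea might look like the following (entry.index and apply_downstream are stand-ins for whatever the log format and foreign system actually provide):

def replicate(l, last_applied_index, apply_downstream):
  # Tail the write-ahead log from the last entry the foreign system has
  # applied, preserving log order so the downstream "state machine" sees
  # the same sequence of changes as the database itself.
  for entry in l.read_logs_from(last_applied_index):
    apply_downstream(entry.key, entry.value)
    last_applied_index = entry.index
  return last_applied_index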

But again, CDC itself is not about durability. It's an ancillary feature that write-ahead logs make simple.

Conclusion

A few key points. One, durability primarily matters if it is established before returning a success to the client. Two, a write-ahead log is a cheap way to get durability.

And finally, durability is a spectrum. You need to read the docs for your database to understand what it does and does not do.

June 27, 2024

Working with JSON and Graphs in CedarDB

While relational database systems are still by far the most popular database systems in use today, other data models are growing in both importance and popularity. The perceived simplicity and greater flexibility of semi-structured data, such as JSON, have made it the first choice for many application developers, especially for web interfaces and log data. As a result, much of the data being generated today is semi-structured. And the shift to new data models does not stop at data generation. Analysts are increasingly turning to graphs to better capture real-world relationships in their data.

Announcing Vitess 20

We're delighted to announce the release of Vitess 20 along with version 2.13.0 of the Vitess Kubernetes Operator. Version 20 focuses on usability and maturity of existing features, and continues to build on the solid foundation of scalability and performance established in previous versions. Our commitment remains steadfast in providing a powerful, scalable, and reliable solution for your database scaling needs. What's new in Vitess 20 includes query compatibility work: enhanced DML support, Vindex hints, and extended support for various sharded update and delete operations.