July 10, 2024
July 08, 2024
Sharding strategies: directory-based, range-based, and hash-based
July 07, 2024
A reawakening of systems programming meetups
This year has seen a resurgence in really high quality systems programming meetups. Munich Database Meetup, Berlin Systems Group, SF Distributed Systems Meetup, NYC Systems, Bengaluru Systems, to name a few.
This post summarizes a bit of disappointing recent tech meetup history, the new trend of excellent systems programming meetups, and ends with some encouragement and guidance for running your own systems programming events.
I will be a little critical in this post but I want to preface by saying: organizing meetups is really tough! It takes a lot of work and I have a huge amount of respect for meetup organizers even when their meetup style did not resonate with me.
Although much of this post talks about NYC Systems, the reason I think this post is worth writing is because so many other meetups in a similar vein popped up. I hope to encourage these other meetups and to encourage folks in other major metros (London, for example) to start similar meetups.
Meetups
I used to attend a bunch of meetups before the pandemic. But I quickly got disillusioned. Almost every meetup was varying degrees of startups pitching their product. The last straw for me was sitting through a talk at a JavaScript meetup that was by a devrel employee of a startup who literally gave a tutorial for their product.
There were also some pretty intelligent meetups like the New York Haskell Users Group and the New York Emacs Meetup. But not being an expert in either domain, and the attendees almost solely appearing to be experts, I didn't particularly enjoy going.
There were a couple of meetups that felt inclusive for various skill-levels of attendees yet still went into interesting depth. Specifically, New York Linux User Group and Papers We Love NYC.
These meetups were exceptional because they were language- and framework-agnostic, they would start broad to give you background, but then go deep into a topic. Maybe you only understood 50% of what was covered. But you get exposed to something new from an expert in that domain.
Unfortunately, the pandemic happened and these two excellent meetups basically have not come back.
A couple of students in Munich
The pandemic ended and I tried a couple of meetups I thought might be better quality. Rust and Go. But they weren't much better than I remembered. People would give a high level talk and brush over all the interesting concepts.
I had been thinking of doing an in-person talk series since 2022.
If I put together a systems/databases/distributed systems meetup in NYC (a physical meetup, not Zoom), who'd be interested (in attending, or presenting, or helping me organize, or donating space)?
— Phil Eaton (@eatonphil) September 27, 2022
No promises!
But I was busy with TigerBeetle until December of 2023 when I was messaged on LinkedIn by Georg Kreuzmayr, a graduate student at Technical University of Munich (TUM).
Georg and his friends, fellow graduate students at TUM, started a database club: TUMuchData. We got to talking about opportunities for collaboration and I started feeling a bit embarrassed that a graduate student had more guts than I had to get back onto the meetup organizer wagon.
A week later, with assurance from Justin Jaffray that at least he would show up with me if no one else did, I started the NYC Systems Coffee Club to bring together folks in NYC interested in any topic of systems programming (e.g. compilers, databases, web browser internals, distributed systems, formal methods, etc.). To bring them together in a completely informal setting for coffee at 9am in the morning in a public space in midtown Manhattan.
Trying something new! If you're a dev in NYC working
— Phil Eaton (@eatonphil) December 11, 2023
on (or interested in) systems programming, grab a coffee and come hang out at 1 Bryant Park (indoor space) this Thursday 9AM - 9:30AM.
See post for details and fill out the Google Form or DM me!https://t.co/A4bzcPGy6x pic.twitter.com/n1ECMd59ev
I set up that linked web page and started collecting subscribers to the club via Google Form. Once a month I'd send an email out to the list asking for RSVPs to this month's coffee club. The first 20 to respond would get a calendar invite.
And about the same time I started asking around on Twitter/LinkedIn if someone would be interested in co-organizing a new systems programming meetup in NYC. Angelo Saraceno immediately took me up on the idea and we met up.
NYC Systems
We agreed on the premise: this would be a language- and framework-agnostic meetup that was focused on engineering challenges, not product pitches. It would be 100% for the sake of corporate marketing, but corporate marketing of the engineering team, not the product.
NYC Systems was born!
We'd find speakers who could start broad and dive deep into some interesting aspect of databases, programming languages, distributed systems, and so on. Product pitches were necessary to establish a context, but the focus of the talk would be about some interesting recent technical challenge and how they dealt with it.
We'd schedule talks only every other month to ease our own burden in organizing and finding great speakers.
Once Angelo and I had decided to go forward, the next two challenges were finding speakers and finding a venue. Thanks to Twitter and LinkedIn, finding speakers turned out to be the easy part.
It was harder to find a venue. It was surprisingly challenging to find a company in NYC with a shared vision that the important thing about being associated with a meetup like this is to be associated with the quality of speakers and audience we can bring in by not allowing transparent product pitches.
Almost every company in Manhattan with space we spoke with had a requirement that they have their own speaker each night. That seemed like a bad idea.
I think it was especially challenging to find a company willing to relax about branding requirements like this because we were a new meetup.
It was pretty frustrating not to find a sympathetic company with space in Manhattan. And the only reason we didn't give up was because Angelo was so adament that this kind of meetup actually happen. It's always best to start something new with someone else for this exact reason. You can keep each other going.
In the end we went with the company that did not insist on their own speaker or their own branding. A Brooklyn-based company whose CEO immediately got in touch with me that they wanted to host us, Trail of Bits.
How it works
To keep things easy, I set up a web page on my personal site with information about the meetup. (Eventually we moved this to nycsystems.xyz.) I set up a Google Form to collect emails for a mailing list. And we started posting about the group on Twitter and LinkedIn.
Very pleased to share the first NYC Systems Talks are taking place next Thursday Feb 22nd 6PM. Hosted by @trailofbits, with @paulgb and @StefanKarpinski speaking.
— Phil Eaton (@eatonphil) February 15, 2024
Space is not infinite, fill out the Google Form if you can attend and would like an invite!https://t.co/jNssr5v1kJ
We published the event calendar in advance (an HTML table on the website) and announced each event's speakers a week in advance of the event. I'd send another Google Form to the mailing list taking RSVPs for the night. The first 60 people to respond got a Google Calendar invite.
It's a bit of work, sure, but I'd do anything to avoid Meetup.com.
It is interesting to see every new systems programming meetup also not pick Meetup.com. The only one that went with it, Munich Database Meetup, is a revival of an existing group, the Munich NoSQL Meetup and presumably they didn't want to give up their subscribers. Though most others use lu.ma.
The mailing list is now about 400+ people. And in each event RSVP we have a wait list of 20-30 people. Of course although 60 people say Yes initially, by the time of the event we have typically gotten about 50 people in attendance.
At each event, Trail of Bits provided screens, chairs, food, and drink. Angelo had recording equipment so he took over audio/video capturing (and later editing and publishing).
After each event we'd publish talk videos to our @NYCSystems Youtube.
Network effects
In March 2024, the TUMuchData folks joined Alex Petrov's Munich NoSQL Meetup to form the Munich Database Meetup. In May, Kaivalya Apte and Manish Gill started the Berlin Systems Group, inspired by Alex and the Munich Database Meetup.
I want to start a Berlin Database/Storage systems group, where we have regular meetups, discussions and talks.
— Kaivalya Apte - The Geek Narrator (@thegeeknarrator) May 15, 2024
WDYT? @mgill25 @mehd_io @ClickHouseDB @SnowflakeDB @awscloud @GoogleDE @TUBerlin
Can I get some support? Who else would be interested? #Databases
Thanks…
In May 2024, two PhD students in the San Francisco Bay Area, Shadaj Laddad and Conor Power, started the SF Distributed Systems meetup.
We’re super excited to be organizing a new SF Distributed Systems meetup NEXT WEEK! Our first meetup features @julianhyde and @conor_power23 presenting work on extending SQL and applying algebraic properties, sign up at https://t.co/d2lLDaQ5iJ
— Shadaj Laddad (@ShadajL) May 15, 2024
And in July 2024, Shraddha Agrawal, Anirudh Rowjee and friends kicked off the first Bengaluru Systems Meetup.
Are you ready, Systems Enthusiasts of Bengaluru?
— Bengaluru Systems Meetup (@BengaluruSys) July 4, 2024
Speaking at our first-ever meetup on 6th July, we have:@simsimsandy with "Learn about the systems that power GenAI applications" and @vivekgalatage with "The Browser Backstage: Performance vs Security"
(talks linked below!)
Suggestions
First off, don't pay for anything yourself. Find a company who will host. At the same time, don't feel the need to give in too much to the demands of the company. I'd be happy to help you think through how to talk about the event with companies. It is mutually beneficial for them to get to give a 5-minute hiring/product pitch and not need to do extensive branding nor to give a 30-minute product tutorial.
Second, keep a bit of pressure on speakers to not do an overview talk and not to do a product pitch. Suggest that they tell the story of some interesting recent bug or interesting recent feature. What happened? Why was it hard? What did you learn?
Focusing on these types of talks will help you get a really interesting audience.
I have been continuously surprised and impressed at the folks who show up for NYC Systems. It's a mix of technical founders in the systems space, pretty experienced developers in the systems space, graduate students, and developers of all sorts.
I am certain we can only get these kinds of folks to show up because we avoid product pitch-type talks.
Third, finding speakers is still hard! The best approach so far has been to individually message folks in industry and academia who hang out on Twitter. Sending out a public call is easy but doesn't often pan out. So keep an eye on interesting companies in the area.
Another avenue I've been thinking about is messaging VC connections to ask them if they know any engineers/technical founders/CTOs in the area who could give an interesting technical talk.
Fourth, speak with other organizers! I finally met Alex Petrov in person last month and we had a great time talking about the challenges and joys of organizing really high quality meetups.
I'm always happy to chat, DMs are open.
New post telling a bit of the history behind https://t.co/NEh1tm8v3Q; why it only exists due to folks like @georg_kreuzmayr and @ngeloxyz; the explosion of systems meetups around the world; and encouragement and suggestions for future organizers!https://t.co/dwe4TtmXKK pic.twitter.com/ZMLkVYdZDJ
— Phil Eaton (@eatonphil) July 7, 2024
July 04, 2024
Postgres Realtime location sharing with MapLibre
July 03, 2024
Multi-tenant analytics for SaaS applications
July 01, 2024
A write-ahead log is not a universal part of durability
A database does not need a write-ahead log (WAL) to achieve durability. A database can write its long-term data structure durably to disk before returning to a client. Granted, this is a bad idea! And granted, a WAL is critical for durability by design in most databases. But I think it's helpful to understand WALs by understanding what you could do without them.
So let's look at what terrible design we can make for a durable database that has no write-ahead log. To motivate the idea of, and build an intuition for, a write-ahead log.
Thank you to Alex Miller for reviewing a version of this post.
But first, what is durability?
Durability
Durability happens in the context of a request a client makes to a data system (either an embedded system like SQLite or RocksDB or a standalone system like Postgres). Durability is a spectrum of guarantees the server provides when a client requests to write some data: that either the request succeeds and the data is safely written to disk, or the request fails and the client must retry or decide to do something else.
It can be difficult to set an absolute definition for durability since different databases have different concepts of what can go wrong with disks (also called a "storage fault model"), or they have no concept at all.
Let's start from the beginning.
An in-memory database
An in-memory database has no durability at all. Here is pseudo-code for an in-memory database service.
db = btree()
def handle_write(req):
db.update(req.key, req.value)
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
Throughout this post, for the sake of code brevity, imagine that the
environment is concurrent and that data races around shared mutable
values like db
are protected somehow.
Writing to disk
If we want to achieve the most basic level of durability, we can write this database to a file.
f = open("kv.db")
db = btree.init_from_disk(f)
def handle_write(req):
db.update(req.key, req.value)
db.write_to_disk(f)
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
btree.write_to_disk
will call
pwrite(2) under the hood. And
we'll assume it does copy-on-write for only changed pages. So imagine
we have a large database represented by a btree that takes up 10GiB on
disk. With the btree algorithm, if we write a single entry to the
btree, often only a single (often 4Kib) page will get written rather
than all pages (holding all values) in the tree. At the same time, in
the worst case, the entire tree (all 10GiB of data) may need to get
rewritten.
But this code isn't crash-safe. If the virtual or physical machine this code is running on reboots, the data we wrote to the file may not actually be on disk.
fsync
File data is buffered by the operating system by default. By general consensus, writing data without flushing the operating system buffer is not considered durable. Every so often a new database will show up on Hacker News claiming to beat all other databases on insert speed until a commenter points out the new database doesn't actually flush data to disk.
In other words, the commonly accepted requirement for durability is that not only do you write data to a file on disk but you fsync(2) the file you wrote. This forces the operating system to flush to disk any data it has buffered.
f = open("kv.db")
db = btree.init_from_disk(f)
def handle_write(req):
db.update(req.key, req.value)
db.write_to_disk(f)
f.fsync() # Force a flush
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
Furthermore you must not ignore fsync failure. How you deal with fsync failure is up to you, but exiting immediately with a message that the user should restore from a backup is sometimes considered acceptable.
Databases don't like to fsync because it's slow. Many major databases offer modes where they do not fsync data files before returning a success to a client. Postgres offers this unsafe mode, though does not default to it and warns against it. MongoDB offers this unsafe mode but does not default to it.
An earlier version of this post said that MongoDB would unsafely flush on an interval. Daniel Gomez Ferro from MongoDB messaged me that while the docs are confusing, the default write concern "majority" does actually imply "j: true" which means data is synchronized (i.e. fsync-ed) before returning a success to a client.
Almost every database trades safety for performance in some regard. For example, few databases but SQLite and Cockroach default to Serializable Isolation. While it is commonly agreed that basically no level below Serializable Isolation (that all other databases default to) can be reasoned about. Other databases offer Serializable Isolation, they just don't default to it. Because it can be slow.
Group commit
But let's get back to fsync. One way to amortize the cost of fsync is to delay requests so that you write data from each of them and then fsync the data from all requests. This is sometimes called group commit.
For example, we could update the database in-memory but have a background thread serialize to disk and call fsync only every 5ms.
f = open("kv.db")
db = btree.init_from_disk(f)
group_commit_sems = []
@background_worker()
def group_commit():
for:
if clock() % 5ms == 0:
db.write_to_disk(f)
f.fsync() # Durably flush for the group
for sem in group_commit_sems:
sem.signal()
def handle_write(req):
db.update(req.key, req.value)
sem = semaphore()
group_commit_sems.push(sem)
sem.wait()
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
It is critical that handle_write
waits to return a success until the
write is durable via fsync.
So to reiterate, the key idea for durability of a client request is that you have some version of the client message stored on disk durably with fsync before returning a success to a client.
From now on in this post, when you see "durable" or "durability", it means that the data has been written and fsync-ed to disk.
Optimizing durable writes
A key insight is that it's silly to serialize the entire permanent structure of the database to disk every time a user writes.
We could just write the user's message itself to an append-only log. And then only periodically write the entire btree to disk. So long as we have fsync-ed the append-only log file, we can safely return to the user even if the btree itself has not yet been written to disk.
The additional logic this requires is that on startup we must read the btree from disk and then replay the log on top of the btree.
f = open("kv.db", "rw")
db = btree.init_from_disk(f)
log_f = open("kv.log", "rw")
l = log.init_from_disk()
for log in l.read_logs_from(db.last_log_index):
db.update(log.key, log.value)
group_commit_sems = []
@background_worker()
def group_commit():
for:
log_accumulator = log_page()
if clock() % 5ms == 0:
for (log, _) in group_commit_sems:
log_accumulator.add(log)
log_f.write(log_accumulator.page()) # Write out all log entries at once
log_f.fsync() # Durably flush wal data
for (_, sem) in group_commit_sems:
sem.signal()
if clock() % 1m == 0:
db.write_to_disk(f)
f.fsync() # Durably flush db data
def handle_write(req):
db.update(req.key, req.value)
sem = semaphore()
log = req
group_commit_sems.push((log, sem))
sem.wait() # This time waiting for only the log to be written and flushed, not the btree.
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
This is a write-ahead log!
Consider a few scenarios. One request writes the smallest key ever seen. And one request within the same millisecond writes the largest key ever seen. Writing these to disk on the btree means modifying at least two pages spread out in space on disk.
But if we only have to durably write these two messages to a log, they can likely both be included in the same log page. ("Likely" so long as key and values are small enough that multiple can fit into the same page.)
That is, it's cheaper to write only these small messages representing the client request to disk. And we save the structured btree persistence for a less frequent durable write.
Filesystem and disk bugs
Sometimes filesystems will write data to the wrong place. Sometimes disks corrupt data. A solution to both of these is to checksum the data on write, store the checksum on disk, and confirm the checksum on read. This combined with a background process called scrubbing to validate unread data can help you learn quickly when your data has been corrupted and you must recover from backup.
MongoDB's default storage engine WiredTiger does checksum data by default.
But some databases famous for integrity do not. Postgres does no data checksumming by default:
By default, data pages are not protected by checksums, but this can optionally be enabled for a cluster. When enabled, each data page includes a checksum that is updated when the page is written and verified each time the page is read. Only data pages are protected by checksums; internal data structures and temporary files are not.
SQLite likewise does no checksumming by default. Checksumming is an optional extension:
The checksum VFS extension is a VFS shim that adds an 8-byte checksum to the end of every page in an SQLite database. The checksum is added as each page is written and verified as each page is read. The checksum is intended to help detect database corruption caused by random bit-flips in the mass storage device.
But even this isn't perfect. Disks and nodes can fail completely. At that point you can only improve durability by introducing redundancy across disks (and/or nodes), for example, via distributed consensus.
Other reasons you need a WAL?
Some databases (like SQLite) require a write-ahead log to implement aspects of ACID transactions. But this need not be a requirement for ACID transactions if you do MVCC (SQLite does not). See my previous post on implementing MVCC for details.
Logical replication (also called change data capture (CDC)) is another interesting feature that requires a write-ahead log. The idea is that the log already preserves the exact order and changes that affect the database's "state machine". So we could copy these changes out of the system by tracking the write-ahead log, preserving change order, and apply these changes to a foreign system.
But again, just CDC is not about durability. It's an ancillary feature that write-ahead logs make simple.
Conclusion
A few key points. One, durability primarily matters if it is established before returning a success to the client. Second, a write-ahead log is a cheap way to get durability.
And finally, durability is a spectrum. You need to read the docs for your database to understand what it does and does not.
Here's a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL. And to explore the meaning of durability.https://t.co/nzS2pMz22z pic.twitter.com/m1n9x8CNcp
— Phil Eaton (@eatonphil) July 1, 2024
June 27, 2024
Announcing Vitess 20
Working with JSON and Graphs in CedarDB
Working with JSON and Graphs in CedarDB
While relational database systems are still by far the most popular database systems in use today, other data models are growing in both importance and popularity. The perceived simplicity and greater flexibility of semi-structured data, such as JSON, has made it the first choice for many application developers, especially for web interfaces and log data. As a result, much of the data being generated today is semi-structured. And the shift to new data models does not stop at data generation. Analysts are increasingly turning to graphs to better capture real-world relationships in their data.