a curated list of database news from authoritative sources

March 15, 2024

Zig, Rust, and other languages

Having worked a bit in Zig, Rust, Go, and now C, I think there are a few common topics worth having a fresh conversation on: automatic memory management, the standard library, and explicit allocation.

Zig is not a mature language. But it has made enough useful choices for a number of companies to invest in it and run it in production. The useful choices make Zig worth talking about.

Go and Rust are mature languages. But they have both made questionable choices that seem worth talking about.

All of these languages are developed by highly intelligent folks I personally look up to. And your choice to use any one of these is certainly fine, whichever it is.

The positive and negative choices particular languages made, though, are worth talking about as we consider what a systems programming language 10 years from now would look like. Or how these languages themselves might evolve in the next 10 years.

My perspective is mostly building distributed databases. So the points that I bring up may have no relevance to the kind of work you do, and that's alright. Moreover, I'm already aware most of these opinions are not shared by the language maintainers, and that's ok too. I am not writing to convince anyone.

Automatic memory management

One of my bigger issues with Zig is that it doesn't support RAII. You can defer cleanup to the end of a block, and that covers half of the problem. But only RAII allows for smart pointers and automatic (not manual) reference counting. RAII is an excellent option to default to, but in Zig you aren't allowed to. In contrast, even C "supports" automatic cleanup, via compiler extensions like the GCC/Clang cleanup attribute.
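
For example, here's a minimal sketch of that C extension (assuming GCC or Clang; the helper name is mine):

```c
#include <stdio.h>
#include <stdlib.h>

/* Runs automatically when the annotated variable goes out of scope;
   it receives a pointer to the variable itself. */
static void free_buffer(char **p) {
    free(*p);
}

int main(void) {
    /* GCC/Clang extension: RAII-style scoped cleanup in C. */
    __attribute__((cleanup(free_buffer))) char *buf = malloc(64);
    if (buf == NULL)
        return 1;
    snprintf(buf, 64, "hello");
    puts(buf);
    return 0; /* free_buffer(&buf) runs here */
}
```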

But most of the time, arenas are fine. Postgres is written in C and memory is almost entirely managed through nested arenas (called "memory contexts") that get cleaned up when some subset of a task finishes, recursively. Zig has builtin support for arenas, which is great.
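
Here's roughly what that looks like inside Postgres (a sketch using the real MemoryContext APIs from the Postgres source):

```c
#include "postgres.h"
#include "utils/memutils.h"

static void
process_task(void)
{
    /* Create a child arena ("memory context") under the current one. */
    MemoryContext taskcxt = AllocSetContextCreate(CurrentMemoryContext,
                                                  "task context",
                                                  ALLOCSET_DEFAULT_SIZES);
    MemoryContext oldcxt = MemoryContextSwitchTo(taskcxt);

    /* Every palloc() now draws from taskcxt. */
    char *scratch = palloc(1024);

    (void) scratch; /* ... do the task ... */

    MemoryContextSwitchTo(oldcxt);
    /* One call frees everything allocated in taskcxt, recursively. */
    MemoryContextDelete(taskcxt);
}
```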

Standard library

It seems regrettable that some languages have been shipping smaller standard libraries. A small standard library encourages users of the language to install more transitively-unvetted third-party libraries, which increases build time and build flakiness, and which invites bitrot over time as unnecessary breaking changes occur.

People have been making jokes about node_modules for a decade now, but this problem is just as bad in Rust codebases I've seen. And to a degree it happens in Java and Go as well, though their larger standard libraries allow you to get further without dependencies.

Zig has a good standard library, one that may be Go- or Java-tier in a few years. But one goal of its package manager seemed to be to allow the standard library to be broken up and made smaller; for example, moving JSON support out of the standard library into a package. I don't know if that is actually the planned direction. I hope not.

Having a large standard library doesn't mean the programmer shouldn't be able to swap out implementations easily as needed. All that requires is for the standard library to define an interface alongside its default implementation.

The small size of the standard library doesn't just affect developers using the language; it even encourages developers of the language itself to depend on libraries owned by individuals.

Take a look at the transitive dependencies of an official Node.js package like node-gyp. Is it really the ideal outcome of a small standard library that official libraries depend on libraries owned by individuals, like env-paths, which hasn't been modified in 3 years? It's 68 lines of code. Is it not safer at this point to vendor that code, i.e. copy the env-paths code into node-gyp?

Similarly, if you go looking for compression support in Rust, there's none in the standard library. But you may notice the flate2-rs repo under the official rust-lang GitHub namespace. Look at its transitive dependencies: flate2-rs depends on (an individual's) miniz_oxide, which depends on (an individual's) adler, which hasn't been updated in 4 years. That's 300 lines of code including tests. Why not vendor this code? It's the habits a small standard library builds that seem to encourage everyone not to.

I don't mean that these necessarily constitute a supply-chain risk. I'm not talking about left-pad. But the pattern is sort of clear: even official packages may end up depending on third-party packages, because the commitment to a small standard library meant omitting stuff like compression, checksums, and common OS paths.

It's a tradeoff and maybe makes the job of the standard library maintainer easier. But I don't think this is the ideal situation. Dependencies are useful but should be kept to a reasonable minimum.

Hopefully languages end up more like Go than like Rust in this regard.

Explicit allocation

When folks discuss the Zig standard library's pattern of requiring an allocator argument for every method that allocates, they often talk about the benefit of swapping out allocators or the benefit of being able to handle OOM failures.

Both of these seem pretty niche to me. For example, in Zig tests you are encouraged to pass around a debug allocator that tells you about memory leaks. But this doesn't seem too different from compiling a C project with a debug allocator, or compiling with different sanitizers on and running tests against the binary produced. In both cases you mostly deal with allocators at a global level, depending on the environment you're running the code in (production or tests).

The real benefit of explicit allocation, to me, is much more mundane: you basically can't write a method in Zig without acknowledging allocations.

This is particularly useful for hotpath code. Take an iterator, for example, with a new() method, a next() method, and a done() method. In most languages, it's basically impossible at the syntax or compiler level to know whether you are allocating in the next() method. You may know because you know the behavior of all the code in next() by heart. But that won't always be the case.

Zig is practically alone in that if you write the next() method and don't pass an allocator to anything called in the next() body, nothing in that next() method will allocate.

In any other language it might not be until you run a profiler that you notice an allocation that should have been done once in new() accidentally ended up in next() instead.
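
Here's a partial C sketch of the convention (the names are hypothetical; in Zig, std.mem.Allocator plays the role of the Allocator struct):

```c
#include <stddef.h>

/* A minimal allocator handle, standing in for Zig's std.mem.Allocator. */
typedef struct {
    void *(*alloc)(void *ctx, size_t n);
    void *ctx;
} Allocator;

typedef struct {
    const int *items;
    size_t len;
    size_t pos;
} Iter;

/* new() takes an allocator: the signature admits allocation. */
static Iter *iter_new(Allocator a, const int *items, size_t len) {
    Iter *it = a.alloc(a.ctx, sizeof(Iter));
    if (it != NULL) {
        it->items = items;
        it->len = len;
        it->pos = 0;
    }
    return it;
}

/* next() takes no allocator: under the convention that all allocation
   goes through an Allocator, nothing in this body can allocate. */
static const int *iter_next(Iter *it) {
    return it->pos < it->len ? &it->items[it->pos++] : NULL;
}
```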

On the other hand, for all the same reasons, writing Zig is kind of a pain because everything takes an allocator!

Explicit allocation is not intrinsic to Zig the language; it is a convention that is prevalent in the standard library. There is still a global allocator, and any user of Zig could decide to use it, at which point you've got implicit allocation again. So explicit allocation as a convention isn't a perfect solution.

But by default it gives you a level of awareness of allocations you just can't get from typical Go or Rust or C code, depending on the project's practices. Perhaps it's possible to swap out the Go, Rust, or C standard library for one where all functions that allocate do require an allocator.

But explicitly passing allocators is still sort of a visual hack.

I think the ideal situation in the future will be that every language supports annotating blocks of code as must-not-allocate, or something along those lines. Either the compiler will enforce this and fail the build if you seem to allocate in a block marked must-not-allocate, or the program will panic at runtime so you can catch the allocation in tests.
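
As a rough illustration of the runtime-checked version, here's a hypothetical C sketch (the guard, wrapper, and macros are all invented for illustration, and they assume the project routes every allocation through the wrapper):

```c
#include <stdio.h>
#include <stdlib.h>

/* Depth counter for nested must-not-allocate blocks. */
static int no_alloc_depth = 0;

/* The project's allocation wrapper: aborts if called inside a
   must-not-allocate block, so tests catch accidental allocations. */
static void *xmalloc(size_t n) {
    if (no_alloc_depth > 0) {
        fprintf(stderr, "allocation inside must-not-allocate block\n");
        abort();
    }
    return malloc(n);
}

#define MUST_NOT_ALLOCATE_BEGIN() (no_alloc_depth++)
#define MUST_NOT_ALLOCATE_END()   (no_alloc_depth--)

int main(void) {
    void *ok = xmalloc(16); /* fine: outside any guarded block */
    free(ok);

    MUST_NOT_ALLOCATE_BEGIN();
    /* Any xmalloc() call here would abort the program. */
    MUST_NOT_ALLOCATE_END();
    return 0;
}
```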

This would be useful beyond static programming languages. It would be as interesting to annotate blocks in JavaScript or Python as must-not-allocate too.

Otherwise the current state of things is that you'd normally configure this sort of thing at the global level. Saying "there must not be any allocations in this entire program" just doesn't seem as useful in general as being able to say "there must not be any allocations in this one block".

Optional, not required, allocator arguments

Rust has nascent support for passing an allocator to methods that allocate. But it's optional. From what I understand, C++ STL is like this too.

These are both super useful for programming extensions. And it's one of the reasons I think Zig makes a ton of sense for Postgres extensions specifically: it was only and always built to run in an environment with someone else's allocator.

Praise for Zig, Rust, and Go tooling

All three of these languages have really great first-party tooling, including a build system, package management, test runners, and formatters. The idea that the language should provide a great environment to code in (end-to-end) makes things simpler and nicer for programmers.

Meandering non-conclusion

Use the language you want to use. Zig and Rust are both nice alternatives to writing vanilla C.

On the other hand, I've been pleasantly surprised by writing Postgres C, and by how high-level it is. It's almost a separate language, since you're often dealing with user-facing constructs, like Postgres's Datum objects, which represent what you might think of as a cell in a Postgres database. And you can use all the same functions provided for Postgres SQL to work with Datums, but from C.
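
For instance, here's a small sketch of calling the builtin behind SQL's numeric addition from C (DirectFunctionCall2 and numeric_add are real Postgres fmgr facilities; the wrapper name is mine):

```c
#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgrprotos.h" /* prototypes for builtins like numeric_add */

/* Add two SQL numeric values from C by invoking the same builtin
   that backs the SQL + operator for numerics. */
static Datum
add_numerics(Datum a, Datum b)
{
    return DirectFunctionCall2(numeric_add, a, b);
}
```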

I've also been able to work a bit on Postgres extensions in Rust with pgrx lately, which I hope to write about soon. And when I saw pgzx for writing Postgres extensions in Zig, I was excited to spend some time with that too.

March 11, 2024

Iterating terabyte-sized ClickHouse® tables in production

ClickHouse schema migrations can be challenging even on batch systems. But when you're streaming 100s of MB/s, it's a whole different beast. Here's how we make schema changes on a large ClickHouse table deployed across many clusters while streaming… without missing a bit.

First month on a database team

A little over a month ago, I joined EnterpriseDB to work on a distributed Postgres product (PGD). The process of onboarding myself has been pretty similar at each company over the last decade, though I think I've gotten better at it. The process is of course influenced by the team, and my coworkers have been excellent. Still, I wanted to share my thought process and personal strategies.

Avoid, at first, what is always challenging

The trickiest things at any company are the people, the organization, and the processes. What code exists? How does it fit together? Who owns what? How can I find easy code issues to tackle? How do I know what's important (so I can avoid picking it up and becoming a bottleneck)?

But also, in the first few days or weeks you aren't exactly expected to contribute meaningfully to features or bug fixes. Your sprint contributions are not tracked too closely.

The combination of 1) what to avoid and 2) the sprint freedom you have points to a few interesting and valuable areas to work on by yourself: the build process, tests, running the software, and docs.

But code need not be ignored either. Some good areas for first code contributions include user configuration code, error messages, and stale code comments.

What follows are some little first-day, first-week, and first-month projects I went through to bootstrap my understanding of the system.

Build process

First off, where is the code and how do you build it? This requires you to have all the relevant dependencies. Much of my work is on a Postgres extension, which meant having a local Postgres development environment, with gcc, gmake (on macOS), Perl, and so on. Furthermore, PGD is a pretty mature product, so it supports building against multiple Postgres distributions. Can I build against all of them?

The easiest situation is when there are instructions for all of this, linked directly from your main repo. When I started, the instructions did exist, but in a variety of places. So over the first week I collected everything I had learned about building the system: with dependencies, across distributions, and with various important flags (debug mode on, asserts enabled, etc.). I finished the first week by writing a little internal blog post called "Hacking on PGD".

I hadn't yet figured out the team processes, so I didn't want to bother anyone by trying to get this "blog post" committed anywhere as official internal documentation. Maybe there already was a good doc that I just hadn't noticed. So I published it as a private Confluence page and shared it in the private team Slack. If anyone else benefited from it, great! Otherwise, I knew I'd want to refer back to it.

This is an important attitude, I think. It can be hard to tell what others will benefit from. If you get into the habit of writing things down for your own sake, while making them available internally, it's likely others will benefit too. This is something I've learned from years of blogging publicly outside of work.

Moreover, the simple act of writing a good post establishes you as something of an authority. This is useful for yourself, if no one else.

Writing a good post

Let's get distracted here for a second. One of the most important things in documentation, I think, is documenting not just what does exist but what doesn't. If you had to take a particular path to get something to work, did you try other paths that didn't work? It can be extremely useful to figure out what exactly is required for something.

Was there a flag you tried building with, but never tried building without? Try again without it and make sure it was necessary. Was there some step you executed before the build succeeded, but you can't remember whether it was actually required for the build to succeed?

It's difficult to explain why I think this sort of precision is useful but I'm pretty sure it is. Maybe because it builds the habit of not treating things as magic when you can avoid it. It builds the habit of asking questions (if only to yourself) to understand and not just to get by.

Static analysis? Dynamic analysis?

Going back to builds, another aspect to consider is static and dynamic analysis. Are there special steps to using gdb or valgrind or other analyzers? Are you using them already? Can you get them running locally? Has any of this been documented?

Maybe the answer to all of those is yes, or maybe none of those are relevant but there are likely similar tools for your ecosystem. If analysis tools are relevant and no one has yet explored them, that's another very useful area to explore as a newcomer.

Testing

After I got the builds working, I felt the obvious next step was to run tests. But what tests exist? Are there unit tests? Integration tests? Anything else? Moreover, is there test coverage? I was certain I'd be able to find some low-hanging contributions to make if I could find some files with low test coverage.

Alas, my certainty hit a wall: there are in fact many types of integration tests, and they already provide coverage. They just don't all report coverage.

The easiest ways to report coverage (with gcov) only covered the integration tests we run locally. More integration tests run in cloud environments, and getting coverage reports from there to merge with my local coverage files would have required more knowledge of people and processes, exactly the areas I didn't want to be forced into too quickly.

So coverage wasn't a good route to go. But around this time, I noticed a ticket asking for a simple change to user configuration code. I was able to make the change pretty quickly and wanted to add tests. We have our own test framework built on top of Postgres's powerful Perl test framework, but it was a little difficult to figure out how to use either of them.

So I copied code from other tests and pared it down until I got the smallest version of test code I could. This took maybe a day or two of tweaking lines and rerunning tests, since I didn't understand everything that was and wasn't required. Also, it's Perl, and I'd never written Perl before, so that took a bit of time and ChatGPT. (Arrays, man.)

In the end, though, I was able to collect my learnings into another internal Confluence post, this one about how to write tests, how to debug tests, and how to do common things within tests (for example, ensuring a Postgres log line was output). I published this post as well and shared it in the team Slack.

Running

I had PGD built locally and was able to run integration tests locally, but I still hadn't gotten a cluster running, nor played with the eventual-consistency demos I knew we supported. We had a great quickstart that ran through all the manual steps of getting a two-node cluster up. This was a distillation, for devs, of a more elaborate process we give to customers in a production-quality script.

But I was looking for something in between a production-quality script and manually initializing a local cluster. And I also wanted to exercise my understanding of our test process. So I ported our quickstart to our integration test framework and made a PR with this new test, eventually merging it into the repo. I also wrote a minimal Python script for bringing up a local cluster, and I've got an open PR to add that script to the repo. Maybe I'll learn that such a simple script already exists, and that's fine!

Docs

The entire time, as I'd been trying to build, test, and run PGD, I was also trying to understand our terminology and architecture by going through our public docs. I came out of this with a lot of questions, which I'd ask in the team channel.

Not to toot my own horn, but I think it's somewhat of a superpower to be able, and willing, to ask "dumb questions" in a group setting. That's how I frame it, anyway. "Dumb question: what does X mean in this paragraph?" Or, "Dumb question: when we say there's a performance improvement because of Y, what's the intuition here?" Because of the time spent here, I was able to make a few more docs contributions as I read through the docs.

You have to balance where you ask your dumb questions, though. Asking them of a single person doesn't benefit the team, while asking them in too wide a group is sometimes bad politics. Asking "dumb questions" in front of your team seems to have the best bang for the buck.

But maybe the more important contributions came once I got more comfortable with the team and proposed merging my personal, internal Confluence blog posts into the repo as docs. In a number of cases, what I wrote about indeed hadn't been concisely collected before and thus was useful to have as team documentation.

Even more challenging was trying to distill (a chunk of) the internal architecture. Only after following many varied internal docs and videos, and tracing numerous code paths, was I able to propose an architecture diagram outlining the major components and the communication between them, with their differing formats (WAL records, internal enums, etc.) and means of communication (RPC, shared memory, etc.). This architecture diagram is still in review and may be totally off. But it's already helped at least me think about the system.

In most cases this was all information the team had already written down or explained, but bringing it together and summarizing it provided a different, useful perspective, I think. Even if none of the docs get merged, the writing still helped build my own understanding.

Beyond the repo

Learning the project is just one aspect of onboarding. Beyond that, I joined the #cats channel and the #dogs channel, found some fellow New Yorkers and opened an NYC channel, and tried to find Zoom time with the various people I'd see hanging around common team Slack channels. I've been trying to meet not just devs but support folks, product managers, marketing folks, sales folks, and anyone else!

It's a matter of walking the line between scouring our docs, GitHub, Confluence, and Jira on my own, and bugging people with my incessant questions.

I've enjoyed my time at startups. I've been a dev, a manager, a founder, a cofounder. But I'm incredibly excited to be back, at a bigger company, full-time as a developer hacking on a database!

And what about you? What do you do to onboard yourself at a new company or new project?
