a curated list of database news from authoritative sources

February 08, 2024

An intuition for distributed consensus in OLTP systems

Distributed consensus in transactional databases (e.g. etcd or Cockroach) is a big deal these days. Most often under the hood are variations of log-based Paxos-like algorithms such as MultiPaxos, Viewstamped Replication, or Raft. While new variations come out each year, optimized for various workloads, these algorithms are fairly standard and well-understood.

In fact, they are used in so many places (Kubernetes, for example) that even if you never implement Raft yourself (which is fun, and I encourage it), it seems worth building an intuition for distributed consensus.

What happens as you tweak a configuration. What happens as the production environment changes. Or what to reach for as product requirements change.

I've been thinking about the basics of distributed consensus recently. There has been a lot to digest and characterize. And I'm only beginning to get an understanding.

This post is an attempt to share some of the intuition built up reading about and working in this space. Originally this post was also going to end with a walkthrough of my most recent Raft implementation in Rust. But I'm going to hold off on that for another time.

I was fortunate to have a few excellent reviewers looking at versions of this post: Paul Nowoczynski, Alex Miller, Jack Vanlightly, Daniel Chia, and Alex Petrov. Thank you!

Let's start with Raft.

Raft

Raft is a distributed consensus algorithm that allows you to build a replicated state machine on top of a replicated log.

A Raft library handles replicating and durably persisting a sequence (or log) of commands to at least a majority of nodes in a cluster. You provide the library a state machine that interprets the replicated commands. From the perspective of the Raft library, commands are just opaque byte strings.

For example, you could build a replicated key-value store out of SET and GET commands that are passed in by a client. You provide a Raft library state machine code that interprets the Raft log of SET and GET commands to modify or read from an in-memory hashtable. You can find concrete examples of exactly this replicated key-value store modeling in previous Raft posts I've written.
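
To make that concrete, here is a rough sketch in Rust of the state machine you might hand to a Raft library for this key-value store. The type names and the apply method are illustrative, not any particular library's API.

    use std::collections::HashMap;

    // To the Raft library, commands are opaque bytes; only the state machine
    // gives them meaning. Here the decoded form is a small enum.
    enum Command {
        Set { key: String, value: String },
        Get { key: String },
    }

    struct KvStateMachine {
        data: HashMap<String, String>,
    }

    impl KvStateMachine {
        // The Raft library calls this once per committed entry, in log order,
        // on every node, which is what keeps replicas identical.
        fn apply(&mut self, cmd: Command) -> Option<String> {
            match cmd {
                Command::Set { key, value } => {
                    self.data.insert(key, value);
                    None
                }
                Command::Get { key } => self.data.get(&key).cloned(),
            }
        }
    }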

All nodes in the cluster run the same Raft code (including the state machine code you provide) and communicate among themselves. Nodes elect a semi-permanent leader that accepts all reads and writes from clients. (Again, reads and writes are modeled as commands.)

To commit a new command to the cluster, clients send the command to all nodes in the cluster. Only the leader accepts this command, if there is currently a leader. Clients retry until there is a leader that accepts the command.

The leader appends the command to its log and makes sure to replicate all commands in its log to followers in the same order. The leader sends periodic heartbeat messages to all followers to prolong its term as leader. If a follower hasn't heard from the leader within a period of time, it becomes a candidate and requests votes from the cluster.

When a follower is asked to accept a new command from a leader, it checks whether its history is up-to-date with the leader's. If it is not, the follower rejects the request and asks the leader to send previous commands to bring it up-to-date. In the worst case, a follower that has lost all of its history is brought up-to-date by going all the way back to the very first command ever sent.
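
Here is a sketch of that consistency check; the struct and argument names are mine, not from any specific implementation.

    // Raft logs are 1-indexed; a prev_index of 0 means the empty prefix.
    struct Entry {
        term: u64,
        command: Vec<u8>,
    }

    // A follower accepts new entries only if its log already contains an entry
    // at prev_index with term prev_term. Otherwise it rejects the request so
    // the leader backs up and retries with earlier entries.
    fn log_matches(log: &[Entry], prev_index: u64, prev_term: u64) -> bool {
        if prev_index == 0 {
            return true; // the empty prefix always matches
        }
        match log.get(prev_index as usize - 1) {
            Some(entry) => entry.term == prev_term,
            None => false, // our log is too short; ask for earlier entries
        }
    }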

When a quorum (typically a majority) of nodes has accepted a command, the leader marks the command as committed and applies the command to its own state machine. When followers learn about newly committed commands, they also apply committed commands to their own state machine.
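
And here is a sketch of the commit rule from the leader's side, assuming the leader tracks, for each node (itself included), the highest log index it knows to be replicated there. Again, the names are mine.

    // match_index[i] = highest log index known to be durably replicated on
    // node i. The highest committed index is the largest index replicated on
    // a majority. (Real Raft adds one more rule: a leader only commits
    // entries from its own current term this way.)
    fn highest_committed_index(match_index: &[u64]) -> u64 {
        let mut sorted = match_index.to_vec();
        sorted.sort_unstable();
        // This element and everything after it (a majority of the cluster)
        // have replicated at least this index.
        sorted[(sorted.len() - 1) / 2]
    }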

For the most part, these details are graphically summarized in Figure 2 of the Raft paper.

Availability and linearizability

Taking a step back, distributed consensus helps a group of nodes, a cluster, agree on a value. A client of the cluster can treat a value from the cluster as if the value was atomically written to and read from a single thread. This property is called linearizability.

However, with distributed consensus, the client of the cluster gets better availability guarantees than if it atomically wrote to and read from a single thread. A single thread that crashes becomes unavailable. But a cluster implementing distributed consensus can have up to f nodes crash and still 1) be available and 2) provide linearizable reads and writes.

That is: distributed consensus solves the problem of high availability for a system while remaining linearizable.

Without distributed consensus you can still achieve high availability. For example, a database might have two read replicas. But a client reading from a read replica might get stale data. Thus, this system (a database with two read replicas) is not linearizable.

Without distributed consensus you can also try synchronous replication. It would be very simple to do: have a fixed leader and require all nodes to acknowledge before committing. But the value here is extremely limited. If a single node in the cluster goes down, the entire cluster is down.

You might think I'm proposing a strawman. We could simply designate a permanent leader that handles all reads and writes, and require a majority of nodes to commit a command before the leader responds to a client. But in that case, what's the process for getting a lagging follower up-to-date? And what happens if it is the leader who goes down?

Well, these are not trivial problems! And, beyond linearizability that we already mentioned, these problems are exactly what distributed consensus solves.

Why does linearizability matter?

It's very nice, and often even critical, to have a highly available system that will never give you stale data. And regardless, it's convenient to have a term for what we might naively think of as the "correct" way you'd always want to set and get a value.

So linearizability is a convenient way of thinking about complex systems, if you can use or build a system that supports it. But it's not the only consistency approach you'll see in the wild.

As you increase the guarantees of your consistency model, you tend to sacrifice performance. Going the opposite direction, some production systems sacrifice consistency to improve performance. For example, you might allow stale reads from any node, reading only from local state and avoiding consensus, so that you can reduce load on a leader and avoid the overhead of consensus.

There are formal definitions for weaker consistency models, including sequential consistency and read-your-writes. You can read the Jepsen page for more detail.

Best and worst case scenarios

A distributed system relies on communicating over the network. The worse the network, whether in terms of latency or reliability, the longer it will take for communication to happen.

Aside from the network, disks can misdirect writes or corrupt data. Or your disk could actually be network-attached storage such as EBS.

And processes themselves can crash due to low disk space or the OOM killer.

It will take longer to achieve consensus to commit messages in these scenarios. If messages take longer to reach nodes, or if nodes are constantly crashing, followers will time out more often, triggering leader elections. And leader election itself (which also requires consensus) will take longer too.

The best case scenario for distributed consensus is where the network is reliable and low-latency. Where disks are reliable and fast. And where processes don't often crash.

TigerBeetle has an incredible visual simulator that demonstrates what happens across ever-worsening environments. While TigerBeetle and this simulator use Viewstamped Replication, the demonstrated principles apply to Raft as well.

What happens when you add nodes?

Distributed consensus algorithms make sure that some minimum number of nodes in a cluster agree before continuing. The minimum number is proportional to the total number of nodes in the cluster.

A typical implementation of Raft for example will require 3 nodes in a 5-node cluster to agree before continuing. 4 nodes in a 7-node cluster. And so on.

Recall that the p99 latency for a service is at least as bad as the slowest external request the service must make. As you increase the number of nodes you must talk to in a consensus cluster, you increase the chance of a slow request.

Consider the extreme case of a 101-node cluster requiring 51 nodes to respond before returning to the client. That's 51 chances for a slower request. Compared to 4 chances in a 7-node cluster. The 101-node cluster is certainly more highly available though! It can tolerate 49 nodes going down. The 7-node cluster can only tolerate 3 nodes going down. The scenario where 49 nodes go down (assuming they're in different availability zones) seems pretty unlikely!
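
The arithmetic is simple enough to write out. A majority quorum for n nodes is floor(n/2) + 1, which tolerates floor((n - 1)/2) failures:

    fn majority(n: usize) -> usize {
        n / 2 + 1
    }

    fn tolerated_failures(n: usize) -> usize {
        (n - 1) / 2
    }

    fn main() {
        for n in [3, 5, 7, 101] {
            println!(
                "{} nodes: quorum of {}, tolerates {} failures",
                n,
                majority(n),
                tolerated_failures(n)
            );
        }
    }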

Horizontal scaling with distributed consensus? Not exactly

All of this is to say that the most popular algorithms for distributed consensus, on their own, have nothing to do with horizontal scaling.

The way that horizontally scaling databases like Cockroach or Yugabyte or Spanner work is by sharding the data, transparently to the client. Within each shard, data is replicated by a dedicated distributed consensus cluster.

So, yes, distributed consensus can be a part of horizontal scaling. But again what distributed consensus primarily solves is high availability via replication while remaining linearizable.

This is not a trivial point to make. etcd, consul, and rqlite are examples of databases that do not do sharding, only replication, via a single Raft cluster that replicates all data for the entire system.

For these databases there is no horizontal scaling. If they support "horizontal scaling", they support this by doing non-linearizable (stale) reads. Writes remain a challenge.

This doesn't mean these databases are bad. They are not. One obvious advantage they have over Cockroach or Spanner is that they are conceptually simpler. Conceptually simpler often equates to easier to operate. That's a big deal.

Optimizations

We've covered the basics of operation, but real-world implementations get more complex.

Snapshots

Rather than letting the log grow indefinitely, most libraries implement snapshotting. The user of the library provides a state machine and also provides a method for serializing the state machine to disk. The Raft library periodically serializes the state machine to disk and truncates the log.

When a follower is so far behind that the leader no longer has the log entries it needs (because they have been truncated), the leader transfers an entire snapshot to the follower. Once the follower has restored the snapshot, the leader can resume sending normal log entries.
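
Here is a sketch of the snapshot hooks a library might ask you to provide, with illustrative names and a deliberately naive serialization format:

    use std::collections::HashMap;
    use std::io;

    struct KvStateMachine {
        data: HashMap<String, String>,
    }

    impl KvStateMachine {
        // Serialize the whole state machine so the log up to the last applied
        // entry can be truncated.
        fn snapshot(&self) -> io::Result<Vec<u8>> {
            let mut out = String::new();
            for (key, value) in &self.data {
                out.push_str(&format!("{}\t{}\n", key, value));
            }
            Ok(out.into_bytes())
        }

        // Rebuild the state machine from a snapshot, e.g. one streamed from
        // the leader to a follower that has fallen too far behind.
        fn restore(&mut self, bytes: &[u8]) -> io::Result<()> {
            self.data.clear();
            for line in String::from_utf8_lossy(bytes).lines() {
                if let Some((key, value)) = line.split_once('\t') {
                    self.data.insert(key.to_string(), value.to_string());
                }
            }
            Ok(())
        }
    }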

This technique is described in the Raft paper. While it isn't necessary for Raft to work, it's so important that it is hardly an optimization and more a required part of a production Raft system.

Batching

Rather than limiting clients of the cluster to submitting only one command at a time, it's common for the cluster to accept many commands at a time. Similarly, many commands at a time are submitted to followers. When any node needs to write commands to disk, it can batch commands to disk as well.

But you can go a step beyond this in a way that is completely opaque to the Raft library. Each opaque command the client submits can also contain a batch of messages. In this scenario, only the user-provided state machine needs to be aware that each command it receives is actually a batch of messages that it should pull apart and interpret separately.

This latter technique is a fairly trivial way to increase throughput by an order of magnitude or two.
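
Here is a sketch of that application-level batching. The length-prefixed framing is an arbitrary choice of mine; the point is that only the state machine knows a command is really a batch.

    // Pack multiple messages into one opaque Raft command, each prefixed with
    // a little-endian u32 length.
    fn encode_batch(messages: &[Vec<u8>]) -> Vec<u8> {
        let mut command = Vec::new();
        for message in messages {
            command.extend_from_slice(&(message.len() as u32).to_le_bytes());
            command.extend_from_slice(message);
        }
        command
    }

    // The state machine unpacks the batch and applies each message separately.
    fn decode_batch(mut command: &[u8]) -> Vec<Vec<u8>> {
        let mut messages = Vec::new();
        while command.len() >= 4 {
            let mut len_buf = [0u8; 4];
            len_buf.copy_from_slice(&command[..4]);
            let len = u32::from_le_bytes(len_buf) as usize;
            messages.push(command[4..4 + len].to_vec());
            command = &command[4 + len..];
        }
        messages
    }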

Disk and network

In terms of how data is stored on disk and how it is sent over the network, there is obvious room for optimization.

A naive implementation might store JSON on disk and send JSON over the network. A slightly more optimized implementation might store binary data on disk and send binary data over the network.

Similarly, you can swap out your RPC implementation for gRPC, or introduce zlib to compress data sent over the network or written to disk.

You can swap out synchronous IO for libaio or io_uring or SPDK/DPDK.

A little tweak I made in my latest Raft implementation was to index log entries so that searching the log was not a linear operation. Another little tweak was to introduce a page cache to eliminate unnecessary disk reads. This increased throughput by an order of magnitude.
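
For what it's worth, the log index amounts to something like the following sketch: an in-memory map from log index to byte offset, so reading an entry is a seek rather than a scan. The on-disk entry framing here is assumed, not prescribed.

    use std::collections::BTreeMap;
    use std::fs::File;
    use std::io::{self, Read, Seek, SeekFrom};

    struct LogIndex {
        // log index -> byte offset of the entry in the log file
        offsets: BTreeMap<u64, u64>,
    }

    impl LogIndex {
        fn record(&mut self, index: u64, offset: u64) {
            self.offsets.insert(index, offset);
        }

        fn read_entry(&self, log_file: &mut File, index: u64) -> io::Result<Vec<u8>> {
            let offset = *self
                .offsets
                .get(&index)
                .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such entry"))?;
            log_file.seek(SeekFrom::Start(offset))?;
            // Assume entries on disk are length-prefixed with a little-endian u32.
            let mut len_buf = [0u8; 4];
            log_file.read_exact(&mut len_buf)?;
            let mut entry = vec![0u8; u32::from_le_bytes(len_buf) as usize];
            log_file.read_exact(&mut entry)?;
            Ok(entry)
        }
    }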

Flexible quorums

This brilliant optimization by Heidi Howard and co. shows you can relax the quorum required for committing new commands so long as you increase the quorum required for electing a leader.

In an environment where leader election doesn't happen often, flexible quorums can increase throughput and decrease latency. And it's a pretty easy change to make!
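
The safety condition is just that any commit (replication) quorum must intersect any leader election quorum. A sketch:

    // Flexible quorums: safe as long as data_quorum + election_quorum > n,
    // which guarantees the two kinds of quorums always overlap in at least
    // one node.
    fn quorums_are_safe(n: usize, data_quorum: usize, election_quorum: usize) -> bool {
        data_quorum + election_quorum > n
    }

    fn main() {
        assert!(quorums_are_safe(5, 3, 3)); // classic majorities
        assert!(quorums_are_safe(5, 2, 4)); // faster commits, more expensive elections
        assert!(!quorums_are_safe(5, 2, 3)); // unsafe: the quorums might not intersect
    }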

More

These are just a few common optimizations. You can also read about parallel state machine apply, parallel append to disk, witnesses, compartmentalization, and leader leases. TiKV, Scylla, RedPanda, and Cockroach tend to have public material talking about this stuff.

There are also a few people I follow who often review relevant papers when they are not producing their own. I encourage you to follow them too if this is interesting to you.

Safety and testing

The other aspect to consider is safety. For example, checksums for everything written to disk and passed over the network; or being able to recover from corruption in the log.
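
As a sketch of the checksumming piece (using the crc32fast crate, which is my choice of dependency, not something prescribed here): frame every entry with a checksum on write, then recompute and compare on read.

    // Frame a log entry (or network message) as: crc32 | length | payload.
    fn frame_entry(payload: &[u8]) -> Vec<u8> {
        let checksum = crc32fast::hash(payload);
        let mut framed = Vec::with_capacity(payload.len() + 8);
        framed.extend_from_slice(&checksum.to_le_bytes());
        framed.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        framed.extend_from_slice(payload);
        framed
    }

    // Returns the payload only if the stored checksum matches; None means
    // corruption was detected and the caller must decide how to recover.
    fn verify_entry(framed: &[u8]) -> Option<&[u8]> {
        let mut checksum_buf = [0u8; 4];
        checksum_buf.copy_from_slice(framed.get(..4)?);
        let mut len_buf = [0u8; 4];
        len_buf.copy_from_slice(framed.get(4..8)?);
        let len = u32::from_le_bytes(len_buf) as usize;
        let payload = framed.get(8..8 + len)?;
        if crc32fast::hash(payload) == u32::from_le_bytes(checksum_buf) {
            Some(payload)
        } else {
            None
        }
    }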

Testing is also a big deal. There are prominent tools like Jepsen that check for consistency in the face of fault injection (process failure, network failure, etc.). But even Jepsen has its limits. For example, it doesn't test disk failure.

FoundationDB made popular a number of testing techniques. And the people behind this testing went on to build a product, Antithesis, around deterministic testing of non-deterministic code while injecting faults.

And on that topic there's Facebook Experimental's Hermit deterministic Linux hypervisor that may help to test complex distributed systems. However, my experience with it has not been great and the maintainers do not seem very engaged with other people who have reported bugs. I'm hopeful for it but we'll see.

Antithesis and Hermit seem like a boon when half the trouble of working on distributed consensus implementations is avoiding flaky tests.

Another promising avenue is emitting logs during the Raft lifecycle and validating the logs against a TLA+ spec. Microsoft has such a project that has begun to see adoption among open-source Raft implementations.

Conclusion

Everything aside, consensus is expensive. There is overhead to the entire consensus process. So if you do not need this level of availability and can settle for some recovery process based on backups, your system will have lower latency and higher throughput than if it had to go through distributed consensus.

If you do need high availability, distributed consensus can be a great choice. But consider the environment and what you want from your consensus algorithm.

Also, while MultiPaxos, Raft, and Viewstamped Replication are some of the most popular algorithms for distributed consensus, there is a world beyond. Two-phase commit, ZooKeeper Atomic Broadcast, PigPaxos, EPaxos, Accord by Cassandra. The world of distributed consensus also gets especially weird and interesting outside of OLTP systems.

But that's enough for one post.

February 06, 2024

OAuth applications are now available to everyone

You can now build integrations that seamlessly authenticate with PlanetScale and allow management access to your users’ organizations and databases from your application.

February 05, 2024

Deprecating the Scaler plan

Today, in our effort to continue being the best database for serverless and applications that require massive scale, we are deprecating the Scaler plan.

February 04, 2024

How to Replace Your CPAP In Only 666 Days

This story is not practical advice. For me, it’s closing the book on an almost two-year saga. For you, I hope it’s an enjoyable bit of bureaucratic schadenfreude. For Anthem, I hope it’s the subject of a series of painful but transformative meetings. This is not an isolated event. I’ve had dozens of struggles with Anthem customer support, and they all go like this.

If you’re looking for practical advice: it’s this. Be polite. Document everything. Keep a log. Follow the claims process. Check the laws regarding insurance claims in your state. If you pass the legally-mandated deadline for your claim, call customer service. Do not allow them to waste a year of your life, or force you to resubmit your claim from scratch. Initiate a complaint with your state regulators, and escalate directly to Gail Boudreaux’s team–or whoever Anthem’s current CEO is.

To start, experience an equipment failure.

Use your CPAP daily for six years. Wake up on day zero with it making a terrible sound. Discover that the pump assembly is failing. Inquire with Anthem Ohio, your health insurer, about how to have it repaired. Allow them to refer you to a list of local durable medical equipment providers. Start calling down the list. Discover half the list are companies like hair salons. Eventually reach a company in your metro which services CPAPs. Discover they will not repair broken equipment unless a doctor tells them to.

Leave a message with your primary care physician. Call the original sleep center that provided your CPAP. Discover they can’t help, since you’re no longer in the same state. Return to your primary, who can’t help either, because he had nothing to do with your prescription. Put the sleep center and your primary in touch, and ask them to talk.

On day six, call your primary to check in. He’s received a copy of your sleep records, and has forwarded them to a local sleep center you haven’t heard of. They, in turn, will talk to Anthem for you.

On day 34, receive an approval letter labeled “confirmation of medical necessity” from Anthem, directed towards the durable medical equipment company. Call that company and confirm you’re waitlisted for a new CPAP. They are not repairable. Begin using your partner’s old CPAP, which is not the right class of device, but at least it helps.

Over the next 233 days, call that medical equipment company regularly. Every time, inquire whether there’s been any progress, and hear “we’re still out of stock”. Ask them what the manufacturer backlog might be, how many people are ahead of you in line, how many CPAPs they do receive per month, or whether anyone has ever received an actual device from them. They won’t answer any questions. Realize they are never going to help you.

On day 267, realize there is no manufacturer delay. The exact machine you need is in stock on CPAP.com. Check to make sure there’s a claims process for getting reimbursed by Anthem. Pay over three thousand dollars for it. When it arrives, enjoy being able to breathe again.

On day 282, follow CPAP.com’s documentation to file a claim with Anthem online. Include your prescription, receipt, shipping information, and the confirmation of medical necessity Anthem sent you.

On day 309, open the mail to discover a mysterious letter from Anthem. They’ve received your appeal. You do not recall appealing anything. There is no information about what might have been appealed, but something will happen within 30-60 days. There is nothing about your claim.

On day 418, emerge from a haze of lead, asbestos, leaks, and a host of other home-related nightmares; remember Anthem still hasn’t said anything about your claim. Discover your claim no longer appears on Anthem’s web site. Call Anthem customer service. They have no record of your claim either. Ask about the appeal letter you received. Listen, gobsmacked, as they explain that they decided your claim was in fact an appeal, and transferred it immediately to the appeals department. The appeals department examined the appeal and looked for the claim it was appealing. Finding none, they decided the appeal was moot, and rejected it. At no point did anyone inform you of this. Explain to Anthem’s agent that you filed a claim online, not an appeal. At their instruction, resign yourself to filing the entire claim again, this time using a form via physical mail. Include a detailed letter explaining the above.

On day 499, retreat from the battle against home entropy to call Anthem again. Experience a sense of growing dread as the customer service agent is completely unable to locate either of your claims. After a prolonged conversation, she finds it using a different tool. There is no record of the claim from day 418. There was a claim submitted on day 282. Because the claim does not appear in her system, there is no claim. Experience the cognitive equivalent of the Poltergeist hallway shot as the agent tells you “Our members are not eligible for charges for claim submission”.

Hear the sentence “There is a claim”. Hear the sentence “There is no claim”. Write these down in the detailed log you’ve been keeping of this unfurling Kafkaesque debacle. Ask again if there is anyone else who can help. There is no manager you can speak to. There is no tier II support. “I’m the only one you can talk to,” she says. Write that down.

Call CPAP.com, which has a help line staffed by caring humans. Explain that contrary to their documentation, Anthem now says members cannot file claims for equipment directly. Ask if they are the provider. Discover the provider for the claim is probably your primary care physician, who has no idea this is happening. Leave a message with him anyway. Leave a plaintive message with your original sleep center for good measure.

On day 502, call your sleep center again. They don’t submit claims to insurance, but they confirm that some people do successfully submit claims to Anthem using the process you’ve been trying. They confirm that Anthem is, in fact, hot garbage. Call your primary, send them everything you have, and ask if they can file a claim for you.

On day 541, receive a letter from Anthem, responding to your inquiry. You weren’t aware you filed one.

Please be informed that we have received your concern. Upon review we have noticed that there is no claim billed for the date of service mentioned in the submitted documents, Please provide us with a valid claim. If not submitted,provide us with a valid claim iamge to process your claim further.

Stare at the letter, typos and all. Contemplate your insignificance in the face of the vast and uncaring universe that is Anthem.

On day 559, steel your resolve and call Anthem again. Wait as this representative, too, digs for evidence of a claim. Listen with delight as she finds your documents from day 282. Confirm that yes, a claim definitely exists. Have her repeat that so you can write it down. Confirm that the previous agent was lying: members can submit claims. At her instruction, fill out the claim form a third time. Write a detailed letter, this time with a Document Control Number (DCN). Submit the entire package via registered mail. Wait for USPS to confirm delivery eight days later.

On day 588, having received no response, call Anthem again. Explain yourself. You’re getting good at this. Let the agent find a reference number for an appeal, but not the claim. Incant the magic DCN, which unlocks your original claim. “I was able to confirm that this was a claim submitted form for a member,” he says. He sees your claim form, your receipts, your confirmation of medical necessity. However: “We still don’t have the claim”.

Wait for him to try system after system. Eventually he confirms what you heard on day 418: the claims department transferred your claims to appeals. “Actually this is not an appeal, but it was denied as an appeal.” Agree as he decides to submit your claim manually again, with the help of his supervisor. Write down the call ref number: he promises you’ll receive an email confirmation, and an Explanation of Benefits in 30-40 business days.

“I can assure you this is the last time you are going to call us regarding this.”

While waiting for this process, recall insurance is a regulated industry. Check the Ohio Revised Code. Realize that section 3901.381 establishes deadlines for health insurers to respond to claims. They should have paid or denied each of your claims within 30 days–45 if supporting documentation was required. Leave a message with the Ohio Department of Insurance’s Market Conduct Division. File an insurance complaint with ODI as well.

Grimly wait as no confirmation email arrives.

On day 602, open an email from Anthem. They are “able to put the claim in the system and currenty on processed [sic] to be applied”. They’re asking for more time. Realize that Anthem is well past the 30-day deadline under the Ohio Revised Code for all three iterations of your claim.

On day 607, call Anthem again. The representative explains that the claim will be received and processed as of your benefits. She asks you to allow 30-45 days from today. Quote section 3901.381 to her. She promises to expedite the request; it should be addressed within 72 business hours. Like previous agents, she promises to call you back. Nod, knowing she won’t.

On day 610, email the Ohio Department of Insurance to explain that Anthem has found entirely new ways to avoid paying their claims on time. It’s been 72 hours without a callback; call Anthem again. She says “You submitted a claim and it was received” on day 282. She says the claim was expedited. Ask about the status of that expedited resolution. “Because on your plan we still haven’t received any claims,” she explains. Wonder if you’re having a stroke.

Explain that it has been 328 days since you submitted your claim, and ask what is going on. She says that since the first page of your mailed claim was a letter, that might have caused it to be processed as an appeal. Remind yourself Anthem told you to enclose that letter. Wait as she attempts to refer you to the subrogation department, until eventually she gives up: the subrogation department doesn’t want to help.

Call the subrogation department yourself. Allow Anthem’s representative to induce in you a period of brief aphasia. She wants to call a billing provider. Try to explain there is none: you purchased the machine yourself. She wants to refer you to collections. Wonder why on earth Anthem would want money from you. Write down “I literally can’t understand what she thinks is going on” in your log. Someone named Adrian will call you by tomorrow.

Contemplate alternative maneuvers. Go on a deep Google dive, searching for increasingly obscure phrases gleaned from Anthem’s bureaucracy. Trawl through internal training PDFs for Anthem’s ethics and compliance procedures. Call their compliance hotline: maybe someone cares about the law. It’s a third-party call center for Elevance Health. Fail to realize this is another name for Anthem. Begin drawing a map of Anthem’s corporate structure.

From a combination of publicly-available internal slide decks, LinkedIn, and obscure HR databases, discover the name, email, and phone number of Anthem’s Chief Compliance Officer. Call her, but get derailed by an internal directory that requires a 10-digit extension. Try the usual tricks with automated phone systems. No dice.

Receive a call from an Anthem agent. Ask her what happened to “72 hours”. She says there’s been no response from the adjustments team. She doesn’t know when a response will come. There’s no one available to talk to. Agree to speak to another representative tomorrow. It doesn’t matter: they’ll never call you.

Do more digging. Guess the CEO’s email from what you can glean of Anthem’s account naming scheme. Write her an email with a short executive summary and a detailed account of the endlessly-unfolding Boschian hellscape in which her company has entrapped you. A few hours later, receive an acknowledgement from an executive concierge at Elevance (Anthem). It’s polite, formal, and syntactically coherent. She promises to look into things. Smile. Maybe this will work.

On day 617, receive a call from the executive concierge. 355 days after submission, she’s identified a problem with your claim. CPAP.com provided you with an invoice with a single line item (the CPAP) and two associated billing codes (a CPAP and humidifier). Explain that they are integrated components of a single machine. She understands, but insists you need a receipt with multiple line items for them anyway. Anthem has called CPAP.com, but they can’t discuss an invoice unless you call them. Explain you’ll call them right now.

Call CPAP.com. Their customer support continues to be excellent. Confirm that it is literally impossible to separate the CPAP and humidifier, or to produce an invoice with two line items for a single item. Nod as they ask what the hell Anthem is doing. Recall that this is the exact same machine Anthem covered for you eight years ago. Start a joint call with the CPAP.com representative and Anthem’s concierge. Explain the situation to her voicemail.

On day 623, receive a letter from ODI. Anthem has told ODI this was a problem with the billing codes, and ODI does not intervene in billing code issues. They have, however, initiated a secretive second investigation. There is no way to contact the second investigator.

Write a detailed email to the concierge and ODI explaining that it took over three hundred days for Anthem to inform you of this purported billing code issue. Explain again that it is a single device. Emphasize that Anthem has been handling claims for this device for roughly a decade.

Wait. On day 636, receive a letter from Anthem’s appeals department. They’ve received your request for an appeal. You never filed one. They want your doctor or facility to provide additional information to Carelon Medical Benefits Management. You have never heard of Carelon. There is no explanation of how to reach Carelon, or what information they might require. The letter concludes: “There is currently no authorization on file for the services rendered.” You need to seek authorization from a department called “Utilization Management”.

Call the executive concierge again. Leave a voicemail asking what on earth is going on.

On day 637, receive an email: she’s looking into it.

On day 644, Anthem calls you. It’s a new agent who is immensely polite. Someone you’ve never heard of was asked to work on another project, so she’s taking over your case. She has no updates yet, but promises to keep in touch.

She does so. On day 653, she informs you Anthem will pay your claim in full. On day 659, she provides a check number. On day 666, the check arrives.

Deposit the check. Write a thank you email to the ODI and Anthem’s concierge. Write this, too, down in your log.

January 31, 2024

Databases at scale

Learn about the three main aspects of database scaling: storage, compute, and network.