Distributed consensus in transactional databases (e.g. etcd or
Cockroach) is a big deal these days. Most often under the hood are
variations of log-based Paxos-like algorithms such as MultiPaxos,
Viewstamped Replication, or Raft. While there are new variations that
come out each year, optimizing for various workloads, these algorithms
are fairly standard and well-understood.
In fact they are used in so many places, Kubernetes for example, that
even if you don't decide to implement Raft (which is fun and I
encourage it), it seems worth building an intuition for distributed
consensus: what happens as you tweak a configuration, what happens as
the production environment changes, and what to reach for as product
requirements change.
I've been thinking about the basics of distributed consensus
recently. There has been a lot to digest and characterize. And I'm
only beginning to get an understanding.
This post is an attempt to share some of the intuition built up
reading about and working in this space. Originally this post was also
going to end with a walkthrough of my most recent Raft implementation
in Rust. But I'm going to hold off on that for another time.
I was fortunate to have a few excellent reviewers looking at versions
of this post: Paul Nowoczynski, Alex Miller, Jack Vanlightly, Daniel
Chia, and Alex Petrov. Thank you!
Let's start with Raft.
Raft
Raft is a distributed consensus algorithm that allows you to build a
replicated state machine on top of a replicated log.
A Raft library handles replicating and durably persisting a sequence
(or log) of commands to at least a majority of nodes in a
cluster. You provide the library a state machine that interprets the
replicated commands. From the perspective of the Raft library,
commands are just opaque byte strings.
For example, you could build a replicated key-value store out of SET
and GET commands that are passed in by a client. You provide the Raft
library state machine code that interprets the Raft log of SET and GET
commands to modify or read from an in-memory hashtable. You can find
concrete examples of exactly this replicated key-value store modeling
in previous Raft posts I've written.
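To make this concrete, here is a minimal sketch in Rust of what the
state machine side might look like. The StateMachine trait, the
KvStore type, and the plain-text command encoding are all hypothetical,
made up here for illustration rather than taken from any particular
Raft library.

```rust
use std::collections::HashMap;

// A hypothetical interface a Raft library might ask you to implement.
trait StateMachine {
    // Called for each committed command; returns bytes sent back to the client.
    fn apply(&mut self, command: &[u8]) -> Vec<u8>;
}

struct KvStore {
    data: HashMap<String, String>,
}

impl StateMachine for KvStore {
    fn apply(&mut self, command: &[u8]) -> Vec<u8> {
        // Commands are opaque bytes to the Raft library. Only this code knows
        // they encode "SET key value" or "GET key" as plain text.
        let text = String::from_utf8_lossy(command);
        let mut parts = text.splitn(3, ' ');
        match (parts.next(), parts.next(), parts.next()) {
            (Some("SET"), Some(key), Some(value)) => {
                self.data.insert(key.to_string(), value.to_string());
                b"OK".to_vec()
            }
            (Some("GET"), Some(key), None) => self
                .data
                .get(key)
                .map(|value| value.as_bytes().to_vec())
                .unwrap_or_default(),
            _ => b"ERR".to_vec(),
        }
    }
}
```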
All nodes in the cluster run the same Raft code (including the state
machine code you provide), communicating among themselves. Nodes elect
a semi-permanent leader that accepts all reads and writes from
clients. (Again, reads and writes are modeled as commands).
To commit a new command to the cluster, clients send the command to
all nodes in the cluster. Only the leader accepts this command, if
there is currently a leader. Clients retry until there is a leader
that accepts the command.
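As a rough sketch (with a hypothetical Client type and the actual RPC
elided), the client side might look something like this: cycle through
the nodes until the current leader accepts the command, backing off
between passes.

```rust
use std::thread::sleep;
use std::time::Duration;

struct Client {
    node_addresses: Vec<String>,
}

impl Client {
    fn submit(&self, command: &[u8]) -> Vec<u8> {
        loop {
            for address in &self.node_addresses {
                // `try_submit` stands in for an RPC that only the current
                // leader accepts; followers (and candidates) reject it.
                if let Some(response) = self.try_submit(address, command) {
                    return response;
                }
            }
            // No leader right now (e.g. an election is in progress). Retry.
            sleep(Duration::from_millis(100));
        }
    }

    fn try_submit(&self, _address: &str, _command: &[u8]) -> Option<Vec<u8>> {
        unimplemented!("network call elided")
    }
}
```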
The leader appends the command to its log and makes sure to replicate
all commands in its log to followers in the same order. The leader
sends periodic heartbeat messages to all followers to prolong its term
as leader. If a follower hasn't heard from the leader within a period
of time, it becomes a candidate and requests votes from the cluster.
When a follower is asked to accept a new command from a leader, it
checks if its history is up-to-date with the leader. If it is not, the
follower rejects the request and asks the leader to send previous
commands to bring it up-to-date. In the worst case, for a follower
that has lost all of its history, this means going all the way back to
the very first command ever sent.
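A simplified sketch of that follower-side consistency check, using
hypothetical request and log types rather than any particular
library's API:

```rust
struct LogEntry {
    term: u64,
    command: Vec<u8>,
}

struct AppendEntriesRequest {
    term: u64,
    prev_log_index: u64, // 1-based index just before the new entries; 0 means none
    prev_log_term: u64,  // term of the entry at prev_log_index
    entries: Vec<LogEntry>,
}

struct Follower {
    current_term: u64,
    log: Vec<LogEntry>,
}

impl Follower {
    // Returns false to tell the leader to retry with earlier entries.
    fn handle_append_entries(&mut self, mut request: AppendEntriesRequest) -> bool {
        if request.term < self.current_term {
            return false; // Stale leader.
        }
        // The follower must already have the entry just before the new ones,
        // and its term must match what the leader expects.
        if request.prev_log_index > 0 {
            match self.log.get(request.prev_log_index as usize - 1) {
                Some(entry) if entry.term == request.prev_log_term => {}
                _ => return false, // Ask the leader to send earlier entries.
            }
        }
        // Histories agree up to prev_log_index: drop any conflicting suffix
        // and append the leader's entries.
        self.log.truncate(request.prev_log_index as usize);
        self.log.append(&mut request.entries);
        true
    }
}
```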
When a quorum (typically a majority) of nodes has accepted a command,
the leader marks the command as committed and applies the command to
its own state machine. When followers learn about newly committed
commands, they also apply committed commands to their own state machine.
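On the leader side, deciding what is committed comes down to finding
the highest log index that a quorum has durably accepted. A sketch
with illustrative field names (Raft's additional rule that the entry
must be from the leader's current term is noted but elided):

```rust
struct Leader {
    commit_index: u64,
    my_last_index: u64,
    // Highest log index known to be replicated on each follower.
    follower_match_index: Vec<u64>,
}

impl Leader {
    fn advance_commit_index(&mut self) {
        let mut indexes = self.follower_match_index.clone();
        indexes.push(self.my_last_index); // The leader counts toward the quorum.
        // Sort descending; the value at position (majority - 1) is the highest
        // index that at least a majority of nodes have accepted.
        indexes.sort_unstable_by(|a, b| b.cmp(a));
        let majority = indexes.len() / 2 + 1;
        let quorum_index = indexes[majority - 1];
        // Real Raft also requires the entry at quorum_index to be from the
        // leader's current term before committing it; that check is elided.
        if quorum_index > self.commit_index {
            self.commit_index = quorum_index;
            // Apply newly committed entries to the state machine here, and let
            // followers learn the new commit index via the next heartbeat.
        }
    }
}
```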
For the most part, these details are graphically summarized in Figure
2 of the Raft paper.
Availability and linearizability
Taking a step back, distributed consensus helps a group of nodes, a
cluster, agree on a value. A client of the cluster can treat a value
from the cluster as if the value was atomically written to and read
from a single thread. This property is called
linearizability.
However, with distributed consensus, the client of the cluster gets
better availability guarantees than if it atomically wrote to or read
from a single thread. A single thread that crashes becomes
unavailable. But a cluster implementing distributed consensus can have
some number f of nodes crash and still 1) be available and 2) provide
linearizable reads and writes.
That is: distributed consensus solves the problem of high
availability for a system while remaining linearizable.
Without distributed consensus you can still achieve high
availability. For example, a database might have two read
replicas. But a client reading from a read replica might get stale
data. Thus, this system (a database with two read replicas) is not
linearizable.
Without distributed consensus you can also try synchronous
replication. It would be very simple to do: have a fixed leader and
require all nodes to acknowledge before committing. But the value here
is extremely limited. If a single node in the cluster goes down, the
entire cluster is down.
You might think I'm proposing a strawman. We could simply designate a
permanent leader that handles all reads and writes; and require a
majority of nodes to commit a command before the leader responds to a
client. But in that case, what's the process for getting a lagging
follower up-to-date? And what happens if it is the leader who goes
down?
Well, these are not trivial problems! And, beyond linearizability that
we already mentioned, these problems are exactly what distributed
consensus solves.
Why does linearizability matter?
It's very nice, and often even critical, to have a highly available
system that will never give you stale data. And regardless, it's
convenient to have a term for what we might naively think of as the
"correct" way you'd always want to set and get a value.
So linearizability is a convenient way of thinking about complex
systems, if you can use or build a system that supports it. But it's
not the only consistency approach you'll see in the wild.
As you increase the guarantees of your consistency model, you tend to
sacrifice performance. Going the opposite direction, some production
systems sacrifice consistency to improve performance. For example, you
might allow stale reads from any node, reading only from local state
and avoiding consensus, so that you can reduce load on a leader and
avoid the overhead of consensus.
There are formal definitions for weaker consistency models, including
sequential consistency and read-your-writes. You can read the Jepsen
page for more detail.
Best and worst case scenarios
A distributed system relies on communicating over the network. The
worse the network, whether in terms of latency or reliability, the
longer it will take for communication to happen.
Aside from the network, disks can misdirect writes or corrupt
data. Or your disk might actually be network-attached storage such as
EBS.
And processes themselves can crash due to low disk space or the OOM
killer.
It will take longer to achieve consensus to commit messages in these
scenarios. If messages take longer to reach nodes, or if nodes are
constantly crashing, followers will timeout more often, triggering
leader election. And the leader election itself (which also requires
consensus) will also take longer.
The best case scenario for distributed consensus is where the network
is reliable and low-latency. Where disks are reliable and fast. And
where processes don't often crash.
TigerBeetle has an incredible visual
simulator that demonstrates what
happens across ever-worsening environments. While TigerBeetle and this
simulator use Viewstamped Replication, the demonstrated principles
apply to Raft as well.
What happens when you add nodes?
Distributed consensus algorithms make sure that some minimum number of
nodes in a cluster agree before continuing. The minimum number is
proportional to the total number of nodes in the cluster.
A typical implementation of Raft for example will require 3 nodes in a
5-node cluster to agree before continuing. 4 nodes in a 7-node
cluster. And so on.
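The arithmetic behind "3 of 5, 4 of 7" is just a majority: floor(n/2)
+ 1 nodes must agree, and the cluster tolerates the remaining nodes
failing. A quick sketch:

```rust
fn majority(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

fn fault_tolerance(cluster_size: usize) -> usize {
    cluster_size - majority(cluster_size)
}

fn main() {
    for cluster_size in [3, 5, 7, 101] {
        println!(
            "{} nodes: quorum of {}, tolerates {} failures",
            cluster_size,
            majority(cluster_size),
            fault_tolerance(cluster_size)
        );
    }
}
```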
Recall that the p99 latency for a service is at least as bad as the
slowest external request the service must make. As you increase the
number of nodes you must talk to in a consensus cluster, you increase
the chance of a slow request.
Consider the extreme case of a 101-node cluster requiring 51 nodes to
respond before returning to the client. That's 51 chances for a slower
request. Compared to 4 chances in a 7-node cluster. The 101-node
cluster is certainly more highly available though! It can tolerate 50
nodes going down. The 7-node cluster can only tolerate 3 nodes going
down. The scenario where 50 nodes go down (assuming they're in
different availability zones) seems pretty unlikely!
Horizontal scaling with distributed consensus? Not exactly
All of this is to say that the most popular algorithms for distributed
consensus, on their own, have nothing to do with horizontal scaling.
The way that horizontally scaling databases like Cockroach or Yugabyte
or Spanner work is by sharding the data, transparent to the
client. Within each shard data is replicated with a dedicated
distributed consensus cluster.
So, yes, distributed consensus can be a part of horizontal
scaling. But again what distributed consensus primarily solves is high
availability via replication while remaining linearizable.
This is not a trivial point to make. etcd, consul, and rqlite are
examples of databases that do not do sharding, only replication, via a
single Raft cluster that replicates all data for the entire system.
For these databases there is no horizontal scaling. If they support
"horizontal scaling", they support this by doing non-linearizable
(stale) reads. Writes remain a challenge.
This doesn't mean these databases are bad. They are not. One obvious
advantage they have over Cockroach or Spanner is that they are
conceptually simpler. Conceptually simpler often equates to easier to
operate. That's a big deal.
Optimizations
We've covered the basics of operation, but real-world implementations
get more complex.
Snapshots
Rather than letting the log grow indefinitely, most libraries
implement snapshotting. The user of the library provides a state
machine and also provides a method for serializing the state machine
to disk. The Raft library periodically serializes the state machine to
disk and truncates the log.
When a follower is so far behind that the leader no longer has a log
entry (because it has been truncated), the leader transfers an entire
snapshot to the follower. Then once the follower is caught up on
snapshots, the leader can transfer normal log entries again.
This technique is described in the Raft paper. While it isn't
necessary for Raft to work, it's so important that it is hardly an
optimization and more a required part of a production Raft system.
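In code, snapshotting often shows up as a second user-provided hook
alongside the state machine. A hypothetical sketch (the trait and
method names here are illustrative, not a specific library's API):

```rust
use std::io::{Read, Write};

trait Snapshot {
    // Serialize the entire state machine (e.g. the key-value hashtable) so
    // the library can persist it and truncate the log up to that point.
    fn save(&self, writer: &mut dyn Write) -> std::io::Result<()>;

    // Rebuild the state machine from a snapshot, e.g. on a follower that fell
    // too far behind and received a full snapshot from the leader.
    fn restore(&mut self, reader: &mut dyn Read) -> std::io::Result<()>;
}
```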
Batching
Rather than limiting clients of the cluster to submitting only one
command at a time, it's common for the cluster to accept many commands
at a time. Similarly, many commands at a time are submitted to
followers. When any node needs to write commands to disk, it can batch
commands to disk as well.
But you can go a step beyond this in a way that is completely opaque
to the Raft library. Each opaque command the client submits can also
contain a batch of messages. In this scenario, only the user-provided
state machine needs to be aware that each command it receives is
actually a batch of messages that it should pull apart and interpret
separately.
This latter technique is a fairly trivial way to increase throughput
by an order of magnitude or two.
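A sketch of that second kind of batching, using a simple
length-prefixed encoding chosen purely for illustration. The Raft
library still sees one opaque command; only the state machine's apply
code needs to call something like decode_batch.

```rust
// Client side: pack many messages into one opaque command.
fn encode_batch(messages: &[&[u8]]) -> Vec<u8> {
    let mut command = Vec::new();
    for message in messages {
        command.extend_from_slice(&(message.len() as u32).to_le_bytes());
        command.extend_from_slice(message);
    }
    command
}

// State machine side: split a committed command back into individual
// messages and interpret each one separately.
fn decode_batch(mut command: &[u8]) -> Vec<Vec<u8>> {
    let mut messages = Vec::new();
    while command.len() >= 4 {
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&command[..4]);
        let len = u32::from_le_bytes(len_bytes) as usize;
        messages.push(command[4..4 + len].to_vec());
        command = &command[4 + len..];
    }
    messages
}
```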
Disk and network
In terms of how data is stored on disk and how data is sent over the
network there is obvious room for optimization.
A naive implementation might store JSON on disk and send JSON over the
network. A slightly more optimized implementation might store binary
data on disk and send binary data over the network.
Similarly, you can swap out your RPC implementation for gRPC, or
introduce zlib compression for data sent over the network or written
to disk.
You can swap out synchronous IO for libaio or io_uring or SPDK/DPDK.
A little tweak I made in my latest Raft implementation was to index
log entries so searching the log was not a linear operation. Another
little tweak was to introduce a page cache to eliminate unnecessary
disk reads. This increased throughput by an order of magnitude.
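The indexing idea, sketched roughly (this is not the exact code from
my implementation): keep an in-memory map from log index to file
offset so finding an entry is a lookup plus a seek, not a linear scan
of the log file.

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

struct LogIndex {
    offsets: BTreeMap<u64, u64>, // log index -> byte offset in the log file
}

impl LogIndex {
    // Called whenever an entry is appended to the log file.
    fn record(&mut self, log_index: u64, file_offset: u64) {
        self.offsets.insert(log_index, file_offset);
    }

    // Assumes each entry is stored with a u32 length prefix.
    fn read_entry(&self, file: &mut File, log_index: u64) -> std::io::Result<Vec<u8>> {
        let offset = *self.offsets.get(&log_index).expect("entry not in log");
        file.seek(SeekFrom::Start(offset))?;
        let mut len_bytes = [0u8; 4];
        file.read_exact(&mut len_bytes)?;
        let mut entry = vec![0u8; u32::from_le_bytes(len_bytes) as usize];
        file.read_exact(&mut entry)?;
        Ok(entry)
    }
}
```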
Flexible quorums
This brilliant optimization by
Heidi Howard and co. shows you can relax the quorum required for
committing new commands so long as you increase the quorum required
for electing a leader.
In an environment where leader election doesn't happen often, flexible
quorums can increase throughput and decrease latency. And it's a
pretty easy change to make!
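The core constraint from the flexible quorums result is that the
leader election quorum and the commit quorum only need to overlap. For
simple counting quorums, that means their sizes must sum to more than
the cluster size. A tiny sketch of the check:

```rust
fn quorums_are_safe(cluster_size: usize, election_quorum: usize, commit_quorum: usize) -> bool {
    // Any election quorum and any commit quorum must share at least one node.
    election_quorum + commit_quorum > cluster_size
}

fn main() {
    // Classic majority quorums in a 5-node cluster: 3 + 3 > 5.
    assert!(quorums_are_safe(5, 3, 3));
    // Flexible: commit with just 2 nodes if elections require 4: 4 + 2 > 5.
    assert!(quorums_are_safe(5, 4, 2));
    // Unsafe: 3 + 2 = 5, so the two quorums might not overlap.
    assert!(!quorums_are_safe(5, 3, 2));
}
```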
More
These are just a few common optimizations. You can also read about
parallel state machine apply, parallel append to disk, witnesses,
compartmentalization, and leader leases. TiKV, Scylla, RedPanda, and
Cockroach tend to have public material talking about this stuff.
There are also a few people I follow who are often reviewing relevant
papers, if they are not producing their own. I encourage you to follow
them too if this is interesting to you.
Safety and testing
The other aspect to consider is safety. For example, checksums for
everything written to disk and passed over the network; or being able
to recover from corruption in the log.
Testing is also a big deal. There are prominent tools like
Jepsen that check for consistency in the face of
fault injection (process failure, network failure, etc.). But even
Jepsen has its limits. For example, it doesn't test disk failure.
FoundationDB made popular a number of testing techniques. And the
people behind this testing went on to build a product, Antithesis,
around deterministic testing of non-deterministic code while injecting
faults.
And on that topic there's Facebook Experimental's
Hermit deterministic
Linux hypervisor that may help to test complex distributed
systems. However, my experience with it has not been great and the
maintainers do not seem very engaged with other people who have
reported bugs. I'm hopeful for it but we'll see.
Antithesis and Hermit seem like a boon when half the trouble of
working on distributed consensus implementations is avoiding flaky
tests.
Another promising avenue is emitting logs during the Raft lifecycle
and validating the logs against a TLA+ spec. Microsoft has such a
project that has begun to see
adoption among
open-source Raft implementations.
Conclusion
Everything aside, consensus is expensive. There is overhead to the
entire consensus process. So if you do not need this level of
availability and can settle for something simpler, like restoring from
backups, that system is going to have lower latency and higher
throughput than one that has to go through distributed consensus.
If you do need high availability, distributed consensus can be a great
choice. But consider the environment and what you want from your
consensus algorithm.
Also, while MultiPaxos, Raft, and Viewstamped Replication are some of
the most popular algorithms for distributed consensus, there is a
world beyond. Two-phase commit, ZooKeeper Atomic Broadcast, PigPaxos,
EPaxos, Accord by Cassandra. The world of distributed consensus also
gets especially weird and interesting outside of OLTP systems.
But that's enough for one post.
Further reading