Bugs in distributed systems are hard to find, largely because systems
interact in chaotic ways. And even once you've found a bug, it can be
anywhere from simple to impossible to reproduce. It's about as far
away as you can get from the ideal test environment: property testing
a pure function.
But what if we could write our code in a way that lets us isolate the
chaotic aspects of our distributed system during testing: run
multiple systems communicating with each other on a single
thread and control all randomness in each system? And then property
test this single-threaded version of the distributed system with
controlled randomness, all the while injecting faults (a fancy term for
unhappy-path behavior like errors and latency) we might see in the
real world?
Crazy as it sounds, people actually do this. It's called Deterministic
Simulation Testing (DST). And it's become more and more popular with
startups like FoundationDB, Antithesis, TigerBeetle, Polar Signals,
and WarpStream, as well as folks like Tyler Neely and Pekka Enberg
who talk about and make use of the technique.
DST has become so popular to talk about in my corner of the world
that I worry it risks coming off as too magical and maybe a
little hyped. It's worth getting a better understanding of both the
benefits and the limitations.
Thank you to Alex Miller
and Will Wilson
for reviewing a version of this post.
Randomness and time
A big source of non-determinism in business logic is the use of random
numbers—in your code or your transitive dependencies or your language
runtime or your operating system.
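For instance, assuming a Python stack (an illustration on my part, not something this post depends on): string hashing is randomized per process, so even set iteration order can change from run to run unless PYTHONHASHSEED is pinned.

# hashes.py (illustrative)
print(hash("determinism"))    # a different value in each new Python process
print(list({"a", "b", "c"}))  # and set iteration order can differ with it
# Pin it with: PYTHONHASHSEED=0 python3 hashes.py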
Crucially, DST does not imply you can't have randomness! DST merely
assumes that you have a global seed for all randomness in your program
and that the simulator controls the seed. The seed may change across
runs of the simulator.
Once you observe a bad state as a result of running the simulation
with a random seed, you let the user pass that same seed in again. The
same seed recreates the entire program run that led to the observed
bad state, which makes debugging trivial.
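This works because seeded pseudo-randomness is repeatable. A tiny Python sketch of the property the whole technique leans on (simulate() here is just a stand-in for a program run):

# seeds.py (illustrative)
import random

def simulate(seed):
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

assert simulate(42) == simulate(42)  # same seed, same "random" run
assert simulate(42) != simulate(43)  # a different seed (almost certainly) diverges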
Another big source of non-determinism is being dependent on time. As
with randomness, DST does not mean you can't depend on time. DST means
you must be able to control the clock during the simulation.
To "control" randomness or time basically means you support dependency
injection, or the old-school alternative to dependency injection
called passing the dependency as an explicit parameter. Rather
than referring to a global clock or a global seed, you need to be able
to receive a clock or a seed from someone.
For example, we might separate the operation of an application into the
language's main() entrypoint and an actual application start()
entrypoint.
# app.pseudocode
def start(clock, seed):
  # Lots of business logic that might depend on time or do random things.

def main:
  clock = time.clock()
  seed = time.now()
  app.start(clock, seed)
The application entrypoint is where we must be able to
swap out a real clock or real random seed for one controlled by our
simulator:
# sim.pseudocode
import "app.pseudocode"

def main:
  sim_clock = make_sim_clock()
  sim_seed = os.env.DST_SEED or time.now()
  try:
    app.start(sim_clock, sim_seed)
  catch(e):
    print("Bad execution at seed: %s", sim_seed)
    throw e
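Here is the same split as a minimal runnable Python sketch. SimClock and the body of start() are illustrative; the only point is that the clock and the seed come in as parameters, so the simulator, not the program, decides what they are.

# sim.py (a sketch, not a literal translation of the pseudocode above)
import os
import random
import time


class SimClock:
    """Virtual time: only the simulator advances it."""

    def __init__(self):
        self.now_ms = 0

    def now(self):
        return self.now_ms

    def sleep(self, ms):
        self.now_ms += ms


def start(clock, seed):
    # Stand-in for business logic that depends on time and randomness.
    rng = random.Random(seed)
    clock.sleep(rng.randint(1, 100))
    return clock.now()


if __name__ == "__main__":
    # DST_SEED replays a previous run; otherwise pick a fresh seed.
    seed = int(os.environ.get("DST_SEED", time.time_ns()))
    try:
        start(SimClock(), seed)
    except Exception:
        print(f"Bad execution at seed: {seed}")
        raise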
Let's look at another example.
Converting an existing function
Let's say that we had a helper method that kept calling a function
until it succeeded, with backoff.
# retry.pseudocode
class Backoff:
  def init:
    this.rnd = rnd.new(seed = time.now())
    this.tries = 0

  async def retry_backoff(f):
    while this.tries < 3:
      if f():
        return
      await time.sleep(this.rnd.gen())
      this.tries++
The source of nondeterminism here is the clock: we use it to generate
a seed and we use it to sleep. We could parameterize the seed alone,
but since we want to call time.sleep() and since in DST we control
time, we can just parameterize time.
# retry.pseudocode
class Backoff:
  def init(this, time):
    this.time = time
    this.rnd = rnd.new(seed = this.time.now())
    this.tries = 0

  async def retry_backoff(this, f):
    while this.tries < 3:
      if f():
        return
      await this.time.sleep(this.rnd.gen())
      this.tries++
Now we can write a little simulator to test this:
# sim.pseudocode
import "retry.pseudocode"

seed = int(os.env.DST_SEED) if os.env.DST_SEED else time.now()
rnd = rnd.new(seed)

sim_time = {
  now: 0
  # In the simulator, this future resolves as tick() advances virtual time.
  sleep: (ms) => { await future.wait(ms) }
  tick: (ms) => now += ms
}

while true:
  backoff = Backoff(sim_time)
  failures = 0
  f = () => {
    if rnd.rand() > 0.5:
      failures++
      return false
    return true
  }

  try:
    promise = backoff.retry_backoff(f)
    while sim_time.now < 60min:
      sim_time.tick(1ms)
      if promise.resolved():
        break
    assert_expect_failure_and_expected_time_elapse(sim_time, failures)
  catch(e):
    print("Found logical error with seed: %d", seed)
    throw e
This demonstrates a few critical aspects of DST. First, the simulator
itself depends on randomness, but it allows the user to provide a seed
so they can replay a simulation that discovers a bug. The controlled
randomness in the simulator is what lets us do property testing.
Second, the simulation workload must be written by the user. Even when
you've got a platform like Antithesis that gives you an environment
for DST, it's up to you to exercise the application.
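Before moving on, here is roughly the same retry-with-backoff simulation as a runnable, deliberately synchronous Python sketch. SimClock, the 50% failure rate, and the invariant being asserted are all illustrative choices of mine.

# sim_backoff.py (illustrative)
import os
import random
import time


class SimClock:
    def __init__(self):
        self.now_ms = 0

    def sleep(self, ms):
        # In the simulator, sleeping just advances virtual time.
        self.now_ms += ms


class Backoff:
    def __init__(self, clock, rng, max_tries=3):
        self.clock = clock
        self.rng = rng
        self.max_tries = max_tries

    def retry(self, f):
        for _ in range(self.max_tries):
            if f():
                return True
            # Seeded, so the backoff delays replay identically for a given seed.
            self.clock.sleep(self.rng.randint(1, 1000))
        return False


if __name__ == "__main__":
    seed = int(os.environ.get("DST_SEED", time.time_ns()))
    rng = random.Random(seed)
    for _ in range(10_000):
        clock = SimClock()
        # Workload: an operation that fails about half the time.
        Backoff(clock, rng).retry(lambda: rng.random() > 0.5)
        try:
            # Invariant: with 3 tries and delays of at most 1000ms,
            # we never sleep more than 3000ms of virtual time.
            assert clock.now_ms <= 3 * 1000
        except AssertionError:
            print(f"Found logical error with seed: {seed}")
            raise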
Now let's get a little more complex.
A single thread and asynchronous IO
The determinism of multiple threads can only be controlled at the
operating system or emulator or hypervisor layer. Realistically, that
would require third-party systems like Antithesis or
Hermit (which, don't
get excited, is not actively developed and hasn't worked on any
interesting program of mine) or rr.
These systems transparently transform multi-threaded code into
single-threaded code. But also note that Hermit and rr have only
limited ability to do fault injection, which, in addition to
deterministic execution, is a goal of ours. And you can't run them on
a Mac. And you can't run them on ARM.
But we can, and would like to, write a simulator without writing a new
operating system or emulator or hypervisor, and without a third-party
system. So we must limit ourselves to writing code that can be
collapsed into a single thread. Significantly, since using blocking IO
would mean an entire class of concurrency bugs could not be discovered
while running the simulator in a single thread, we must limit
ourselves to asynchronous IO.
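To make the "collapse onto a single thread" idea concrete, here is a toy Python sketch (of the idea, not of any real runtime): tasks are generators that yield wherever they would otherwise block on IO, and a seeded RNG picks which task runs next, so a given seed always reproduces the same interleaving.

# scheduler.py (illustrative)
import random


def run(tasks, seed):
    rng = random.Random(seed)
    tasks = list(tasks)
    while tasks:
        # Deterministically "random" scheduling: same seed, same interleaving.
        task = rng.choice(tasks)
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)


def worker(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield  # Give control back to the scheduler, as an await point would.


if __name__ == "__main__":
    run([worker("a", 3), worker("b", 2)], seed=42)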
Single-threaded and asynchronous IO. These are already two big limitations.
Some languages like Go are entirely built around transparent
multi-threading and blocking IO. Polar Signals
solved
this for DST by compiling their application to WASM where it would run
on a single thread. But that wasn't enough. Even on a single thread,
the Go runtime intentionally schedules goroutines randomly. So Polar
Signals forked the Go runtime to control this randomness with an
environment variable. That's kind of crazy. Resonate took another
approach
that also looks cumbersome. I'm not going to attempt to describe
it. Go seems like a difficult choice of language if you want to do
DST.
Like Go, Rust has no built-in async IO. The most mature async IO
library is tokio. The tokio folks attempted to provide a
tokio-compatible simulator
implementation with all sources of nondeterminism removed. From what I
can tell, they did not at any point fully
succeed. That repo
has now been replaced with a "this is very experimental" tokio-rs
project called turmoil that
provides deterministic execution plus network fault injection. (But
not disk fault injection. More on that later.) It isn't surprising
that it is difficult to provide deterministic execution for an IO
library that was not designed for it. tokio is a large project with
many transitive dependencies. They must all be combed for
non-determinism.
On the other hand, Pekka has already
demonstrated
for us how we might build a simpler Rust async IO library that is
designed to be simulation tested. He modeled this on the TigerBeetle
design King and I
wrote
about two years ago.
So let's sketch out a program that does buggy IO and let's look at how
we can apply DST to it.
# readfile.pseudocode
def read_file(io, name, into_buffer):
  f = await io.open(name)
  read_buffer = [4096]u8{}
  while true:
    err, n_read = await f.read(&read_buffer)
    if err == io.EOF:
      into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
      return
    if err:
      throw err
    into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
In our simulator, we will provide a mocked-out IO system and we will
randomly inject faults (here, partial reads) while asserting pre- and
post-conditions.
# sim.pseudocode
import "readfile.pseudocode"

seed = int(os.env.DST_SEED) if os.env.DST_SEED else time.now()
rnd = rnd.new(seed)

while true:
  sim_disk_data = rnd.rand_bytes(10MB)
  sim_fd = {
    pos: 0
    read: (fd, buf) => {
      # Inject a random partial read, never past the end of the "disk".
      partial_read = rnd.rand_in_range_inclusive(0, min(sizeof(buf), sizeof(sim_disk_data) - fd.pos))
      memcpy(sim_disk_data, buf, fd.pos, partial_read)
      fd.pos += partial_read
      if fd.pos == sizeof(sim_disk_data):
        return sim_io.EOF, partial_read
      return null, partial_read
    }
  }
  sim_io = {
    EOF: Error("eof")
    open: (filename) => sim_fd
  }

  out_buf = Vector<u8>.new()
  try:
    await read_file(sim_io, "somefile", out_buf)
    assert_bytes_equal(out_buf.data, sim_disk_data)
  catch (e):
    print("Found logical error with seed: %d", seed)
    throw e
And with this simulator we would have eventually caught our partial
read bug! In our original program when we wrote:
into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
We should have written:
into_buffer.copy_maybe_allocate(read_buffer[0:n_read])
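For comparison, here is the same class of bug and the same style of simulator as a small runnable Python sketch. SimFile, read_file_buggy, and the 1MB simulated disk are all illustrative names and sizes of mine.

# sim_readfile.py (illustrative)
import os
import random
import time


class SimFile:
    """Simulated file whose reads are randomly partial, like real reads can be."""

    def __init__(self, data, rng):
        self.data = data
        self.rng = rng
        self.pos = 0

    def readinto(self, buf):
        # Return anywhere from 0 to len(buf) bytes, never past the end of the "disk".
        n = self.rng.randint(0, min(len(buf), len(self.data) - self.pos))
        buf[:n] = self.data[self.pos:self.pos + n]
        self.pos += n
        return n, self.pos >= len(self.data)


def read_file_buggy(f):
    out = bytearray()
    buf = bytearray(4096)
    while True:
        n, eof = f.readinto(buf)
        out += buf  # Bug: copies the whole buffer instead of buf[:n].
        if eof:
            return bytes(out)


if __name__ == "__main__":
    seed = int(os.environ.get("DST_SEED", time.time_ns()))
    rng = random.Random(seed)
    disk = rng.randbytes(1 << 20)  # 1MB of random simulated disk contents.
    result = read_file_buggy(SimFile(disk, rng))
    try:
        assert result == disk
    except AssertionError:
        print(f"Found logical error with seed: {seed}")
        raise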
Great! Let's get a little more complex.
A distributed system
I already mentioned in the beginning that the gist of deterministic
simulation testing a distributed system is that you get all of the
nodes in the system to run in the same process. This would be
basically impossible if you wanted to test a system that involved your
application plus Kafka plus Postgres plus Redis. But if your system is
a self-contained distributed system, such as one that embeds a Raft
library for high availability of your application, you can actually
run multiple nodes in the same process!
For a system like this, our simulator might look like:
# sim.pseudocode
import "distsys-node.pseudocode"

seed = int(os.env.DST_SEED) if os.env.DST_SEED else time.now()
rnd = rnd.new(seed)

while true:
  sim_fd = {
    send: (fd, buf) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad write')
      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())
      n_written = assert_ok(os.fd.write(buf))
      return n_written
    },
    recv: (fd, buf) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad read')
      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())
      return os.fd.read(buf)
    }
  }
  sim_io = {
    open: (filename) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad open')
      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())
      return sim_fd
    }
  }

  all_ports = [6000, 6001, 6002]
  nodes = [
    await distsys-node.start(sim_io, all_ports[0], all_ports),
    await distsys-node.start(sim_io, all_ports[1], all_ports),
    await distsys-node.start(sim_io, all_ports[2], all_ports),
  ]

  history = []
  try:
    key = rnd.rand_bytes(10)
    value = rnd.rand_bytes(10)
    nodes[rnd.rand_in_range_inclusive(0, len(nodes) - 1)].insert(key, value)
    history.add((key, value))
    assert_valid_history(nodes, history)

    # Crash a process every so often.
    if rnd.rand() > 0.75:
      node = nodes[rnd.rand_in_range_inclusive(0, len(nodes) - 1)]
      node.restart()
  catch (e):
    print("Found logical error with seed: %d", seed)
    throw e
I'm completely hand-waving here to demonstrate the broader point and
not any specific testing strategy for a specific distributed
system. The important points are that these three nodes run in the
same process, on different ports.
We control disk IO. We control network IO. We control how time
elapses. We run a deterministic simulated workload against the three
node system while injecting disk, network, and process faults.
And we are constantly checking for an invalid state. When we get the
invalid state, we can be sure the user can easily recreate this
invalid state.
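As a concrete (if simplified) illustration of that fault injection, here is a Python sketch of the kind of wrapper a simulator could hand to each node. FaultyNetwork, the probabilities, and the clock interface are all illustrative, not any real library's API.

# faulty_network.py (illustrative)
import random


class SimClock:
    def __init__(self):
        self.now_ms = 0

    def sleep(self, ms):
        self.now_ms += ms


class FaultyNetwork:
    """Wraps a send callable so the simulator can fail or delay it, seeded."""

    def __init__(self, send, clock, rng, fail_p=0.05, max_delay_ms=500):
        self.send = send
        self.clock = clock
        self.rng = rng
        self.fail_p = fail_p
        self.max_delay_ms = max_delay_ms

    def send_with_faults(self, buf):
        # Inject a random failure.
        if self.rng.random() < self.fail_p:
            raise ConnectionError("injected: bad write")
        # Inject random latency, on simulated time.
        if self.rng.random() < self.fail_p:
            self.clock.sleep(self.rng.randint(1, self.max_delay_ms))
        return self.send(buf)


if __name__ == "__main__":
    rng = random.Random(7)
    net = FaultyNetwork(send=lambda buf: len(buf), clock=SimClock(), rng=rng, fail_p=0.5)
    for _ in range(5):
        try:
            net.send_with_faults(b"hello")
        except ConnectionError as e:
            print(e)

The failure probability and latency bound here are exactly the kind of knobs you can crank up for dedicated high-fault runs, which Consideration 3 below comes back to.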
Other sources of non-determinism
Within some error margin, most CPU instructions and most CPU behavior are
considered to be deterministic. There are, however, certain CPU
instructions that are definitely
not. Unfortunately
that might
include
system calls. It might also
include malloc. There is very
little to trust.
If we ignore
Antithesis, people doing DST seem not to worry about these smaller
bits of nondeterminism. Yet it's generally agreed that DST is still
worthwhile. The intuition here is that every bit of
non-determinism you can eliminate makes it that much easier to
reproduce bugs when you find them.
Put another way: determinism, even among DST practitioners, remains a spectrum.
Considerations
As you may have noticed already from some of the pseudocode, DST is not a panacea.
Consideration 1: Edges
First, because you must swap out non-deterministic parts of your code,
you are not actually testing the entirety of your code. You are
certainly encouraged to keep the deterministic kernel large. But there
will always be the non-deterministic edges.
Without a system like Antithesis which gives you an entire
deterministic machine, you can't test your whole program.
But even with Antithesis you cannot test the integration between your
system and external systems. You must mock out the external systems.
It's also worth noting that there are many layers at which you could
inject simulation. You could do it at a high-level RPC and storage
layer. This would be simpler and easier to understand. But then you'd
omit testing the handling of lower-level errors.
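Sketched as two possible seams in Python (the names are illustrative): a high-level seam is easy to simulate but skips low-level failure modes; a low-level seam is more work but lets the simulator exercise them.

# seams.py (illustrative)
from typing import Protocol


class KVStore(Protocol):
    """High-level seam: simple to simulate, but low-level IO errors
    (partial reads, torn writes) never get exercised."""

    def get(self, key: bytes) -> bytes: ...
    def put(self, key: bytes, value: bytes) -> None: ...


class File(Protocol):
    """Low-level seam: more work to simulate, but the simulator can inject
    the full range of disk behavior your code must handle."""

    def readinto(self, buf: bytearray) -> int: ...
    def write(self, buf: bytes) -> int: ...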
Consideration 2: Your workload(s)
DST is as dependent on the creativity and thoroughness of your
workload as any other type of test or benchmark.
Just as you wouldn't depend on one single benchmark to qualify your
application, you may not want to depend on a single simulated
workload.
Or as Will Wilson put it for me:
The biggest challenge of DST in my experience is that tuning all the
random distributions, the parameters of your system, the workload,
the fault injection, etc. so that it produces interesting behavior
is very challenging and very labor intensive. As with fuzzing or
PBT, it's terrifyingly easy to build a DST system that appears to be
doing a ton of testing, but actually never explores very much of the
state space of your system. At FoundationDB, the vast majority of
the work we put into the simulator was an iterative process of
hunting for what wasn't being covered by our tests and then figuring
out how to make the tests better. This process often resembles
science more than it does engineering.
Unfortunately, unlike with fuzzing, mere branch coverage in your
code is usually a pretty poor signal for the kinds of systems you
want to test with DST. At Antithesis we handle this with Sometimes
assertions, at FDB we did something pretty similar, and I assume
TigerBeetle and others have their own version of this. But of course
the ultimate figure of merit is whether your DST system is finding
100% of your bugs. It's quite difficult to get to the point that it
does. The truly ambitious part of Antithesis isn't the hypervisor,
but the fact that we also aim to solve the much harder "is my DST
working?" problem with minimal human guidance or supervision.
Consideration 3: Your knowledge of what you mocked
When you mock out the behavior of disk or network IO, the benefits of
DST are tied to your understanding of the spectrum of behavior that
may happen in the real world.
What are all possible error conditions? What are the extreme latency
bounds of the original method? What about corruption or misdirected
IO?
The flipside here is that only in deterministic simulation testing can
you make these crazy scenarios happen with configurable
regularity. You can kick off a set of runs that have especially
high IO latency or especially high corrupt reads/writes. Joran and I
wrote
a year ago about how the TigerBeetle simulator does exactly this.
Consideration 4: Non-reproducible seeds as code changes
Critically, the reproducibility of DST only helps so long as your code
doesn't change. As soon as your code changes, the seed may no longer
even get you to the state where the bug was exhibited. So the real
value of that reproducibility is that it helps you convert the seeded
simulation run into an integration test that describes the precise
scenario, and that test stays useful even as the code changes.
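One way to act on this, sketched in Python with illustrative names: once a seed exposes a bug, write the scenario it produced down as an explicit test that no longer depends on the seed or on the simulator's RNG at all.

# test_partial_read.py (illustrative)
class ScriptedFile:
    """Replays a fixed sequence of partial reads; no randomness involved."""

    def __init__(self, reads):
        self.reads = list(reads)

    def readinto(self, buf):
        chunk = self.reads.pop(0)
        buf[:len(chunk)] = chunk
        return len(chunk), not self.reads


def read_file_fixed(f):
    out = bytearray()
    buf = bytearray(4096)
    while True:
        n, eof = f.readinto(buf)
        out += buf[:n]  # The fix: copy only the bytes actually read.
        if eof:
            return bytes(out)


def test_partial_reads_are_not_padded():
    # The precise interleaving a simulator seed once found, written down
    # explicitly so it keeps testing the same scenario as the code evolves.
    assert read_file_fixed(ScriptedFile([b"abc", b"de"])) == b"abcde"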
Consideration 5: Time and compute
Because of Consideration 4, you need to keep rerunning the simulator,
not just to keep finding new seeds and new histories, but because
existing seeds and their histories may stop being meaningful every
time you change the code.
What about Jepsen?
Jepsen does limited process and network fault injection while testing
for linearizability. It's a fantastic project.
However, it represents only a subset of what is possible with
Deterministic Simulation Testing (if you actually put in the effort
described above to get there).
But even more importantly, Jepsen has nothing to do with deterministic
execution. If Jepsen finds a bug and your system can't do
deterministic execution, you may or may not be able to reproduce that
Jepsen bug.
Here's another Will Wilson
quote for you
on Jepsen and FoundationDB:
Anyway, we did [Deterministic Simulation Testing] for a while and
found all of the bugs in the database. I know, I know, that’s an
insane thing to say. It’s kind of true though. In the entire history
of the company, I think we only ever had one or two bugs reported by
a customer. Ever. Kyle Kingsbury aka “aphyr” didn’t even bother
testing it with Jepsen, because he didn’t think he’d find anything.
Conclusion
The degree to which you can place faith in DST alone, and not time
spent in production, has limits. However, it certainly does no harm to
employ DST. And, barring the considerations described above, it will
likely make the kernel of your product significantly more
stable. Furthermore, everyone who uses DST knows about these
considerations. But I think it's worthwhile to list them out to help
folks who do not know DST to build an intuition for what it's
excellent at.
Further reading: