October 01, 2023
Intercepting and modifying Linux system calls with ptrace
How software fails is interesting. But real-world errors can be rare and slow to manifest. Fault injection is a formal-sounding term that just means: explicitly triggering errors in the hope of discovering bad logic, typically during automated tests.
Jepsen and Chaos Monkey are two famous examples that help to trigger process and network failure. But what about disk and filesystem errors?
A few avenues seem worth investigating:
- A custom FUSE filesystem
- An LD_PRELOAD interception layer
- A ptrace system call interception layer
- A SECCOMP_RET_TRAP interception layer
- Or, symbolic analysis a la Alice from the University of Wisconsin-Madison
I would like to try out FUSE sometime. But an LD_PRELOAD layer only works if IO goes through libc, which won't be the case for all programs. ptrace is something I've wanted to dig into for years, ever since learning about gVisor.
SECCOMP_RET_TRAP doesn't have the same high-level guides that ptrace does, so maybe I'll dig into it later. And symbolic analysis might be able to detect bad workloads, but it also isn't fault injection. Maybe it's the better idea, but fault injection just sounds more fun.
So this particular post will cover intercepting system calls (syscalls) using ptrace with code written in Zig. Not because readers will likely write their own code in Zig but because hopefully the Zig code will be easier for you to read and adapt to your language compared to if we had to deal with the verbosity and inconvenience of C.
In the end, we'll be able to intercept and force short (incomplete) writes in a Go, Python, and C program, emulating a disk that is having trouble completing the write. This case isn't common, but it should probably be handled with retries in production code.
This post corresponds roughly to this commit on GitHub.
A bad program
First off, let's write some code for a program that would exhibit a short write. Basically, we write to a file and don't check how many bytes we wrote. This is extremely common code; or at least I've written it often.
$ cat test/main.go
package main

import (
	"os"
)

func main() {
	f, err := os.OpenFile("test.txt", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755)
	if err != nil {
		panic(err)
	}

	text := "some great stuff"
	_, _ = f.Write([]byte(text))
	_ = f.Close()
}
With this code, if the Write() call doesn't actually succeed in writing everything, we won't know. And the file will contain less than all of some great stuff.
This logical mistake will happen rarely, if ever, on a normal disk. But it is possible.
Now that we've got an example program in mind, let's see if we can trigger the logic error.
ptrace
ptrace is a somewhat cross-platform layer that allows you to intercept syscalls in a process. You can read and modify memory and registers in the process, when the syscall starts and before it finishes.
gdb and strace both use ptrace for their magic.
Google's gVisor, which powers various serverless runtimes in Google Cloud, was also historically based on ptrace (PTRACE_SYSEMU specifically, which we won't explore much in this post).

Interestingly though, gVisor switched only this year (2023) to a different default backend for trapping system calls, based on SECCOMP_RET_TRAP.

You can get similar vibes from this Brendan Gregg post on the dangers of using strace (which is based on ptrace) in production.
Although ptrace is cross-platform, actually writing cross-platform-aware code with ptrace can be complex. So this post assumes amd64/linux.
Protocol
The ptrace protocol is described in the ptrace manpage, but Chris Wellons and a University of Alberta group also wrote nice introductions. I referenced these three pages heavily.
Here's what the UAlberta page has to say:
We fork and have the child call PTRACE_TRACEME. Then we handle each syscall entrance by calling PTRACE_SYSCALL and waiting with wait until the child has entered the syscall. It is in this moment we can mess with things.
Implementation
Let's turn that protocol into Zig code.
const std = @import("std");
const c = @cImport({
    @cInclude("sys/ptrace.h");
    @cInclude("sys/user.h");
    @cInclude("sys/wait.h");
    @cInclude("errno.h");
});

const cNullPtr: ?*anyopaque = null;

// TODO //

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    var args = try std.process.argsAlloc(arena.allocator());
    std.debug.assert(args.len >= 2);

    const pid = try std.os.fork();
    if (pid < 0) {
        std.debug.print("Fork failed!\n", .{});
        return;
    } else if (pid == 0) {
        // Child process
        _ = c.ptrace(c.PTRACE_TRACEME, pid, cNullPtr, cNullPtr);
        return std.process.execv(
            arena.allocator(),
            args[1..],
        );
    } else {
        // Parent process
        const childPid = pid;
        _ = c.waitpid(childPid, 0, 0);
        var cm = ChildManager{ .arena = &arena, .childPid = childPid };
        try cm.childInterceptSyscalls();
    }
}
So, as the protocol suggested, we fork and start a child process. That means this Zig program should be called like:
$ zig build-exe --library c main.zig
$ ./main /actual/program/to/intercept --and --its args
Presumably, as with strace or gdb, we could instead attach to an already running process with PTRACE_ATTACH or PTRACE_SEIZE (based on the ptrace manpage) rather than going the PTRACE_TRACEME route. But I haven't tried that out yet.
With the child ready to be intercepted, we can implement the ChildManager that actually does the interception.
ChildManager
The core of the ChildManager is an infinite loop (at least, as long as the child process lives) that waits for the next syscall and then calls a hook for the system call if one exists.
const ChildManager = struct {
    arena: *std.heap.ArenaAllocator,
    childPid: std.os.pid_t,

    // TODO //

    fn childInterceptSyscalls(
        cm: *ChildManager,
    ) !void {
        while (true) {
            // Handle syscall entrance
            const status = cm.childWaitForSyscall();
            if (std.os.W.IFEXITED(status)) {
                break;
            }

            var args: ABIArguments = cm.getABIArguments();
            const syscall = args.syscall();
            for (hooks) |hook| {
                if (syscall == hook.syscall) {
                    try hook.hook(cm.*, &args);
                }
            }
        }
    }
};
Later we'll write a hook for the sys_write syscall that will force an incomplete write.
Back to the protocol: childWaitForSyscall will call PTRACE_SYSCALL to allow the child process to start up again and continue until the next syscall. We'll follow that by wait-ing for the child process to be stopped again so we can handle the syscall entrance.
fn childWaitForSyscall(cm: ChildManager) u32 {
    var status: i32 = 0;
    _ = c.ptrace(c.PTRACE_SYSCALL, cm.childPid, cNullPtr, cNullPtr);
    _ = c.waitpid(cm.childPid, &status, 0);
    return @bitCast(status);
}
Now that we've intercepted a syscall (after waitpid finishes blocking), we need to figure out which syscall it was. We do this by calling PTRACE_GETREGS and reading the rax register, which according to the amd64/linux calling convention holds the syscall number.
Registers
PTRACE_GETREGS fills out the following struct:
struct user_regs_struct
{
  unsigned long r15;
  unsigned long r14;
  unsigned long r13;
  unsigned long r12;
  unsigned long rbp;
  unsigned long rbx;
  unsigned long r11;
  unsigned long r10;
  unsigned long r9;
  unsigned long r8;
  unsigned long rax;
  unsigned long rcx;
  unsigned long rdx;
  unsigned long rsi;
  unsigned long rdi;
  unsigned long orig_rax;
  unsigned long rip;
  unsigned long cs;
  unsigned long eflags;
  unsigned long rsp;
  unsigned long ss;
  unsigned long fs_base;
  unsigned long gs_base;
  unsigned long ds;
  unsigned long es;
  unsigned long fs;
  unsigned long gs;
};
Let's write a little amd64/linux-specific wrapper for accessing meaningful fields.
const ABIArguments = struct {
    regs: c.user_regs_struct,

    fn nth(aa: ABIArguments, i: u8) c_ulong {
        std.debug.assert(i < 4);
        return switch (i) {
            0 => aa.regs.rdi,
            1 => aa.regs.rsi,
            2 => aa.regs.rdx,
            else => unreachable,
        };
    }

    fn setNth(aa: *ABIArguments, i: u8, value: c_ulong) void {
        std.debug.assert(i < 4);
        switch (i) {
            0 => { aa.regs.rdi = value; },
            1 => { aa.regs.rsi = value; },
            2 => { aa.regs.rdx = value; },
            else => unreachable,
        }
    }

    fn result(aa: ABIArguments) c_ulong { return aa.regs.rax; }

    fn setResult(aa: *ABIArguments, value: c_ulong) void {
        aa.regs.rax = value;
    }

    fn syscall(aa: ABIArguments) c_ulong { return aa.regs.orig_rax; }
};
One thing to note is that the field we read to get rax is not aa.regs.rax but aa.regs.orig_rax. This is because rax is also the return value, and PTRACE_SYSCALL gets called twice for some syscalls, on entrance and exit. The orig_rax field preserves the original rax value on syscall entrance. You can read more about this here.
Getting and setting registers
Now let's write the ChildManager code that actually calls PTRACE_GETREGS to fill out one of these structs.
fn getABIArguments(cm: ChildManager) ABIArguments {
    var args = ABIArguments{ .regs = undefined };
    _ = c.ptrace(c.PTRACE_GETREGS, cm.childPid, cNullPtr, &args.regs);
    return args;
}
Setting registers is similar; we just pass the struct back and call PTRACE_SETREGS instead:
fn setABIArguments(cm: ChildManager, args: *ABIArguments) void {
    _ = c.ptrace(c.PTRACE_SETREGS, cm.childPid, cNullPtr, &args.regs);
}
A hook
Now we finally have enough code to write a hook that can get and set registers; i.e. manipulate a system call!
We'll start by registering a sys_write hook in the hooks field we check in childInterceptSyscalls above.
const hooks = &[_]struct {
    syscall: c_ulong,
    hook: *const fn (ChildManager, *ABIArguments) anyerror!void,
}{.{
    .syscall = @intFromEnum(std.os.linux.syscalls.X64.write),
    .hook = writeHandler,
}};
If we look at the manpage for write, we see it takes three arguments:
- The file descriptor (fd) to write to
- The address to start writing data from
- And the number of bytes to write
Going back to the calling convention, that means the fd will be in rdi, the data address in rsi, and the data length in rdx.
So if we shorten the data length, we should be creating a short (incomplete) write.
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
    const fd = entryArgs.nth(0);
    const dataAddress = entryArgs.nth(1);
    var dataLength = entryArgs.nth(2);

    // Truncate some bytes
    if (dataLength > 2) {
        dataLength -= 2;
        entryArgs.setNth(2, dataLength);
        cm.setABIArguments(entryArgs);
    }
}
In a more sophisticated version of this program, we could randomly decide when to truncate data and randomly decide how much data to truncate. However, for our purposes this is sufficient.
But there are some real problems with this code. When I ran this program against a basic Go program, I saw duplicate requests.
Ah ok, PTRACE_SYSCALL gets hit when you both enter and exit a syscall. So each time you call PTRACE_SYSCALL and you do stuff, you just call it again afterwards to handle/wait for the exit.
— Phil Eaton (@eatonphil) September 29, 2023
So the deal with PTRACE_SYSCALL is that for (most?) syscalls, you get to modify data before the data is actually handled by the kernel. And you get to modify data after the kernel has finished the syscall too.
This makes sense because PTRACE_SYSCALL (unlike PTRACE_SYSEMU) allows the syscall to actually happen. And if we wanted to, for example, modify the syscall exit code, we'd have to do that after the syscall was done, not before it started. We are modifying registers directly, after all.
All this means for our Zig code is that when we handle sys_write, we need to call PTRACE_SYSCALL again to process the syscall exit. Otherwise we'd reach this writeHandler for both entries and exits, which would require some additional way of disambiguating entrances from exits.
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
    const fd = entryArgs.nth(0);
    const dataAddress = entryArgs.nth(1);
    var dataLength = entryArgs.nth(2);

    // Truncate some bytes
    if (dataLength > 2) {
        dataLength -= 2;
        entryArgs.setNth(2, dataLength);
        cm.setABIArguments(entryArgs);
    }

    const data = try cm.childReadData(dataAddress, dataLength);
    defer data.deinit();
    std.debug.print("Got a write on {}: {s}\n", .{ fd, data.items });

    // Handle syscall exit
    _ = cm.childWaitForSyscall();
}
We could put the cm.childWaitForSyscall() waiting for the syscall exit in the main loop, and I did try that at first. However, not all syscalls seemed to have the same entry and exit hooks, and this resulted in the hooks sometimes starting with a syscall exit rather than a syscall entry. So rather than making the code more complicated, I decided to only wait for the exit on syscalls I knew had an exit (by observation, at least), like sys_write.
Multiple writes? No bad logic?
So I had this code as is, correctly handling syscall entrances and exits, but I was seeing multiple write calls. And the text file I was writing to had the complete text I wanted to write. There was no short write even though I truncated the data length.
Ok so what happens in this Go program if I truncate the amount of data? I assumed Go would do nothing since all I did was call `f.Write()` once and `f.Write()` returns a number of bytes written. But actually, it still writes everything!
— Phil Eaton (@eatonphil) September 29, 2023
This took some digging into Go source code to understand. If you trace what os.File.Write() does on Linux, you eventually get to src/internal/poll/fd_unix.go:
// Write implements io.Writer.
func (fd *FD) Write(p []byte) (int, error) {
	if err := fd.writeLock(); err != nil {
		return 0, err
	}
	defer fd.writeUnlock()
	if err := fd.pd.prepareWrite(fd.isFile); err != nil {
		return 0, err
	}
	var nn int
	for {
		max := len(p)
		if fd.IsStream && max-nn > maxRW {
			max = nn + maxRW
		}
		n, err := ignoringEINTRIO(syscall.Write, fd.Sysfd, p[nn:max])
		if n > 0 {
			nn += n
		}
		if nn == len(p) {
			return nn, err
		}
		if err == syscall.EAGAIN && fd.pd.pollable() {
			if err = fd.pd.waitWrite(fd.isFile); err == nil {
				continue
			}
		}
		if err != nil {
			return nn, err
		}
		if n == 0 {
			return nn, io.ErrUnexpectedEOF
		}
	}
}
This might be common knowledge but I didn't realize Go did this. And when I tried out the same basic program in Python and even C, the behavior was the same. The builtin write() behavior on a file (in many languages, apparently) is to retry until all data is written, with some exceptions.
This makes sense since files on disk, unlike file descriptors backed by network sockets, are generally always available. Compared to a network connection, disks are physically close and almost always stay connected. (With some obvious exceptions like network-attached storage and thumb drives.)
So to trigger the short write, the easiest way seems to be to have the sys_write call return an error that is NOT EAGAIN, since the code will retry on that error.
After looking through the list of errors that sys_write can return, EIO seems like a nice one.
So let's do our final version of writeHandler, and on the syscall exit we'll modify the return value (rax in amd64/linux) to be EIO.
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
    const fd = entryArgs.nth(0);
    const dataAddress = entryArgs.nth(1);
    var dataLength = entryArgs.nth(2);

    // Truncate some bytes
    if (dataLength > 2) {
        dataLength -= 2;
        entryArgs.setNth(2, dataLength);
        cm.setABIArguments(entryArgs);
    }

    // Handle syscall exit
    _ = cm.childWaitForSyscall();

    var exitArgs = cm.getABIArguments();
    dataLength = exitArgs.nth(2);
    if (dataLength > 2) {
        // Force the writes to stop after the first one by returning EIO.
        var result: c_ulong = 0;
        result = result -% c.EIO;
        exitArgs.setResult(result);
        cm.setABIArguments(&exitArgs);
    }
}
Let's give it a whirl!
All together
Build the Zig fault injector and the Go test code:
$ zig build-exe --library c main.zig
$ ( cd test && go build main.go )
And run:
$ ./main test/main
And check test.txt:
$ cat test.txt
some great stu
Hey, that's a short write! :)
Sidenote: Reading data from the child
We accomplished everything we set out to, but there's one other useful thing we can do: reading the actual data passed to the write syscall.
Just like we can get the child process registers with PTRACE_GETREGS, we can read child memory with PTRACE_PEEKDATA. PTRACE_PEEKDATA takes the child process id and the memory address in the child to read from. It returns a word of data (which on amd64/linux is 8 bytes).
We can use the syscall arguments (data address and length) to keep calling PTRACE_PEEKDATA on the child until we've read all the bytes the child process wanted to write:
fn childReadData(
    cm: ChildManager,
    address: c_ulong,
    length: c_ulong,
) !std.ArrayList(u8) {
    var data = std.ArrayList(u8).init(cm.arena.allocator());
    while (data.items.len < length) {
        var word = c.ptrace(
            c.PTRACE_PEEKDATA,
            cm.childPid,
            address + data.items.len,
            cNullPtr,
        );
        for (std.mem.asBytes(&word)) |byte| {
            if (data.items.len == length) {
                break;
            }
            try data.append(byte);
        }
    }
    return data;
}
And we could modify writeHandler to print out the entirety of the write message each time (for debugging):
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
    const fd = entryArgs.nth(0);
    const dataAddress = entryArgs.nth(1);
    var dataLength = entryArgs.nth(2);

    // Truncate some bytes
    if (dataLength > 2) {
        dataLength -= 2;
        entryArgs.setNth(2, dataLength);
        cm.setABIArguments(entryArgs);
    }

    const data = try cm.childReadData(dataAddress, dataLength);
    defer data.deinit();
    std.debug.print("Got a write on {}: {s}\n", .{ fd, data.items });

    // Handle syscall exit
    _ = cm.childWaitForSyscall();

    var exitArgs = cm.getABIArguments();
    dataLength = exitArgs.nth(2);
    if (dataLength > 2) {
        // Force the writes to stop after the first one by returning EIO.
        var result: c_ulong = 0;
        result = result -% c.EIO;
        exitArgs.setResult(result);
        cm.setABIArguments(&exitArgs);
    }
}
That's pretty neat!
Next steps
Short writes are just one of many bad IO interactions. Another fun one would be to completely buffer all writes on a file descriptor (not allowing anything to be written to disk at all) until fsync is called on the file descriptor. Or forcing fsyncs to fail.
An interesting optimization would be to apply seccomp filters so that rather than paying a penalty for watching every syscall, I only get notified about the ones I have hooks for, like sys_write. Here's another post that explores ptrace with seccomp filters.
Credits: Thank you Charlie Cummings and Paul Khuong for reviewing a draft of this post!
Selected responses after publication
- oscooter on Reddit gave some tips on using ptrace, including using process_vm_readv instead of PTRACE_PEEKDATA to read memory from the tracee process.
Fault injection is a scary-sounding term. Intercepting and modifying Linux system calls sounds scary too. But it's a neat way to trigger logical errors in programs, to build confidence we wrote code correctly. Let's trigger short writes to disk in Zig!
— Phil Eaton (@eatonphil) October 1, 2023
September 21, 2023
How do databases execute expressions?
Databases are fun. They sit at the confluence of Computer Science topics that might otherwise not seem practical in life as a developer. For example, every database with a query language is also a programming language implementation of some caliber. That doesn't include all databases though of course; see: RocksDB, FoundationDB, TigerBeetle, etc.
This post looks at how various databases execute expressions in their query language.
tldr; Most surveyed databases use a tree-walking interpreter. A few use stack- or register-based virtual machines. A couple have just-in-time compilers. And, tangentially, a few do vectorized interpretation.
Throughout this post I'll use "virtual machine" as a shorthand for stack- or register-based loops that process a linearized set of instructions. I say this since it is sometimes fair to call a tree-walking interpreter a virtual machine. But that is not what I mean when I say virtual machine in this post.
Stepping back
Programming languages are typically implemented by turning an Abstract Syntax Tree (AST) into a linear set of instructions for a virtual machine (e.g. CPython, Java, C#) or native code (e.g. GCC's C compiler, Go, Rust). Some of the former implementations also generate and run Just-In-Time (JIT) compiled native code (e.g. Java and C#).
Less common these days is for a programming language implementation to interpret directly off the AST or some other tree-like intermediate representation. This style is often called tree-walking.
Shell languages sometimes do tree-walking. Otherwise, implementations that interpret directly off of a tree normally do so as a short-term measure before switching to compiled virtual machine code or JIT-ed native code (e.g. some JavaScript implementations, GraalVM, RPython, etc.)
That is, while some major programming language implementations started out with tree-walking interpreters, they mostly moved away from solely tree-walking over a decade ago. See JSC in 2008, Ruby in 2007, etc.
My intuition is that tree-walking takes up more memory and is less cache-friendly than the linear instructions you give to a virtual machine or to your CPU. There are some folks who disagree, but they mostly talk about tree-walking when you've also got a JIT compiler hooked up. Which isn't quite the same thing. There has also been some early exploration and improvements reported when tree-walking with a tree organized as an array.
And databases?
Databases often interpret directly off a tree. (It isn't, generally speaking, fair to say they are AST-walking interpreters because databases typically transform and optimize beyond just an AST as parsed from user code.)
But not all databases interpret a tree. Some have a virtual machine. And some generate and run JIT-ed native code.
Methodology
If a core function (in the query execution path that does something like arithmetic or comparison) returns a value, that's a sign it's a tree-walking interpreter. Or, if you see code that is evaluating its arguments during execution, that's also a sign of a tree-walking interpreter.
On the other hand, if the function mutates internal state such as by assigning a value to a context or pushing to a stack, that's a sign it's a stack- or register-based virtual machine. If a function pulls its arguments from memory and doesn't evaluate the arguments, that's also an indication it's a stack- or register-based virtual machine.
This approach can result in false positives depending on the architecture of the interpreter. User-defined functions (UDFs) would probably accept evaluated arguments and return a value regardless of how the interpreter is implemented. So it's important to find not just functions that could be implemented like UDFs, but core builtin behavior. Control flow implementations of functions like if or case can be great places to look.
And tactically, I clone the source code and run stuff like git grep -i eval | grep -v test | grep \.java | grep -i eval or git grep -i expr | grep -v test | grep \.go | grep -i expr until I convince myself I'm somewhere interesting.
Note: In talking about a broad swath of projects, maybe I've misunderstood one or some. If you've got a correction, let me know! If there's a proprietary database you work on where you can link to the (publicly described) execution strategy, feel free to pass it along! Or if I'm missing your public-source database in this list, send me a message!
Survey
Cockroach (Ruling: Tree Walker)
Judging by functions like func (e *evaluator) EvalBinaryExpr, which evaluates the left-hand side and then the right-hand side and returns a value, Cockroach looks like a tree-walking interpreter.
It gets a little more interesting though, since Cockroach also supports vectorized expression execution. Vectorizing is a fancy term for acting on many pieces of data at once rather than one at a time. It doesn't necessarily imply SIMD. Here is an example of a vectorized addition of two int64 columns.
ClickHouse (Ruling: Tree Walker + JIT)
The ClickHouse architecture is a little unique and difficult for me to read through – likely due to it being fairly mature, with serious optimization. But they tend to document their header files well. So files like src/Functions/IFunction.h and src/Interpreters/ExpressionActions.h were helpful.
They have also spoken publicly about their pipeline execution model; e.g. this presentation and this roadmap issue. But it isn't completely clear how much pipeline execution (which is broader than just expression evaluation) connects to expression evaluation.
Moreover, they have publicly spoken about their support for JIT compilation for query execution. But let's look at how execution works when the JIT is not enabled. For example, if we take a look at how if is implemented, we know that the then and else rows must be conditionally evaluated.
In the runtime entrypoint, executeImpl, we see the function call executeShortCircuitArguments, which in turn calls ColumnFunction::reduce(), which evaluates each column vector that is an argument to the function and then calls execute on the function.
So from this we can tell the non-JIT execution is a tree walker and that it is over chunks of columns, i.e. vectorized data, similar to Cockroach. However in ClickHouse execution is always over column vectors.
In the original version of this post, I had some confusion about the ClickHouse execution strategy. Robert Schulze from ClickHouse helped clarify things for me. Thanks Robert!
DuckDB (Ruling: Tree Walker)
If we take a look at how function expressions are executed, we can see each argument in the function being evaluated before being passed to the actual function. So that looks like a tree walking interpreter.
Like ClickHouse, DuckDB expression execution is always over column vectors. You can read more about this architecture here and here.
Influx (Ruling: Tree Walker)
Influx originally had a SQL-like query language called InfluxQL. If we look at how it evaluates a binary expression, it first evaluates the left-hand side and then the right-hand side before operating on the sides and returning a value. That's a tree-walking interpreter.
Flux was the new query language for Influx. While the Flux docs suggest they transform to an intermediate representation that is executed on a virtual machine, there's nothing I'm seeing that looks like a stack- or register-based virtual machine. All the evaluation functions evaluate their arguments and return a value. That looks like a tree-walking interpreter to me.
Today Influx announced that Flux is in maintenance mode and they are focusing on InfluxQL again.
MariaDB / MySQL (Ruling: Tree Walker)
Control flow methods are normally a good way to see how an interpreter is implemented. The implementation of COALESCE looks pretty simple. We see it call val_str() for each argument to COALESCE. But I can only seem to find implementations of val_str() on raw values and not expressions. Item_func_coalesce itself does not implement val_str(), for example, which would be a strong indication of a tree walker. Maybe it does implement val_str() through inheritance.
It becomes a little clearer if we look at non-control flow methods like acos. In this method we see Item_func_acos itself implement val_real() and also call val_real() on all its arguments. In this case it's obvious how the control flow of acos(acos(.5)) would work. So that seems to indicate expressions are executed with a tree-walking interpreter.
I also noticed sql/sp_instr.cc. That is scary (in terms of invalidating my analysis) since it looks like a virtual machine. But after looking through it, I think this virtual machine only corresponds to how stored procedures are executed, hence the sp_ prefix for Stored Programs. MySQL docs also explain that stored procedures are executed with a bytecode virtual machine.
I'm curious why they don't use that virtual machine for query execution.
As far as I can tell MySQL and MariaDB do not differ in this regard.
MongoDB (Ruling: Virtual Machine)
Mongo recently introduced a virtual machine for executing queries, called Slot Based Execution (SBE). We can find the SBE code in src/mongo/db/exec/sbe/vm/vm.cpp and the main virtual machine entrypoint here. Looks like a classic stack-based virtual machine!
It isn't completely clear to me if the SBE path is always used or if there are still cases where it falls back to their old execution model. You can read more about Mongo execution here and here.
PostgreSQL (Ruling: Virtual Machine + JIT)
The top of PostgreSQL's src/backend/executor/execExprInterp.c clearly explains that expression execution uses a virtual machine. You see all the hallmarks: opcodes, a loop over a giant switch, etc. And if we look at how function expressions are executed, we see another hallmark which is that the function expression code doesn't evaluate its arguments. They've already been evaluated. And function expression code just acts on the results of its arguments.
PostgreSQL also supports JIT-ing expression execution. And we can find the switch between interpreting and JIT-compiling an expression here.
QuestDB (Ruling: Tree Walker + JIT)
QuestDB wrote about their execution engine recently. When the conditions are right, they'll switch over to a JIT compiler and run native code.
But let's look at the default path; for example, how AND is implemented. AndBooleanFunction implements BooleanFunction, which implements Function. An expression can be evaluated by calling a getX() method on the expression type that implements Function. AndBooleanFunction calls getBool() on its left and right hand sides. And if we look at the partial implementation of BooleanFunction, we'll also see it doing getX()-specific conversions during the call of getX(). So that's a tree-walking interpreter.
Scylla (Ruling: Tree Walker)
If we take a look at how functions are evaluated in Scylla, we see function evaluation first evaluating all of its arguments. And the function evaluation function itself returns a cql3::raw_value. So that's a tree-walking interpreter.
SQLite (Ruling: Virtual Machine)
SQLite's virtual machine is comprehensive and well-documented. It encompasses more than just expression evaluation but the entirety of query execution.
We can find the massive virtual machine switch in src/vdbe.c.
And if we look, for example, at how AND is implemented, we see it pulling its arguments out of memory (already evaluated) and assigning the result back to a designated point in memory.
SingleStore (Ruling: Virtual Machine + JIT)
While there's no source code to link to, SingleStore gave a talk at CMU that broke down their query execution pipeline. Their docs also cover the topic.
TiDB (Ruling: Tree Walker)
Similar to DuckDB and ClickHouse, TiDB implements vectorized interpretation. They've written publicly about their switch to this method.
Let's take a look at how if is implemented in TiDB. There is a vectorized and non-vectorized version of if (in expression/control_builtin.go and expression/control_builtin_generated.go respectively). So maybe they haven't completely switched over to vectorized execution, or maybe it can only be used under some conditions.

If we look at the non-vectorized version of if, we see the condition evaluated. And then the then or else is evaluated depending on the result of the condition. That's a tree-walking interpreter.
Conclusion
As the DuckDB team points out, vectorized interpretation or JIT compilation seem like the future for database expression execution. These strategies seem particularly important for analytics or time-series workloads. But vectorized interpretation seems to make the most sense for column-wise storage engines. And column-wise storage normally only makes sense for analytics workloads. Still, TiDB and Cockroach are transactional databases that also vectorize execution.
And while SQLite and PostgreSQL use the virtual machine model, it's possible databases with tree-walking interpreters like Scylla and MySQL/MariaDB have decided there are not significant enough gains to be had (for transactional workloads) to justify the complexity of moving to a compiler + virtual machine architecture.
Tree-walking interpreters and virtual machines are also independent from whether or not execution is vectorized. So that will be another interesting dimension to watch: whether more databases move toward vectorized execution even if they don't adopt JIT compilation.
Yet another alternative is that maybe as databases mature we'll see compilation tiers similar to what browsers do with JavaScript.
Credits: Thanks Max Bernstein, Alex Miller, and Justin Jaffray for reviewing a draft version of this! And thanks to the #dbs channel on Discord for instigating this post!
I spent some time looking into how various databases execute expressions in their query language. Most of them have a tree-walking interpreter, some have a virtual machine, and some do just-in-time compilation. Let's dig into some database code to see!
— Phil Eaton (@eatonphil) September 21, 2023