a curated list of database news from authoritative sources

November 30, 2020

Django with Vitess

Django is a popular framework for Python application developers. It includes packages which make tasks like authorization and content administration easier. Django supports a number of databases, including MySQL, which makes it possible to run a Django application over Vitess without having to change the application code. Let’s take a look at how to combine the strengths of these two open source frameworks. We built this example using the Vitess operator. You can see the details of the implementation in the blog post Vitess Operator for Kubernetes.

November 29, 2020

Moving to Hugo

Some personal notes to remember the migration effort from Pelican to Hugo

November 26, 2020

Emulating linux/AMD64 userland: interpreting an ELF binary

In this post we'll stumble toward a working emulator for a barebones C program compiled for linux/AMD64. The approach will be based slightly more on observation than on following a spec; a great way to quickly become familiar with a topic, and a bad way to guarantee correctness.

The goal:

$ cat tests/simple.c
int main() {
  return 4;
}
$ gcc tests/simple.c
$ go build -o main
$ ./main a.out && echo $?
4

This may look ridiculously simple but when you don't know how to deal with a binary or how instructions are encoded, it will take a few hours to write an emulator that can generally handle this program!

Code for this project is available on GitHub.

Background

AMD64, x86_64 or x64 are different names for AMD's widely adopted 64-bit extension to Intel's x86 instruction set (i.e. the encoding and semantics of x86 binaries). AMD64 is a superset of x86 (introducing 64-bit registers and operations) and thus backwards compatible with x86 programs.

A year and a half ago I first got into emulation with an AMD64 emulator in JavaScript. The JavaScript emulator interpreted the textual representation of AMD64 programs (e.g. MOV RBP, RSP, Intel's assembly syntax). A C program had to be compiled with -S to produce an assembly file that the JavaScript emulator could read (i.e. gcc -S tests/simple.c). This was a great way to get started with emulation by ignoring the complexity of encoded instructions and executable formats.

If we dig into the binary file produced by gcc on Linux we learn that it is an ELF file.

$ gcc tests/simple.c
$ file a.out
a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=d0b5c742b9fbcbcca4dfa9438a8437a8478a51bb, for GNU/Linux 3.2.0, not stripped

ELF is responsible for surrounding the actual binary-encoded program instructions with metadata on exported and imported C identifiers and the program entrypoint. But for a simple program like the one this initial emulator targets, we can ignore exports/imports. We'll only use the ELF metadata to find out where the instructions for our main function start.

Where is main?

If we use an ELF reader+disassembler on the binary generated by gcc and search for main we can find its address.

$ objdump -d a.out | grep -A10 '<main>'
0000000000401106 <main>:
  401106:       55                      push   %rbp
  401107:       48 89 e5                mov    %rsp,%rbp
  40110a:       b8 fe 00 00 00          mov    $0xfe,%eax
  40110f:       5d                      pop    %rbp
  401110:       c3                      retq
  401111:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  401118:       00 00 00
  40111b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000401120 <__libc_csu_init>:

This means that the function, main, starts at address 0x401106 in memory. Furthermore, this implies that the binary must be loaded into CPU memory such that the CPU can jump here to execute our program.

In truth, main is not this program's entrypoint. If we run objdump -x a.out we can see that the ELF entrypoint is 0x401020.

$ objdump -x a.out
a.out:     file format elf64-x86-64
a.out
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000401020

Program Header:
    PHDR off    0x0000000000000040 vaddr 0x0000000000400040 paddr 0x0000000000400040 align 2**3
             filesz 0x00000000000002d8 memsz 0x00000000000002d8 flags r--

This is because the actual entrypoint gcc sets up is a function called _start. The libc prelude beginning with _start is responsible for initializing the libc runtime, calling our main function and executing the exit syscall with the return value of main.

$ objdump -d a.out | grep -A10 '<_start>'
0000000000401020 <_start>:
  401020:       f3 0f 1e fa             endbr64
  401024:       31 ed                   xor    %ebp,%ebp
  401026:       49 89 d1                mov    %rdx,%r9
  401029:       5e                      pop    %rsi
  40102a:       48 89 e2                mov    %rsp,%rdx
  40102d:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
  401031:       50                      push   %rax
  401032:       54                      push   %rsp
  401033:       49 c7 c0 90 11 40 00    mov    $0x401190,%r8
  40103a:       48 c7 c1 20 11 40 00    mov    $0x401120,%rcx

But because all this libc initialization is relatively complicated we're just going to skip the actual ELF entrypoint for now. Our emulator will locate main, load the binary into memory, jump to the start of main, and set the exit code of the emulator to the result of main.

As you can see, this ELF binary has its own hard-coded view of where it will be in memory. What if our CPU were to run multiple processes at once? We might give each process its own virtual memory space and map it back to real memory so that each process (and, by extension, each compiler) doesn't have to think about how it fits into memory relative to other processes.

The last question to figure out is where to load the ELF binary into emulator memory so that addresses in memory are where the program expects.

As it turns out, there are pieces of metadata called section headers, each containing an address and an offset from the start of the ELF file. By subtracting the offset from the address we can get the location where the file expects to be loaded in memory.

$ objdump -x a.out
a.out:     file format elf64-x86-64
a.out
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000401020

Program Header:
    PHDR off    0x0000000000000040 vaddr 0x0000000000400040 paddr 0x0000000000400040 align 2**3
             filesz 0x00000000000002d8 memsz 0x00000000000002d8 flags r--

That is: 0x400040 (vaddr) - 0x40 (off) = 0x400000. Judging from a Google search this seems to be a pretty common address where ELF binaries are loaded into memory.
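
If you'd like to confirm that arithmetic programmatically, here is a minimal standalone sketch using Go's built-in debug/elf package (the same package we'll lean on below). It looks at program headers (the Progs field) rather than the section headers we use later, and the file name loadbase.go is just for illustration:

package main

import (
    "debug/elf"
    "fmt"
    "log"
    "os"
)

func main() {
    f, err := elf.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // The first loadable segment tells us where the binary expects
    // to be mapped: vaddr - off.
    for _, prog := range f.Progs {
        if prog.Type == elf.PT_LOAD {
            fmt.Printf("load base: 0x%x\n", prog.Vaddr-prog.Off)
            break
        }
    }
}

For our a.out this should print load base: 0x400000, matching the subtraction above.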

ELF and Go

Binary file formats tend to be a pain to work with because, to enable greater compression, everything ends up being a pointer to something else. So you end up jumping all around the file just to stitch information back together.

So the one third-party-ish library we'll use is Go's builtin debug/elf package. With this library we can load an ELF binary and iterate over symbols and sections to discover the location of main and the start address for the binary in memory.

Editing main.go:

package main

import (
    "bytes"
    "debug/elf"
    "fmt"
    "io/ioutil"
    "log"
    "os"
)

type process struct {
    startAddress uint64
    entryPoint   uint64
    bin          []byte
}

func readELF(filename, entrySymbol string) (*process, error) {
    bin, err := ioutil.ReadFile(filename)
    if err != nil {
        return nil, err
    }

    elffile, err := elf.NewFile(bytes.NewReader(bin))
    if err != nil {
        return nil, err
    }

    symbols, err := elffile.Symbols()
    if err != nil {
        return nil, err
    }

    var entryPoint uint64
    for _, sym := range symbols {
        if sym.Name == entrySymbol && elf.STT_FUNC == elf.ST_TYPE(sym.Info) && elf.STB_GLOBAL == elf.ST_BIND(sym.Info) {
            entryPoint = sym.Value
        }
    }

    if entryPoint == 0 {
        return nil, fmt.Errorf("Could not find entrypoint symbol: %s", entrySymbol)
    }

    var startAddress uint64
    for _, sec := range elffile.Sections {
        if sec.Type != elf.SHT_NULL {
            startAddress = sec.Addr - sec.Offset
            break
        }
    }

    if startAddress == 0 {
        return nil, fmt.Errorf("Could not determine start address")
    }

    return &process{
        startAddress: startAddress,
        entryPoint:   entryPoint,
        bin:          bin,
    }, nil
}

func main() {
    if len(os.Args) < 2 {
        log.Fatal("Binary not provided")
    }

    proc, err := readELF(os.Args[1], "main")
    if err != nil {
        panic(err)
    }

    fmt.Printf("Start: 0x%x\nEntry: 0x%x\n", proc.startAddress, proc.entryPoint)
}

We can test on a basic compiled C program:

$ cat tests/simple.c
int main() {
  return 4;
}
$ gcc tests/simple.c
$ go build -o main
$ ./main a.out
Start: 0x400000
Entry: 0x401106

And verify against objdump:

$ objdump -d a.out | grep -A10 '<main>'
0000000000401106 <main>:
  401106:       55                      push   %rbp

And that's it for dealing with ELF. Now we can sketch out a virtual CPU and how we deal with interpreting instructions starting at this address.

The CPU

AMD64 counts on being able to store values in registers and memory, sometimes through direct addressing and sometimes indirectly using stack operations (push and pop). And userland processes count on being loaded into CPU memory so the CPU can jump to the process entrypoint and start executing.

type cpu struct {
    proc    *process
    mem     []byte
    regfile *registerFile
    tick    chan bool
}

func newCPU(memory uint64) cpu {
    return cpu{
        mem:     make([]byte, memory),
        regfile: &registerFile{},
        tick:    make(chan bool, 1),
    }
}

The tick channel is there so that later on we can wrap the emulator in a terminal debugger. By default we'll run the CPU in a goroutine and just tick forever from main.

func repl(c *cpu) {
  // TODO
}

func main() {
    if len(os.Args) < 2 {
        log.Fatal("Binary not provided")
    }

    proc, err := readELF(os.Args[1], "main")
    if err != nil {
        panic(err)
    }

    debug := false
    for _, arg := range os.Args[1:] {
        switch arg {
        case "--debug":
            fallthrough
        case "-d":
            debug = true
        }
    }

    // 10 MB
    cpu := newCPU(0x400000 * 10)

    go cpu.run(proc)
    if debug {
        repl(&cpu)
    } else {
        for {
            cpu.tick <- true
        }
    }
}

Registers

To emulate a simple program like our tests/simple.c, we'll only need to support a few common registers. The order is important so that we can use the Go identifiers when we want to refer to the encoded integer value of the register.

type register int

const (
    // These are in order of encoding value (i.e. rbp is 5)
    rax register = iota
    rcx
    rdx
    rbx
    rsp
    rbp
    rsi
    rdi
    r8
    r9
    r10
    r11
    r12
    r13
    r14
    r15
    rip
    rflags
)

var registerMap = map[register]string{
    rax:    "rax",
    rcx:    "rcx",
    rdx:    "rdx",
    rbx:    "rbx",
    rsp:    "rsp",
    rbp:    "rbp",
    rsi:    "rsi",
    rdi:    "rdi",
    r8:     "r8",
    r9:     "r9",
    r10:    "r10",
    r11:    "r11",
    r12:    "r12",
    r13:    "r13",
    r14:    "r14",
    r15:    "r15",
    rip:    "rip",
    rflags: "rflags",
}

type registerFile [18]uint64

func (regfile *registerFile) get(r register) uint64 {
    return regfile[r]
}

func (regfile *registerFile) set(r register, v uint64) {
    regfile[r] = v
}

Of immediate importance will be the rip, rsp, and rax registers. rip is used to track the current instruction to process. It will generally be incremented except when dealing with function calls and returns. rsp is used as a pointer to the top of a stack in memory. It is incremented and decremented as values are pushed and popped on this stack. Finally, rax is used to pass function return values.

Loading a program

Running a program is a matter of loading the program into memory, setting the stack pointer to the last address of memory (in x86 the stack grows down), pointing rip at the entrypoint, and looping until the entrypoint function returns.

func writeBytes(to []byte, start uint64, bytes int, val uint64) {
    for i := 0; i < bytes; i++ {
        to[start+uint64(i)] = byte(val >> (8 * i) & 0xFF)
    }
}

func (c *cpu) loop(entryReturnAddress uint64) {
    for {
        <-c.tick

        ip := c.regfile.get(rip)
        if ip == entryReturnAddress {
            break
        }

        inb1 := c.mem[ip]
        _ = inb1 // TODO: deal with instructions

        // move to next instruction
        c.regfile.set(rip, ip+1)
    }
}

func (c *cpu) run(proc *process) {
    copy(c.mem[proc.startAddress:proc.startAddress+uint64(len(proc.bin))], proc.bin)
    c.regfile.set(rip, proc.entryPoint)
    initialStackPointer := uint64(len(c.mem)-8)
    writeBytes(c.mem, initialStackPointer, 8, initialStackPointer)
    c.regfile.set(rsp, initialStackPointer)
    c.loop(initialStackPointer)
    os.Exit(int(c.regfile.get(rax)))
}

We write the initial stack pointer address into the stack so that when the program finally returns, it will return to this address, at which point we can exit the program.

And now we're ready to start interpreting instructions.

Instruction decoding

Using objdump we get a sense for what the program decodes to.

$ objdump -d a.out | grep -A10 '<main>'
0000000000401106 <main>:
  401106:       55                      push   %rbp
  401107:       48 89 e5                mov    %rsp,%rbp
  40110a:       b8 fe 00 00 00          mov    $0xfe,%eax
  40110f:       5d                      pop    %rbp
  401110:       c3                      retq
  401111:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  401118:       00 00 00
  40111b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000401120 <__libc_csu_init>:

We see that 0x55 means push %rbp. And we also see that instructions aren't a fixed number of bytes. Some are one byte, some are seven. Some (not shown) are even longer.

Thankfully instructions follow some fairly simple patterns. There are a set of prefix instructions and a set of real instructions. So far we should be able to tell from the first byte whether the instruction is a prefix instruction and, if not, how many bytes the instruction will take up in total.

push

To support a new instruction, we'll look up 0x55 in an opcode table like this. Clicking on 55 in the opcode index we see that this is indeed a push instruction. 50+r means that we have to subtract 0x50 from the opcode to determine the register we should push.

The register will be 0x55 - 0x50 = 5 which if we look up in a register table is rbp. Since we set up our register enum in code in this order, we'll be able to just use the constant rbp in Go code.

Finally, since the next instruction numerically is 0x58 we know that this instruction is identified by being between 0x50 and 0x57 inclusive. This is all the info we need to handle this instruction.

// helper for dumping byte arrays as hex
func hbdebug(msg string, bs []byte) {
    str := "%s:"
    args := []interface{}{msg}
    for _, b := range bs {
        str = str + " %x"
        args = append(args, b)
    }
    fmt.Printf(str+"\n", args...)
}

func (c *cpu) loop(entryReturnAddress uint64) {
    for {
        <-c.tick

        ip := c.regfile.get(rip)
        if ip == entryReturnAddress {
            break
        }

        inb1 := c.mem[ip]

        if inb1 >= 0x50 && inb1 < 0x58 { // push
            regvalue := c.regfile.get(register(inb1 - 0x50))
            sp := c.regfile.get(rsp)
            writeBytes(c.mem, sp-8, 8, regvalue)
            c.regfile.set(rsp, uint64(sp-8))
        } else {
            hbdebug("prog", c.mem[ip:ip+10])
            panic("Unknown instruction")
        }

        c.regfile.set(rip, ip+1)
    }
}

If we try this out now we should expect it to panic on the second byte, 0x48.

$ go build -o main
$ ./main a.out
prog: 48 89 e5 b8 4 0 0 0 5d c3
panic: Unknown instruction

goroutine 19 [running]:
main.(*cpu).loop(0xc000086c30, 0x2800000)
        /home/phil/tmp/goamd/main.go:168 +0x16d
main.(*cpu).run(0xc000086c30, 0xc000086c00)
        /home/phil/tmp/goamd/main.go:180 +0xac
created by main.main
        /home/phil/tmp/goamd/main.go:211 +0x286

Looking good.

mov

Taking a look at the next two instructions with objdump we see mov encoded two different ways.

$ objdump -d a.out | grep -A4 '<main>'
0000000000401106 <main>:
  401106:       55                      push   %rbp
  401107:       48 89 e5                mov    %rsp,%rbp
  40110a:       b8 fe 00 00 00          mov    $0xfe,%eax

Looking up 0x48 we see that this is a prefix instruction that turns on 64-bit mode for the instruction. Some instructions like pop and push don't need this prefix to be in 64-bit mode. In any case, this just means we'll have to have a size flag that switches from 32-bit to 64-bit mode on seeing this instruction. This flag will be reset each time we start reading an instruction.

To deal with prefixes in general we'll loop through bytes when processing an instruction until we no longer see a prefix byte. As we see prefix bytes we'll handle them accordingly.

var prefixBytes = []byte{0x48}

func (c *cpu) loop(entryReturnAddress uint64) {
    for {
        <-c.tick

        ip := c.regfile.get(rip)
        if ip == entryReturnAddress {
            break
        }

        inb1 := c.mem[ip]

        widthPrefix := 32
        for {
            isPrefixByte := false
            for _, prefixByte := range prefixBytes {
                if prefixByte == inb1 {
                    isPrefixByte = true
                    break
                }
            }

            if !isPrefixByte {
                break
            }

            // 64 bit prefix signifier
            if inb1 == 0x48 {
                widthPrefix = 64
            } else {
                hbdebug("prog", c.mem[ip:ip+10])
                panic("Unknown prefix instruction")
            }

            ip++
            inb1 = c.mem[ip]
        }

        if inb1 >= 0x50 && inb1 < 0x58 { // push

...

Moving past this prefix we get to 0x89. This instruction is for copying one register into another. The register operands are encoded in the second byte, 0xe5, called the ModR/M byte. Pulling out the two registers is just a matter of shifting and bitmasking the right 3 bits for each.

With this knowledge we can expand the instruction handling code.

        if inb1 >= 0x50 && inb1 < 0x58 { // push
            regvalue := c.regfile.get(register(inb1 - 0x50))
            sp := c.regfile.get(rsp)
            writeBytes(c.mem, sp-8, 8, regvalue)
            c.regfile.set(rsp, uint64(sp-8))
        }  else if inb1 == 0x89 { // mov r/m16/32/64, r/m16/32/64
            ip++
            inb2 := c.mem[ip]
            rhs := register((inb2 & 0b00111000) >> 3)
            lhs := register(inb2 & 0b111)
            c.regfile.set(lhs, c.regfile.get(rhs))
        } else {
            hbdebug("prog", c.mem[ip:ip+10])
            panic("Unknown instruction")
        }

Try emulating a.out again now. It will panic on the next unknown instruction, 0xb8. From objdump disassembly we see this is another mov instruction.

Hurray! There are apparently multiple ways the same instruction can be encoded. Looking it up in the opcode table, we see 0xB8 is for when the value to be copied is a literal number. The operand will be 32 bits, or four bytes, presumably because it doesn't have the 0x48 prefix.

// helper for converting up to 8 bytes into a single integer
func readBytes(from []byte, start uint64, bytes int) uint64 {
    val := uint64(0)
    for i := 0; i < bytes; i++ {
        val |= uint64(from[start+uint64(i)]) << (8 * i)
    }

    return val
}

func (c *cpu) loop(entryReturnAddress uint64) {

    ...

    for {

        ...

        } else if inb1 >= 0xB8 && inb1 < 0xC0 { // mov r16/32/64, imm16/32/64
            lreg := register(inb1 - 0xB8)
            val := readBytes(c.mem, ip+uint64(1), widthPrefix/8)
            ip += uint64(widthPrefix / 8)
            c.regfile.set(lreg, val)
        }

        ...

Two more instructions to go: pop and ret.
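
Before we get to them properly, here's a rough sketch of how they might slot into the same dispatch; this is one plausible handling with operand-size subtleties glossed over, not necessarily the final code. pop is encoded as 0x58+r, mirroring push, and ret (0xC3) pops the return address into rip:

        } else if inb1 >= 0x58 && inb1 < 0x60 { // pop r16/32/64
            sp := c.regfile.get(rsp)
            c.regfile.set(register(inb1-0x58), readBytes(c.mem, sp, 8))
            c.regfile.set(rsp, sp+8)
        } else if inb1 == 0xC3 { // ret
            sp := c.regfile.get(rsp)
            // Pop the return address into rip. For main this is the sentinel
            // address written in run(), so the loop exits on the next iteration.
            c.regfile.set(rip, readBytes(c.mem, sp, 8))
            c.regfile.set(rsp, sp+8)
            // Jump taken: skip the rip increment at the bottom of the loop.
            continue
        }

With something along these lines in place, the loop can run main from push %rbp through retq and exit with main's return value in rax.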

A terminal debugger

Taking a break for a moment, our system is already too complex to understand. It would be helpful to have a REPL so we can step through instructions and print register and memory values.

func (c *cpu) resolveDebuggerValue(dval string) (uint64, error) {
    for reg, val := range registerMap {
        if val == dval {
            return c.regfile.get(reg), nil
        }
    }

    if len(dval) > 2 && (dval[:2] == "0x" || dval[:2] == "0X") {
        return strconv.ParseUint(dval[2:], 16, 64)
    }

    return strconv.ParseUint(dval, 10, 64)
}

func repl(c *cpu) {
    fmt.Println("go-amd64-emulator REPL")
    help := `commands:
    s/step:             continue to next instruction
    r/registers [$reg]:     print all register values or just $reg
    d/decimal:          toggle hex/decimal printing
    m/memory $from $count:      print memory values starting at $from until $from+$count
    h/help:             print this`
    fmt.Println(help)
    scanner := bufio.NewScanner(os.Stdin)

    intFormat := "%d"
    for {
        fmt.Printf("> ")
        if !scanner.Scan() {
            break
        }
        input := scanner.Text()
        parts := strings.Split(input, " ")

        switch parts[0] {
        case "h":
            fallthrough
        case "help":
            fmt.Println(help)

        case "m":
            fallthrough
        case "memory":
            msg := "Invalid arguments: m/memory $from $to; use hex (0x10), decimal (10), or register name (rsp)"
            if len(parts) != 3 {
                fmt.Println(msg)
                continue
            }

            from, err := c.resolveDebuggerValue(parts[1])
            if err != nil {
                fmt.Println(msg)
                continue
            }

            to, err := c.resolveDebuggerValue(parts[2])
            if err != nil {
                fmt.Println(msg)
                continue
            }

            hbdebug(fmt.Sprintf("memory["+intFormat+":"+intFormat+"]", from, from+to), c.mem[from:from+to])

        case "d":
            fallthrough
        case "decimal":
            if intFormat == "%d" {
                intFormat = "0x%x"
                fmt.Println("Numbers displayed as hex")
            } else {
                intFormat = "%d"
                fmt.Println("Numbers displayed as decimal")
            }

        case "r":
            fallthrough
        case "registers":
            filter := ""
           ... (truncated)
                                    

November 22, 2020

Taming Deep Recursion

When operating on hierarchical data structures, it is often convenient to formulate the logic using mutually recursive functions. For example, our semantic analysis walks the parse tree recursively and transforms it into an expression tree. The corresponding code looks roughly like this:

unique_ptr<Expression> analyzeExpression(AST* astNode) {
   switch (astNode->getType()) {
      case AST::BinaryExpression: return analyzeBinaryExpression(astNode->as<BinaryExpAST>());
      case AST::CaseExpression: return analyzeCaseExpression(astNode->as<CaseExpAST>());
      ...
   }
}

unique_ptr<Expression> analyzeBinaryExpression(BinaryExpAST* astNode) {
   auto left = analyzeExpression(astNode->left);
   auto right = analyzeExpression(astNode->right);
   auto type = inferBinaryType(astNode->getOp(), left, right);
   return make_unique<BinaryExpression>(astNode->getOp(), move(left), move(right), type);
}

It recursively walks the tree, collects input expressions, infers types, and constructs new expressions. This works beautifully until you encounter a (generated) query with 300,000 expressions, which we did. At that point our program crashed due to stack overflow. Oops.

Our first mitigation was using __builtin_frame_address(0) at the beginning of analyzeExpression to detect excessive stack usage, and to throw an exception if that happens. This prevented the crash, but is not very satisfying. First, it means we refuse a perfectly valid SQL query "just" because it uses 300,000 terms in one expression. And second, we cannot be sure that this is enough. There are several places in the code that recursively walk the algebra tree, and it is hard to predict their stack usage. Even worse, the depth of the tree can change due to optimizations. For example, when a query has 100,000 entries in the from clause, the initial tree is extremely wide but flat. Later, after we have stopped checking for stack overflows, the optimizer might transform that into a tree with 100,000 levels, again leading to stack overflow. Basically, all recursive operations on the algebra tree are dangerous.

Now common wisdom is to avoid recursion if we cannot bound the maximum depth, and to use iteration with an explicit stack instead. We spent quite some time thinking about that approach, but the main problem is that it makes the code extremely ugly. The code snippet above is greatly simplified, but even there an explicit stack would be unwieldy and ugly if we have to cover both binaryExpression and caseExpression using one stack. And the code gets cut into tiny pieces due to the control flow inversion required for manual stacks. And all that to defend against something that nearly never happens. We were unhappy with that solution; we wanted something minimally invasive that creates overhead only in the unlikely case that a user gives us extremely deep input.

One mechanism that promises to solve this problem is -fsplit-stack. There, the compiler checks for stack overflows in the function prologue and creates a new stack segment if needed. Great, exactly what we wanted! We can handle deep trees with no code change, and we only create a new stack if we indeed encounter deep recursion. Except that it is not really usable in practice. First, -fsplit-stack is quite slow. We measured 20% overhead in our optimizer when enabling split stacks, and that in cases where we did not create any new stacks at all. When -fsplit-stack does create new stacks it is even worse. This is most likely a deficiency of the implementation; one could implement -fsplit-stack much more efficiently, but the current implementation is not encouraging. Even worse, clang produces an internal compiler error when compiling some functions with -fsplit-stack. Nobody seems to use this mechanism in production, and after disappointing results we stopped considering -fsplit-stack.

But the idea of split stacks is clearly good. When encountering deep recursion we will have to switch stacks at some point. After contemplating this problem for some time we realized that boost.context offers the perfect mechanism for switching stacks: it can start a fiber with a given stack, and switching between fibers costs just 19 cycles. By caching additional stacks and their fibers in thread-local data structures we can provide our own implementation of split stacks that is fast and supports arbitrary code. Without compiler support the split-stack mechanism is visible in our code, of course, but that is fine. We have only a few entry points like analyzeExpression that will be called over and over again during recursion, and checking there is enough. Code-wise the mechanism is not too ugly; it needs two lines of code per recursion head and looks like this:

unique_ptr<Expression> analyzeExpression(AST* astNode) {
   if (StackGuard::needsNewStack())
      return StackGuard::newStack([=]() { return analyzeExpression(astNode); });
   ... // unchanged
}

Note that the lambda argument for newStack will be called from within the new stack, avoiding the stack overflow. When the lambda returns, the mechanism will use boost.context to switch back to the previous stack. The performance impact of that mechanism is negligible, as 1) we do not check for overflows all the time but only in the few places that we know are central to the recursive invocations, like analyzeExpression here, 2) stack overflows are extremely rare in practice, and we pay only one if per invocation when no overflow happens, and 3) even if they do happen, the mechanism is reasonably cheap. We cache the child stacks, and switching to a cached stack costs something like 30 cycles. And we never recurse in a hot loop.

It took us a while to get there, but now we can handle really large queries without crashing. Just for fun we tried running a query with 100,000 relations in the from clause. Fortunately our optimizer could already handle that, and now the rest of the system can handle it, too. And that with nice, intuitive, recursive code, at the small price of two lines of code per recursion head.