a curated list of database news from authoritative sources

June 22, 2019

Writing a lisp compiler from scratch in JavaScript: 5. LLVM system calls

Previously in compiler basics:
1. lisp to assembly
2. user-defined functions and variables
3. LLVM
4. LLVM conditionals and compiling fibonacci
Next in compiler basics:
6. an x86 upgrade

In this post we'll extend the ulisp compiler's LLVM backend to support printing integers to stdout.

Exit code limitations

Until now we've validated program state by setting the exit code to the result of the program computation. But the exit code is an eight-bit integer. What if we want to validate a computation that produces a result larger than 255?

To do this we need a way to print integers. This is challenging because printing normally deals with byte arrays. libc's printf, for example, takes a byte array as its first argument.

The shortest path forward is to add support for system calls so we can print one character at a time. Here's a version of a print form that hacks around not having arrays by sending each digit of a number to stdout individually.

(def print-char (c)
     ; First argument is stdout
     ; Second argument is a pointer to a char array (of length one)
     ; Third argument is the length of the char array
     (syscall/sys_write 1 &c 1))

(def print (n)
     (if (> n 9)
         (print (/ n 10)))

     ; 48 is the ASCII code for '0'
     (print-char (+ 48 (% n 10))))

In order to support this we need to add the syscall/sys_write, >, %, and / builtin forms. We'll also need to add support for taking the address of a variable.
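
For intuition, here is the same digit-printing recursion in plain JavaScript (a sketch using Node's process.stdout rather than a raw syscall):

// Recurse to emit the most significant digits first, then emit
// this call's own digit.
function print(n) {
  if (n > 9) {
    print(Math.floor(n / 10));
  }

  // 48 is the ASCII code for '0'
  process.stdout.write(String.fromCharCode(48 + (n % 10)));
}

print(1234); // prints "1234"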

All code is available on GitHub, as is the particular commit related to this post.

References

The sys_write syscall requires us to pass the memory address of the byte array to write. We don't support arrays, but we can treat an individual variable as an array of length one by passing the variable's address.

If we were compiling to C we could just pass the address of a local variable. But LLVM doesn't allow us to take the address of variables directly. We need to push the variable onto the LLVM stack to get an address.

Under the hood LLVM will likely optimize this into a local variable reference instead of first pushing to the stack.

Since LLVM IR is typed, the value representing the address of a local variable will be a pointer type. We'll need to refer to all uses of this value as a pointer. So we'll need to modify ulisp to track local types rather than hard-coding i64 everywhere.

Scope

To begin we'll modify the Scope class to track types. We only need to do this on registration. Afterward, we'll have to find all uses of local variables to make sure they use the local's value and type fields appropriately.

class Scope {

  ...

  register(local) {
    let copy = local.replace('-', '_');
    let n = 1;
    while (this.locals[copy]) {
      copy = local.replace('-', '_') + n++;
    }

    this.locals[local] = {
      value: copy,
      type: 'i64',
    };
    return this.locals[local];
  }

  ...

}

We won't go through every use of a Scope variable in this post, but you can find it in the related commit to ulisp.
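
As a rough sketch of what those uses look like (the accessor name here is illustrative, not necessarily ulisp's), a call site changes from interpolating a bare string with a hard-coded i64 to reading both fields:

// Before: a local was a plain string and i64 was hard-coded.
this.emit(1, `%${destination} = add i64 %${local}, 0`);

// After: a local is { value, type }, so uses read both fields.
const local = context.scope.get(name); // hypothetical accessor
this.emit(1, `%${destination.value} = add ${local.type} %${local.value}, 0`);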

Reference

The long-term approach for handling a reference syntactically is probably to rewrite &x to (& x) in the parser. The lazy approach we'll take for now is to handle a reference as a special kind of identifier in compileExpression.

We'll use the LLVM alloca instruction to create space on the stack. This will return a pointer and will turn the destination variable into a pointer type. Then we'll use store to set the value at the address to the current value of the variable being referenced.

class Compiler {

  ...

  compileExpression(exp, destination, context) {

    ...

    // Is a reference, push onto the stack and return its address
    if (exp.startsWith('&')) {
      const symbol = exp.substring(1);
      const tmp = context.scope.symbol();
      this.compileExpression(symbol, tmp, context);
      this.emit(1, `%${destination.value} = alloca ${tmp.type}, align 4`);
      destination.type = tmp.type + '*';
      this.emit(1, `store ${tmp.type} %${tmp.value}, ${destination.type} %${destination.value}, align 4`);
      return;
    }

    ...

  }

  ...

}

And now we're set to take the address of any local variable.

System calls

LLVM IR provides no high-level means for making system calls. The only way is to use inline assembly. The syntax is based on GCC inline assembly and is confusing, sparsely documented by example, and quick to produce unhelpful error messages.

Thankfully the assembly code needed for a syscall is only one line, one word: the syscall assembly instruction. We use inline assembly's variable-to-register mapping functionality to line up all the parameters for the syscall. Here is an example:

%result = call i64 asm sideeffect "syscall", "=r,{rax},{rdi},{rsi},{rdx}" (i64 %raxArg, i64 %rdiArg, i64 %rsiArg, i64 %rdxArg)

This says to execute the inline assembly string, "syscall". The sideeffect flag means that this assembly should always be run even if the result isn't used. =r means the inline assembly returns a value, and the rest of the string is the list of registers that arguments should be mapped to. Finally we call the function with all the LLVM variables we want to be mapped.

Eventually we should also use the inline assembly syntax to list registers that are modified so that LLVM can know to save them before and after.

Code

We'll add a mapping for syscall/sys_write and a helper function for generating syscall code, using the example above as a template. We'll support 64-bit Darwin and Linux kernels.

const SYSCALL_TABLE = {
  darwin: {
    sys_write: 0x2000004,
    sys_exit: 0x2000001,
  },
  linux: {
    sys_write: 1,
    sys_exit: 60,
  },
}[process.platform];

class Compiler {
  constructor() {
    this.outBuffer = [];
    this.primitiveFunctions = {
      def: this.compileDefine.bind(this),
      begin: this.compileBegin.bind(this),
      'if': this.compileIf.bind(this),
      '+': this.compileOp('add'),
      '-': this.compileOp('sub'),
      '*': this.compileOp('mul'),
      '%': this.compileOp('urem'),
      '<': this.compileOp('icmp slt'),
      '=': this.compileOp('icmp eq'),
      'syscall/sys_write': this.compileSyscall(SYSCALL_TABLE.sys_write),
    };
  }

  ...

  compileSyscall(id) {
    return (args, destination, context) => {
      const argTmps = args.map((arg) => {
          const tmp = context.scope.symbol();
          this.compileExpression(arg, tmp, context);
          return tmp.type + ' %' + tmp.value;
      }).join(', ');
      const regs = ['rdi', 'rsi', 'rdx', 'r10', 'r8', 'r9'];
      const params = args.map((arg, i) => `{${regs[i]}}`).join(',');
      const idTmp = context.scope.symbol().value;
      this.emit(1, `%${idTmp} = add i64 ${id}, 0`);
      this.emit(1, `%${destination.value} = call ${destination.type} asm sideeffect "syscall", "=r,{rax},${params},~{dirflag},~{fpsr},~{flags}" (i64 %${idTmp}, ${argTmps})`);
    }
  }
}

>, /

Finally, we have a few new operations to add support for. But they'll be pretty simple using the compileOp helper function.

class Compiler {
  constructor() {
    this.outBuffer = [];
    this.primitiveFunctions = {
      def: this.compileDefine.bind(this),
      begin: this.compileBegin.bind(this),
      'if': this.compileIf.bind(this),
      '+': this.compileOp('add'),
      '-': this.compileOp('sub'),
      '*': this.compileOp('mul'),
      '/': this.compileOp('udiv'),
      '%': this.compileOp('urem'),
      '<': this.compileOp('icmp slt'),
      '>': this.compileOp('icmp sgt'),
      '=': this.compileOp('icmp eq'),
      'syscall/sys_write': this.compileSyscall(SYSCALL_TABLE.sys_write),
    };
  }

  ...

}

print

We're ready to give our print function a shot.

$ cat test.lisp
(def print-char (c)
     ; First argument is stdout
     ; Second argument is a pointer to a char array (of length one)
     ; Third argument is the length of the char array
     (syscall/sys_write 1 &c 1))

(def print (n)
     (if (> n 9)
         (print (/ n 10)))

     ; 48 is the ASCII code for '0'
     (print-char (+ 48 (% n 10))))

(def main ()
     (print 1234)
     0)
$ node ulisp.js test.lisp
$ ./build/a.out
1234

Looks good! In the next post we'll talk about tail call elimination.

June 17, 2019

The Benefits of Unsharded Vitess

For many large companies seeking help with horizontal scaling, Vitess' value proposition is easily understood; running stateful workloads at astronomical scale is a hard problem that Vitess has boldly solved in the past. However, for businesses that aren't hitting the performance limitations of standard MySQL, it may seem difficult to justify placing seemingly complex middleware in your data path with no immediate reward. I'm here to show you why unsharded Vitess is not just a pre-optimization for future horizontal scaling - it provides many upgrades to the MySQL experience.

June 04, 2019

Tinybird Changelog: Faster CSV Imports

During the last couple of weeks, we’ve made major improvements to our CSV import process, increasing its speed and reliability and achieving almost a 2x performance improvement.

May 21, 2019

Writing an x86 emulator from scratch in JavaScript: 1. a stack and register machine

Better yet, take a look at this post walking through emulating x86 ELF binaries in Go:
Emulating linux/AMD64 userland: interpreting an ELF binary

Next up in emulator basics:
2. system calls

In this post we'll create a small virtual machine in JavaScript and use it to run a simple C program compiled with GCC for an x86_64 (or AMD64) CPU running Linux.

All source code is available on GitHub.

Virtual machine data storage

Our virtual machine will have two means of storing data: registers and an integer stack. Each register can store a 64-bit integer. The stack is an array of 8-bit (or 1 byte) integers.

We'll make the following registers available for modification and use by the program(mer):

RDI, RSI, RSP, RBP, RAX, RBX, RCX, RDX, R8, R9, R10, R11, R12, R13, R14, R15

The RSP register is used by the virtual machine for tracking the location of the last entry in the stack. It will be modified by the virtual machine when it encounters the POP, PUSH, CALL and RET instructions we'll support. We'll get into the specifics shortly.

And we'll make the following registers available for use (but not modification) by the program(mer):

RIP, CS, DS, FS, SS, ES, GS, CF, ZF, PF, AF, SF, TF, IF, DF, OF

Each of these has a special meaning but we'll focus on RIP. The RIP register contains the address of the instruction currently being interpreted by our virtual machine. After every instruction the virtual machine will increment the value in this register -- except for a few special instructions like CALL and RET.

Memory addresses

It will become useful to provide direct access to memory with a special syntax. We'll focus just on 64-bit addresses that will look like this:

  MOV QWORD PTR [RBP - 8], 12

This asks for the value 12 to be written into the memory address at RBP - 8 bytes. The QWORD PTR part clarifies that we want to write 8 bytes worth of the value. Since 12 fits in fewer than 8 bytes, the remaining bytes will be filled with zeros.

  ADD RAX, QWORD PTR [RBP - 8]

This asks for eight bytes starting from the memory address RBP - 8 to be added to the value in RAX and stored back in RAX.
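
Since x86 is little-endian, reading those eight bytes amounts to something like the following sketch (assuming memory is an array of byte-sized BigInts, as our implementation below will set up):

// Read 8 bytes starting at RBP - 8, least significant byte first,
// then add the result into RAX.
const address = registers.RBP - 8n;
let value = 0n;
for (let i = 0n; i < 8n; i++) {
  value |= (memory[address + i] || 0n) << (i * 8n);
}
registers.RAX += value;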

Virtual machine instruction set

In our virtual machine we'll define support for the following instructions:

  • MOV $REGISTER, $REGISTER or $MEMORY ADDRESS or $LITERAL NUMBER
    • This instruction copies the second value into the first.
  • ADD $REGISTER, $REGISTER or $MEMORY ADDRESS
    • This instruction adds the second value into the first and stores the result into the first.
  • PUSH $REGISTER
    • This instruction will decrement the RSP register by 8 bytes and store the value at the bottom of the stack.
  • POP $REGISTER
    • This instruction will increment the RSP register by 8 bytes, remove the last element in the stack (at the bottom), and store it into the register.
  • CALL $LABEL
    • This instruction will push the value in the RIP register (plus one) onto the stack and set the RIP register to the line of code of the label. More on this later.
  • RET
    • This instruction will remove the value at the bottom of the stack and store it in the RIP register.

Now we have more than enough instructions to write some interesting programs for the virtual machine.
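
To make the stack instructions concrete before we implement them, here is the PUSH/POP pattern in isolation, written in terms of the memory helpers we'll define later (a sketch, not the final interpreter code):

// The stack grows downward: PUSH moves RSP toward lower addresses
// and writes the value at the new location.
function push(process, value) {
  process.registers.RSP -= 8n;
  writeMemoryBytes(process, process.registers.RSP, value, 8);
}

// POP reads the most recently pushed value and moves RSP back up.
function pop(process) {
  const value = readMemoryBytes(process, process.registers.RSP, 8);
  process.registers.RSP += 8n;
  return value;
}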

Virtual machine semantics

We'll make one last assumption before explaining further. In our programs, there must be a main label which must contain a RET instruction. Once we hit the terminal RET, we will exit the virtual machine and set the exit code to the value stored in the RAX register.

Let's look at a simple program:

main: ; the required main label
  MOV RAX, 1 ; store 1 in RAX
  MOV RDI, 2 ; store 2 in RDI
  ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
  RET ; give control back to the virtual machine

When we run this program, first we initialize a stack (we'll give it 1000 elements) and set the RSP register to 1000 (the top of the stack). Then we look for the main label and set the RIP register to 1, the line number after the one where the label appears (0). Then, until the RIP register becomes 1000 (a sentinel value we plant on the stack for main's final RET to pop), we interpret the instruction at the line number stored in the RIP register. Once the RIP register hits 1000, we exit the program, setting RAX as the exit code.

One more example

Now let's look at one more program:

plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

main:
  MOV RDI, 1
  MOV RSI, 2
  CALL plus
  RET

Our virtual machine will start at the line after the main label. Then it will store 1 into RDI and 2 into RSI. Then it will jump to the second line in the program to add RDI and RSI and store the result in RDI. Then it will copy RDI into RAX and return control to the final line. This last RET will in turn return control to the virtual machine. Then the program will exit with exit code 3.

Parsing

Now that we've finished up describing our virtual machine language and semantics, we need to parse the instructions into a format we can easily interpret.

To do this we'll iterate over the program, skipping any lines that start with a dot. These are assembler directives that are safe for us to ignore for now. We'll also remove any characters from a semi-colon or hash-tag to the end of the line. These are comments.

We'll store a dictionary mapping label names (without the colon) to the index of the instruction that follows the label.

And we'll store the instructions as an array of objects composed of an operation and optional operands.

Code

function parse(program) {
  const labels = {};
  const instructions = [];

  const lines = program.split('\n');

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace
    // TODO: handle each line
  }

  return { labels, instructions };
}

First let's handle the directives we want to ignore:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.')) {
      continue;
    }
  }

And then comments:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }
  }

And then labels:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }

    if (line.includes(':')) {
      const label = line.split(':')[0];
      labels[label] = instructions.length;
      continue;
    }
  }

And finally instruction parsing plus the rest:

function parse(program) {
  const labels = {};
  const instructions = [];

  const lines = program.split('\n');

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }

    if (line.includes(':')) {
      const label = line.split(':')[0];
      labels[label] = instructions.length;
      continue;
    }

    const operation = line.split(/\s/)[0].toLowerCase();
    const operands = line.substring(operation.length).split(',').map(t => t.trim());
    instructions.push({
      operation,
      operands,
    });
  }

  return { labels, instructions };
}

Hurray! A brittle parser.
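
As a quick sanity check, feeding the two-label program from earlier through parse produces:

const { labels, instructions } = parse(`plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

main:
  MOV RDI, 1
  MOV RSI, 2
  CALL plus
  RET`);

console.log(labels);          // { plus: 0, main: 3 }
console.log(instructions[0]); // { operation: 'add', operands: [ 'RDI', 'RSI' ] }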

Interpreting

We've already described the semantics a few times. So let's get started with the foundation and initialization.

We'll use BigInts because JavaScript numbers can only represent integers precisely up to 53 bits. This isn't incredibly important in our basic programs, but it would quickly become painful without them.
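
A quick demonstration of the precision limit:

// Numbers silently lose integer precision past 2^53; BigInts do not.
console.log(2 ** 53 === 2 ** 53 + 1);      // true (!)
console.log(2n ** 53n === 2n ** 53n + 1n); // false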

And we'll make process memory available as an array of 8-bit integers. In order to make this easy to use, we'll also provide helper functions for writing to and reading from memory.

const fs = require('fs');

const REGISTERS = [
  'RDI', 'RSI', 'RSP', 'RBP', 'RAX', 'RBX', 'RCX', 'RDX', 'RIP', 'R8',
  'R9', 'R10', 'R11', 'R12', 'R13', 'R14', 'R15', 'CS', 'DS', 'FS',
  'SS', 'ES', 'GS', 'CF', 'ZF', 'PF', 'AF', 'SF', 'TF', 'IF', 'DF', 'OF',
];

function writeMemoryBytes(process, address, value, size) {
  // Write `size` bytes little-endian, least significant byte first.
  for (let i = 0n; i < size; i++) {
    process.memory[address + i] = (value >> (i * 8n)) & 0xFFn;
  }
}

function readMemoryBytes(process, address, size) {
  let value = 0n;
  for (let i = 0n; i < size; i++) {
    value |= (process.memory[address + i] || 0n) << (i * 8n);
  }
  return value;
}

function interpret(process) {
  // TODO: interpret
}

function main(file) {
  const memory = new Array(10000);
  const code = fs.readFileSync(file).toString();
  const { instructions, labels } = parse(code);

  const registers = REGISTERS.reduce((rs, r) => ({ ...rs, [r]: 0n }), {});
  registers.RIP = BigInt(labels.main === undefined ? labels._main : labels.main);
  registers.RSP = BigInt(memory.length - 8);

  const process = {
    registers,
    memory,
    instructions,
    labels,
  };

  // Plant a sentinel at the top of the stack: when main's final RET
  // pops it into RIP, the interpreter loop knows to stop.
  writeMemoryBytes(process, registers.RSP, registers.RSP, 8);

  interpret(process);
  return Number(registers.RAX);
}

process.exit(main(process.argv[2]));

We'll accept _main as an entry point as well as main to support our macOS users. If you know why our macOS users use _main I'd love to know.

To interpret, we grab the instruction pointed to in RIP and switch on the operation.

function interpret(process) {
  do {
    const instruction = process.instructions[process.registers.RIP];
    switch (instruction.operation.toLowerCase()) {
      case 'mov':
        break;
      case 'add':
        break;
      case 'call':
        break;
      case 'ret':
        break;
      case 'push':
        break;
      case 'pop':
        break;
    }
  } while (process.registers.RIP != readMemoryBytes(process, BigInt(process.memory.length - 8), 8));
}

Interpreting MOV

Example:

  MOV RAX, 1
  MOV RAX, RDI
  MOV QWORD PTR [RBP - 8], 8

This instruction will store a value into a register or address and increment RIP. If the left-hand side is a memory address we will write to memory.

      case 'mov': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const rhs = interpretValue(process, instruction.operands[1]);
        if (REGISTERS.includes(lhs)) {
          process.registers[lhs] = rhs;
        } else {
          writeMemoryBytes(process, lhs.address, rhs, lhs.size);
        }
        process.registers.RIP++;
        break;
      }

We're delegating to a helper function to handle registers vs. memory addresses vs. literals appropriately. Without memory addresses it's a simple function:

function interpretValue(process, value, { lhs } = { lhs: false }) {
  if (REGISTERS.includes(value)) {
    if (lhs) {
      return value;
    } else {
      return process.registers[value];
    }
  }

  return BigInt.asIntN(64, value);
}

We need to do some hacking to support memory addresses:

function interpretValue(process, value, { lhs } = { lhs: false }) {
  if (REGISTERS.includes(value)) {
    if (lhs) {
      return value;
    } else {
      return process.registers[value];
    }
  }

  if (value.startsWith('QWORD PTR [')) {
    const offsetString = value.substring('QWORD PTR ['.length, value.length - 1).trim();
    if (offsetString.includes('-')) {
      const [l, r] = offsetString.split('-').map(l => interpretValue(process, l.trim()));
      const address = l - r;
      const bytes = 8; // qword is 8 bytes
      if (lhs) {
        return { address, size: bytes };
      } else {
        return readMemoryBytes(process, address, bytes);
      }
    }

    throw new Error('Unsupported offset calculation: ' + value);
  }

  return BigInt.asIntN(64, value);
}

Interpreting ADD

Example:

  ADD RAX, RDI

This instruction will combine both registers and store the result in the first, then increment the RIP register.

      case 'add': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const rhs = interpretValue(process, instruction.operands[1]);
        process.registers[lhs] += rhs;
        process.registers.RIP++;
        break;
      }

Interpreting CALL

Example:

  CALL plus

This instruction stores RIP (plus one, to continue after the call instruction) on the stack and sets RIP to the location specified by the label.

      case 'call': {
        process.registers.RSP -= 8n;
        writeMemoryBytes(process, process.registers.RSP, process.registers.RIP + 1n, 8);
        const label = instruction.operands[0];
        process.registers.RIP = BigInt(process.labels[label]); // labels hold plain Numbers; keep RIP a BigInt
        break;
      }

Interpreting RET

Example:

  RET

This instruction removes the last element from the stack and stores it in the RIP register.

      case 'ret': {
        const value = readMemoryBytes(process, process.registers.RSP, 8);
        process.registers.RSP += 8n;
        process.registers.RIP = value;
        break;
      }

Interpreting PUSH

Example:

  PUSH RAX

This instruction stores the value in the register on the stack and increments RIP.

      case 'push': {
        const value = interpretValue(process, instruction.operands[0]);
        process.registers.RSP -= 8n;
        writeMemoryBytes(process, process.registers.RSP, value, 8);
        process.registers.RIP++;
        break;
      }

Interpreting POP

Example:

  POP RAX

This instruction removes the last element from the stack and stores it into the register specified. Then it increments RIP.

      case 'pop': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const value = readMemoryBytes(process, process.registers.RSP, 8);
        process.registers.RSP += 8n;
        process.registers[lhs] = value;
        process.registers.RIP++;
        break;
      }

All together

$ cat test1.asm
main: ; the required main label
  MOV RAX, 1 ; store 1 in RAX
  MOV RDI, 2 ; store 2 in RDI
  ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
  RET ; give control back to the virtual machine
$ node emulator.js test1.asm
$ echo $?
3

And finally, let's see what we can do with a simple C program:

$ cat plus.c
long main() {
  long a = 5;
  long b = 6;
  return a + b;
}
$ gcc -S -masm=intel -o plus.s plus.c
$ node emulator.js plus.s
$ echo $?
11

And we've got the start of a working x86_64/AMD64 emulator.

Next steps

We aren't setting flags appropriately to support conditionals, so that's low-hanging fruit. Syscalls open up a new world, and we'll end up needing them since exit codes are limited to 8 bits of information. And our parsing is brittle; dealing with ELF files may be a better direction to go and also enables more. We'll explore these aspects and others in follow-up posts.

Human interest

I originally intended to build a GameBoy emulator because the hardware is simple and uniform. But I found it easiest to start hacking together an AMD64 emulator because AMD64 is well-documented and gcc is easy enough to use. I'm still interested in the GameBoy, though, unless/until I figure out how to emulate a graphics card for AMD64.

It's tricky! But not that tricky. I built a graphical debugger around this emulator to help out with the logic and off-by-one errors. But ultimately it's been surprising to me how easy it is to get started -- especially when I'm not concerned about absolute correctness (yet).

May 16, 2019

Why use learning when you can fit?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well.

Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.

As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.

Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find the good fit. But just the same is true for the learned indexes, of course.

Now, can we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can construct a linear spline that bounds the maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we then can perform an efficient local search on a small, known window of the data.

The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we would find these with binary search. Note that the spline is much smaller than the original data, but we still want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline knots, leading to the aforementioned time bound.
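
To make the lookup path concrete, here is a minimal JavaScript sketch of the three stages. All names are illustrative and the fitting itself is assumed to have happened elsewhere; this is not the implementation behind the experiments below:

// knots: [{x, y}] spline knots sorted by x, mapping key -> position.
// poly: polynomial coefficients (highest degree first) predicting a
// knot index from a key. posErr: the Chebyshev bound on the spline's
// position error.
function evalPoly(poly, x) {
  return poly.reduce((acc, c) => acc * x + c, 0); // Horner's rule
}

function lookup(data, knots, poly, posErr, key) {
  // Stage 1: the polynomial predicts a spline knot index...
  let k = Math.max(0, Math.min(knots.length - 2, Math.round(evalPoly(poly, key))));
  // ...and a local search snaps to the segment containing the key.
  while (k > 0 && knots[k].x > key) k--;
  while (k < knots.length - 2 && knots[k + 1].x < key) k++;
  // Stage 2: linear interpolation on that segment estimates a position.
  const a = knots[k], b = knots[k + 1];
  const est = Math.round(a.y + ((key - a.x) / (b.x - a.x)) * (b.y - a.y));
  // Stage 3: local search in the known error window around the estimate.
  const lo = Math.max(0, est - posErr);
  const hi = Math.min(data.length - 1, est + posErr);
  for (let i = lo; i <= hi; i++) {
    if (data[i] === key) return i;
  }
  return -1; // not present
}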

How well does this work in practice? On the map data set from the learned indexes paper and on a log normal data set we get the following results. (The learned index numbers are from the paper, the b-tree numbers are from here, and the spline numbers are from these experiments. I still do not really know what the averages mean for the learned indexes, but they are probably the errors averaged over all models.)


Map data                 size (MB)   avg error
Learned Index (10,000)   0.15        8 ± 45
Learned Index (100,000)  1.53        2 ± 36
B-tree (10,000)          0.15        225
B-tree (100,000)         1.53        22
Spline (10,000)          0.15        193
Spline (100,000)         1.53        22

Log normal data          size (MB)   avg error
Learned Index (10,000)   0.15        17,060 ± 61,072
Learned Index (100,000)  1.53        17,005 ± 60,959
B-tree (10,000)          0.15        1330
B-tree (100,000)         1.53        3
Spline (10,000)          0.15        153
Spline (100,000)         1.53        1


Somewhat surprisingly, the accuracy of the spline is nearly identical to that of the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contain many millions of data points in both cases, so the prediction accuracy is really high.

For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, fitted functions could be an attractive alternative to machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.

May 14, 2019

Tail call elimination

In this post we'll explore what tail calls are, why they are useful, and how they can be eliminated in an interpreter, a compiler targeting C++, and a compiler targeting LLVM IR.

Tail calls

A tail call is a function call made at the end of a block that returns the value of the call (some languages do not force this return requirement). Here are a few examples.

function tailCallEx1() {
  // Loops forever but is a tail call.
  return tailCallEx1();
}

function tailCallEx2(x) {
  if (x) {
    return tailCallEx2(x - 1);
  }

  return 1;
}

function tailCallEx3(x) {
  return x && tailCallEx3(x - 1) || 1;
}

function tailCallEx4(x) {
  switch (x) {
    case 0:
      return 1;
    default:
      return tailCallEx4(x - 1);
  }
}

And here are some examples of non-tail calls.

function nonTailCallEx1(x) {
  if (x) {
    // Not a tail call because the call is not the value returned.
    return 1 + nonTailCallEx1(x - 1);
  }

  return 0;
}

function nonTailCallEx2(x) {
  if (x) {
    const r = nonTailCallEx2(x - 1);
    // Not a tail call because the value is not *immediately* returned.
    console.log(r);
    return r;
  }

  return 1;
}

Why is this important?

Some languages can rewrite a recursive tail call as a jump/branch/goto instead of a function call. This allows:

  1. Potential performance gain if function calls have large overhead
  2. No stack overflows due to no nested function call stacks
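
Benefit 2 is easy to demonstrate in a language that does not perform tail call elimination, like JavaScript on V8:

function count(n) {
  // A tail call, but V8 still allocates a stack frame for it.
  return n === 0 ? 0 : count(n - 1);
}

count(100000); // RangeError: Maximum call stack size exceeded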

Implementation 1: Interpreter

Given a tail call recursive fibonacci:

function fibonacci(a, b, n) {
  if (n === 0) {
    return a;
  }

  if (n === 1) {
    return b;
  }

  return fibonacci(b, a + b, n - 1);
}

Here is how we could transform this (by hand) into a version without a tail call.

function fibonacci(a, b, n) {
  while (true) {
    if (n === 0) {
      return a;
    }

    if (n === 1) {
      return b;
    }

    const a1 = b;
    const b1 = a + b;
    const n1 = n - 1;
    a = a1;
    b = b1;
    n = n1;
  }
}

If this were written in a language with labels and goto, we could simplify the code slightly by using them. But the effect is the same as a loop.
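
JavaScript has no goto, but labeled loops get us the same shape:

function fibonacci(a, b, n) {
  tail_recurse: while (true) {
    if (n === 0) {
      return a;
    }

    if (n === 1) {
      return b;
    }

    [a, b, n] = [b, a + b, n - 1];
    continue tail_recurse;
  }
}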

Since we're in an interpreter (that isn't JIT compiling), we cannot pick between these two and must merge them. So we put all function bodies in a loop and break if it isn't a tail call. Otherwise we line up the parameters and let the loop take us back.

Here is an example of this strategy used in a Scheme interpreter written in D.

// Define a new function with name `name` and add it to the context.
Value namedLambda(Value arguments, Context ctx, string name) {
  auto funArguments = car(arguments);
  auto funBody = cdr(arguments);

  Value defined(Value parameters, void** rest) {
    Context newCtx = ctx.dup;

    // Copy the runtime calling context to the new context.
    Context runtimeCtx = cast(Context)(*rest);
    auto runtimeCallingContext = runtimeCtx.callingContext;
    newCtx.callingContext = runtimeCallingContext.dup;

    Value result;
    // Loop forever, will break immediately if not a tail call
    bool tailCalling = false;
    while (true) {
      if (valueIsList(funArguments)) {
        auto keyTmp = valueToList(funArguments);
        auto valueTmp = valueToList(parameters);
        while (true) {
          auto key = valueToSymbol(keyTmp[0]);
          auto value = valueTmp[0];

          newCtx.set(key, value);

          // TODO: handle arg count mismatch
          if (valueIsList(keyTmp[1])) {
            keyTmp = valueToList(keyTmp[1]);
            valueTmp = valueToList(valueTmp[1]);
          } else {
            break;
          }
        }
      } else if (valueIsSymbol(funArguments)) {
        auto key = valueToSymbol(funArguments);
        newCtx.set(key, car(parameters));
      } else if (!valueIsNil(funArguments)) {
        error("Expected symbol or list in lambda formals", funArguments);
      }

      if (!tailCalling) {
        newCtx.callingContext.push(Tuple!(string, Delegate)(name, &defined));
      }

      result = eval(withBegin(funBody), cast(void**)[newCtx]);

      // In a tail call, let the loop carry us back.
      if (newCtx.doTailCall == &defined) {
        tailCalling = true;
        parameters = result;
        newCtx.doTailCall = null;
      } else {
        break; // Not in a tail call, we're done a regular function call.
      }
    }

    return result;
  }

  return makeFunctionValue(name, &defined, false);
}

We cannot eliminate mutually recursive tail calls with this method. We could use continuation-passing style, but that would not address the underlying concern: not making a function call.
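
For example, these tail calls bounce between two functions, so the per-function loop above cannot turn them into iteration:

function isEven(n) {
  return n === 0 ? true : isOdd(n - 1);
}

function isOdd(n) {
  return n === 0 ? false : isEven(n - 1);
}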

Implementation 2: Compiling to C++

The strategy here is the same as in the interpreter, except that since tail-call-recursive functions are known at compile time, we can generate specialized (rather than general) code in function bodies.

Here is how a JavaScript compiler transforms the above fibonacci implementation into C++:

void tco_fib(const FunctionCallbackInfo<Value> &args) {
  Isolate *isolate = args.GetIsolate();
  double tco_n = toNumber(args[0]);
  double tco_a = toNumber(args[1]);
  double tco_b = toNumber(args[2]);

tail_recurse_1:

    ;

  bool sym_if_test_58 = (tco_n == 0);
  if (sym_if_test_58) {
    args.GetReturnValue().Set(Number::New(isolate, tco_a));
    return;
  }

  bool sym_if_test_70 = (tco_n == 1);
  if (sym_if_test_70) {
    args.GetReturnValue().Set(Number::New(isolate, tco_b));
    return;
  }

  Local<Value> sym_arg_83 = Number::New(isolate, (tco_n - 1));
  Local<Value> sym_arg_92 = Number::New(isolate, (tco_a + tco_b));
  tco_n = toNumber(sym_arg_83);
  tco_a = tco_b;
  tco_b = toNumber(sym_arg_92);
  goto tail_recurse_1;
}

This is implemented by checking every function call. If the function call is in tail call position, we generate code for jumping to the beginning of the function. Otherwise, we generate a call as usual.

Here is how the tail call check and code-generation is done in the compiler:

function compileCall(
  context: Context,
  destination: Local,
  ce: ts.CallExpression,
) {
  let tcoLabel;
  let tcoParameters;
  if (ce.expression.kind === ts.SyntaxKind.Identifier) {
    const id = identifier(ce.expression as ts.Identifier);
    const safe = context.locals.get(mangle(context.moduleName, id));

    if (safe && context.tco[safe.getCode()]) {
      const safeName = safe.getCode();
      tcoLabel = context.tco[safeName].label;
      tcoParameters = context.tco[safeName].parameters;
    }
  }

  ...

  if (ce.expression.kind === ts.SyntaxKind.Identifier) {
    const id = identifier(ce.expression as ts.Identifier);
    const mangled = mangle(context.moduleName, id);
    const safe = context.locals.get(mangled);

    if (safe) {
      if (tcoLabel) {
        args.forEach((arg, i) => {
          compileParameter(
            context,
            tcoParameters[i],
            i,
            i === args.length - 1,
            arg,
          );
        });

        context.emitStatement(`goto ${tcoLabel}`);
        context.emit('', 0);
        destination.tce = true;
        return;
      }
    }
  }

  ...

  // Otherwise generate regular function call

This requires you to have been building up the state throughout the AST to know whether or not any particular call is in tail position.

Implementation 3: Compiling to LLVM IR

LLVM IR is the most boring because all you do is mark any tail call as being a tail call. Then so long as the call meets some requirements, the key one being that the result of the call must be returned immediately, LLVM will generate a jump instead of a call for you.

Given the following lisp-y implementation of the same tail call recursive fibonacci function (compiler here):

(def fib (a b n)
     (if (= n 0)
         a
       (fib b (+ a b) (- n 1))))

We generate the following LLVM IR:

define i64 @fib(i64 %a, i64 %b, i64 %n) {
  %ifresult13 = alloca i64, align 4
  %sym14 = add i64 %n, 0
  %sym15 = add i64 0, 0
  %sym12 = icmp eq i64 %sym14, %sym15
  br i1 %sym12, label %iftrue16, label %iffalse17
iftrue16:
  %sym18 = add i64 %a, 0
  store i64 %sym18, i64* %ifresult13, align 4
  br label %ifend19
iffalse17:
  %sym21 = add i64 %b, 0
  %sym23 = add i64 %a, 0
  %sym24 = add i64 %b, 0
  %sym22 = add i64 %sym23, %sym24
  %sym26 = add i64 %n, 0
  %sym27 = add i64 1, 0
  %sym25 = sub i64 %sym26, %sym27
  ; NOTE the `tail` before `call` here
  %sym20 = tail call i64 @fib(i64 %sym21, i64 %sym22, i64 %sym25)
  ret i64 %sym20
  store i64 %sym20, i64* %ifresult13, align 4
  br label %ifend19
ifend19:
  %sym11 = load i64, i64* %ifresult13, align 4
  ret i64 %sym11
}

The only difference between supporting tail call elimination or not is whether the call instruction is preceded by a tail directive. That makes the implementation very simple:

const isTailCall = module.exports.TAIL_CALL_ENABLED &&
                   context.tailCallTree.includes(validFunction.value);
const maybeTail = isTailCall ? 'tail ' : '';
this.emit(1, `%${destination.value} = ${maybeTail}call ${validFunction.type} @${validFunction.value}(${safeArgs})`);
if (isTailCall) {
  this.emit(1, `ret ${destination.type} %${destination.value}`);
}

Generated assembly

The resulting generated code (run through llc) for that call will be:

...
    add rax, rsi
    dec rdx
    mov rdi, rsi
    mov rsi, rax
    jmp _fib               ## TAILCALL
...

And if tail call elimination is disabled:

...
    add rax, rsi
    dec rdx
    mov rdi, rsi
    mov rsi, rax
    call    _fib
...

Summary

The last bit I haven't covered is how you track whether or not a call is in tail position. That is difficult to cover in a blog post because it's a matter of you propagating/not propagating at each syntax node type. But generally speaking, if the syntax node is not in tail position (e.g. not the last expression in a block), you drop the tail state you've built up. When you make a function call, you add the function name to the tail state.
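
Here is a minimal sketch of that propagation (the names mirror the earlier snippets, but this is not ulisp's exact code):

// When we begin compiling a function, its own name joins the tail state.
function compileDefine(name, body, context) {
  const inner = { ...context, tailCallTree: [...context.tailCallTree, name] };
  compileBody(body, inner);
}

// Only the last expression of a body keeps the tail state; the others
// compile with it dropped, since their values are not returned.
function compileBody(expressions, context) {
  expressions.forEach((exp, i) => {
    const isLast = i === expressions.length - 1;
    compileExpression(exp, isLast ? context : { ...context, tailCallTree: [] });
  });
}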

But I will be covering this in detail in the LLVM case in the next post in my compiler basics series.