a curated list of database news from authoritative sources

February 07, 2019

Choosing a Primary Vindex

When we talk to Vitess, it appears as if we are talking to a single unit. What makes this seemingly magical behavior possible? The VSchema, the beautiful mind that routes each of your queries to the correct shard. When Vitess shards your database, it assigns each row an identifier called a keyspace ID. We then break the domain of all keyspace ID values into ranges and assign a range to each shard.
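A rough sketch of that idea (invented for illustration; this is not Vitess's actual implementation, and the shard names and ranges here are made up): each shard owns a half-open range of the keyspace ID space, and routing a row is a range lookup.

```javascript
// Hypothetical sketch of range-based routing. Each shard owns a half-open
// range of keyspace IDs; finding the right shard is a range lookup.
const shards = [
  { name: '-80', start: 0x00, end: 0x80 },  // keyspace IDs [0x00, 0x80)
  { name: '80-', start: 0x80, end: 0x100 }, // keyspace IDs [0x80, 0x100)
];

function shardFor(keyspaceId) {
  return shards.find(s => keyspaceId >= s.start && keyspaceId < s.end).name;
}

console.log(shardFor(0x23)); // '-80'
console.log(shardFor(0xf1)); // '80-'
```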

February 01, 2019

Honest asymptotic complexity for search trees and tries

Fun fact that was pointed out to me by Viktor: all traditional books on algorithms and data structures that we could find gave the lookup costs of balanced search trees as O(log n) (i.e., the depth of the search tree), and the lookup costs of tries as O(k) (i.e., the length of the key). At first glance this is a logarithmic-time lookup against a linear-time lookup, which makes people nervous when thinking about long keys.

But that argument is very unfair. Either we consider a key comparison an O(1) operation, in which case a tree lookup is indeed in O(log n), but then a trie lookup is in O(1)! Since the length of the key has to be bounded by a constant to get O(1) comparisons, the depth of the trie is bounded, too. Or the length of the key matters, in which case a trie lookup is indeed in O(k), but then a tree lookup is in O(k log n): we have to compare with the key on every level, and if we are unlucky we have to look at the whole key, which gives the factor k.

Which of course makes tries much more attractive asymptotically. Note that we ignore wall-clock times here, which are much more difficult to predict, but in many if not most cases tries are indeed much faster than search trees.

I believe the reason why textbooks get away with this unfair comparison is that they all present balanced search trees with integer keys:

[Figure: a balanced search tree with integer keys 8, 10, 4, 1, 6]
Tries, on the other hand, are traditionally introduced with string examples. If they had used string keys for balanced search trees instead, it would have been clear that the key length matters:

[Figure: a balanced search tree with string keys ABCD8, ABCD10, ABCD4, ABCD1, ABCD6]
The trie examines every key byte at most once, while the search tree can examine every key byte log n times. Thus, the asymptotic complexity of tries is actually better than that of balanced search trees.
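To make the factor k concrete, here is a toy counter, invented for this note, using the example keys above: a binary search over a sorted array stands in for the balanced search tree (log n key comparisons, each examining up to k characters), while the trie examines each character at most once.

```javascript
// Count character comparisons during lookup: sorted-array binary search
// (a stand-in for a balanced search tree) versus a simple trie.
function compareStrings(a, b, counter) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    counter.count++; // one character comparison
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return a.length - b.length;
}

function treeLookup(sortedKeys, key, counter) {
  let lo = 0, hi = sortedKeys.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const c = compareStrings(key, sortedKeys[mid], counter);
    if (c === 0) return true;
    if (c < 0) hi = mid - 1; else lo = mid + 1;
  }
  return false;
}

function trieInsert(root, key) {
  let node = root;
  for (const ch of key) node = node[ch] || (node[ch] = {});
  node.isKey = true;
}

function trieLookup(root, key, counter) {
  let node = root;
  for (const ch of key) {
    counter.count++; // each key character examined at most once
    node = node[ch];
    if (!node) return false;
  }
  return !!node.isKey;
}

const keys = ['ABCD1', 'ABCD10', 'ABCD4', 'ABCD6', 'ABCD8']; // sorted
const root = {};
keys.forEach(k => trieInsert(root, k));

const treeCounter = { count: 0 }, trieCounter = { count: 0 };
treeLookup(keys, 'ABCD6', treeCounter);
trieLookup(root, 'ABCD6', trieCounter);
// The tree re-examines the shared 'ABCD' prefix on every comparison;
// the trie walks it exactly once.
console.log(treeCounter.count, trieCounter.count);
```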

January 22, 2019

Transparency and communication on small teams

I saw a post on dev.to that talks about dysfunctional teams. This is a response that focuses specifically on how to prevent burnout from overworking. This is aimed at senior/lead engineers and engineering/project managers -- because everyone in a leadership role is responsible for the health of the team and the company.

In an otherwise good company with hard-working, ethical employees, overworking happens because of imperfect communication. If either of those premises does not hold, you have more serious issues and have no need for this post.

The primary subjects of poor communication are:

  • Capacity/capabilities
  • Priorities
  • Results

If any member of the team (or worse, the entire team) is not honestly reporting on their capacity and capability, this will drive them to overwork to make up for what they couldn't accomplish on work hours.

If any member of the team (or worse, the entire team) is not honestly and publicly reporting on what they understand to be the priorities, they will end up needing to work overtime if true priorities become apparent too late.

And if any member of the team (or worse, the entire team) is not honestly and publicly reporting on what they accomplished, they will end up needing to work overtime if discrepancies become apparent too late.

Solution

Put a sprint process in place and schedule at least one meeting every sprint. Discover every political, technical, and structural stakeholder and find a time they can attend this meeting. At this meeting you will cover at a high level (perhaps with some demos) what was accomplished in the sprint and what you intend to accomplish in the next sprint.

If any stakeholder cannot make this meeting, find a time to sync up with him/her separately.

Your sprints should not last more than two weeks because any longer is too long to go before talking to/reviewing with your stakeholders.

Finally, publish a report on what you accomplished this sprint (and also what you did not accomplish!) and what you plan to accomplish the next sprint. For example, I send an email to the engineering organization with two docs at the end of each sprint: 1) a review doc listing tasks accomplished/not accomplished and 2) a list of tasks planned for the next sprint. This gives your stakeholders (and anyone else interested) an opportunity to review the contents of the meeting at their leisure.

Doing this can be difficult and embarrassing at first. Hard-working, ethical employees never want to be seen as not accomplishing their share of work. But the most important thing for the mid-to-long-term health of these employees is to get them reporting honestly.

This helps make it clear where these employees can legitimately improve (i.e. receive more training) and where it's necessary to hire more or different employees. You'll likely need to put pressure on every team member to report honestly and to do so without fear.

And as a result of doing this, you've done everything you can as a senior/lead member of a small team to push responsibility for your team's work up to your stakeholders. This is the best position to be in.

January 20, 2019

Windows

It has been six years since I last used Windows for any remotely serious software development. I've used Ubuntu, Arch, or FreeBSD since. But eventually I spent so much time working around common workplace tasks that I decided to put Windows 10 Pro on my work laptop.

Windows Subsystem for Linux

Introduced in 2016, this technology allows Windows to run unmodified Linux binaries. The core feat is syscall translation.

It works nearly flawlessly. This means I can do all my Go, Node, PostgreSQL development on Windows without a virtual machine using bash, tmux, git, emacs, etc.

I've seen a few minor exceptions over the course of regular software development in WSL.

More generally, Linux programs are heavily file-oriented. And Windows I/O is not designed well for that. In the worst cases (installing/adding Node packages) it can take minutes to do operations that would take Linux seconds.

Vagrant

Vagrant-Windows interoperability is abysmal.

As noted above, you cannot manage Hyper-V from vagrant within WSL. So you're stuck using Powershell. Even then, managing synced files from vagrant is a nightmare. The default sync method requires you to sign in using your Windows Live username and password on every reboot. But Node package installation attempts some file operations that are not supported over the default synced, network filesystem.

When I switched to rsync, vagrant wouldn't reliably sync when the virtual machine went down and came back up.

After hours of trying to get some files synced with vagrant I gave up.

Hyper-V

Hyper-V's GUI is much more complex/feature-complete than VirtualBox's. It even provides an Ubuntu quick-install that I used to jump right in. I don't recommend using this though, because it gives you no option but an 11GB hard disk. I didn't realize this until I went through an hour or two of post-install customization only to run out of space. Too lazy to boot into a live CD to grow the root filesystem, I reinstalled with a more suitable 64GB drive and went through the hour-long post-install customization process again.

Networking in Hyper-V is more complex/feature-complete than VirtualBox's as well. To access a Hyper-V machine you must manually create a new virtual network interface and associate it with the machine. Static IP addresses appear to be controlled at the host networking level (e.g. Control Panel) instead of within the Hyper-V interface. This highlights how these virtual interfaces are first-class, but it overcomplicates the process of getting started.

Ultimately I gave up on a static IP address and decided to reboot less frequently.

Performance-wise Hyper-V machines are exactly as expected: excellent.

Misc

Docker support on Windows needs work. It took me a while to understand how Docker interacts with the WSL filesystem and what I needed to do to allow Docker to mount. The complexity is similar on macOS when you want to mount privileged directories like /var, but the experience is worse on Windows.

Apparently Windows does have tiling window managers, but I have not tried one out yet.

Powershell, a language with real types, is pretty compelling. But I have not spent enough time with it to be efficient. And since WSL is mostly good enough I don't really plan to.

Windows doesn't allow you to delete any files that are "in use". This is kinda cool, except that the errors you get when trying to delete files that are in use are useless. They are even more useless when you get a plain "could not delete directory" when you try to delete a directory with some in-use file inside it. I had to delete the files within by hand until I found the one that was in use.

Conclusion

If you have never run Linux or FreeBSD, don't use this post as an excuse not to. You should run Linux or FreeBSD for the experience. But if you've reached diminishing returns in your Linux/FreeBSD use, Windows as a development environment has come a long way. It may be the best platform available for software development, the profession.

Writing a lisp compiler from scratch in JavaScript: 2. user-defined functions and variables

Previously in compiler basics:
1. lisp to assembly

Next in compiler basics:
3. LLVM
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade

In this post we'll extend the compiler to support defining functions and variables. Additionally, we'll require the program's entrypoint to be within a main function.

The resulting code can be found here.

Function definition

The simplest function definition we need to support is for our main function. This will look like this:

$ cat basic.lisp
(def main ()
     (+ 1 2))

Where compiling and running it should produce a return code of 3:

$ node ulisp.js basic.lisp
$ ./build/a.out
$ echo $?
3

Parsing function definitions

The entire language is defined in S-expressions and we already parse S-expressions.

$ node
> const { parse } = require('./parser');
> JSON.stringify(parse('(def main () (+ 1 2))'));
'[[["def","main",[],["+",1,2]]],""]'

So we're done!

Code generation

There are a few tricky parts to code generation once function definitions are introduced:

  • Function definitions are not expressions (in assembly)
  • Function calling conventions for the callee
  • Variable scope

Function definitions

A function definition looks like a function call. So we'll need to keep a list of "primitive" functions that handle what looks like function calls differently.

function compile_define() {
  // TODO
}

const primitive_functions = {
  def: compile_define,
};

Then in our compile_call function we need to see if the function being "called" is in this list. If so, we allow the associated callback to handle compilation.

function compile_call(fun, args, destination) {
  if (primitive_functions[fun]) {
    primitive_functions[fun](args, destination);
    return;
  }

  // Save param registers
  args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile arguments and store as params
  args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i]));

  emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || fun}`);

  // Restore param registers
  args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

Now we can begin thinking about compile_define. It takes args which will be a list of three elements containing the function's:

  • name
  • parameters
  • and body

It does not use destination because we're treating function definitions as statements for now and not as expressions. If we were treating it as an expression, we might store the address of the function in the destination register. We keep destination around to keep the primitive function signatures consistent.

Based on how we called functions before and how we defined the hard-coded plus function, we know what a function definition in assembly generally looks like. And we know the arguments to the function, when called, will be in RDI, RSI, and RDX.

function compile_define([name, parameters, body]) {
  // Function name becomes a label we can CALL
  emit(0, `${name}:`);

  // Something to do with RDI, RSI, RDX and the parameters variable?

  // We renamed compile_argument to compile_expression to be more general
  compile_expression(body[0], 'RAX');

  // Maybe some cleanup to do with RDI, RSI, RDX?

  emit(1, 'RET\n');
}

Not a bad first sketch. But how do we match up RDI, RSI, RDX and the user-defined parameters variable names? For example in the following:

(def plus-two (a)
     (+ a 2))

It's clear to us that a must match up to RDI. In order to do this we need to track all variables in a scope dictionary mapping the variable name to the register where it's stored.
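Sketching that idea with a plain object (illustrative only; the real compiler can shape this however it likes), the scope inside plus-two's body would map the parameter to its register, and a lookup is just a dictionary access:

```javascript
// Illustrative scope dictionary: names the program uses map to the
// register where the value lives.
const scope = {
  a: 'RDI', // plus-two's first parameter lives in the first parameter register
};

function lookup(name) {
  if (!(name in scope)) {
    throw new Error('Attempt to reference undefined variable: ' + name);
  }
  return scope[name];
}

console.log(lookup('a')); // 'RDI'

try {
  lookup('b'); // b was never defined...
} catch (e) {
  console.log(e.message); // ...so the compiler can fail quickly
}
```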

Additionally, keeping track of scope can help us fail quickly in the following scenario:

(def plus-two (a)
     (+ b 2))

The variable b is used but never defined. It has not been added to the scope dictionary. So our compiler can fail quickly saying there is an undefined variable being referenced.

Taking this a step further, what if we want to catch the following too:

(def plus-two (a)
     (plus a 2))

We're trying to call plus but it has not been defined. We should be able to fail quickly here too. But that means we'll need to track the scope of function names in addition to variables. We'll choose to track function names and variable names in the same scope dictionary.

This is the distinction between a lisp-1 and a lisp-2. We are a lisp-1 like Scheme because we have a single scope. Common Lisp is a lisp-2 because it stores function name scope separately from variable name scope.

Implementing scope

We need to revise every compile function to accept a scope dictionary (specifically: compile, compile_expression, compile_call, and compile_define). If a variable is referenced, we need to look up its location in the scope dictionary. If a variable is defined (e.g. a function name or a function parameter) we need to add a mapping to the scope dictionary.

Modifying compile_expression is easiest:

function compile_expression(arg, destination, scope) {
  // Is a nested function call, compile it
  if (Array.isArray(arg)) {
    compile_call(arg[0], arg.slice(1), destination, scope);
    return;
  }

  if (scope[arg] || Number.isInteger(arg)) {
    emit(1, `MOV ${destination}, ${scope[arg] || arg}`);
  } else {
    throw new Error('Attempt to reference undefined variable or unsupported literal: ' + arg);
  }
}

Next we modify compile_call:

function compile_call(fun, args, destination, scope) {
  if (primitive_functions[fun]) {
    primitive_functions[fun](args, destination, scope);
    return;
  }

  // Save param registers
  args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile arguments and store as params
  args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i], scope));

  const validFunction = BUILTIN_FUNCTIONS[fun] || scope[fun];
  if (validFunction) {
    emit(1, `CALL ${validFunction}`);
  } else {
    throw new Error('Attempt to call undefined function: ' + fun);
  }

  // Restore param registers
  args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

And then compile_define where we modify scope for the first time:

function compile_define([name, params, ...body], destination, scope) {
  // Add this function to outer scope
  scope[name] = name.replace(/-/g, '_'); // hyphens aren't valid in assembly labels

  // Copy outer scope so parameter mappings aren't exposed in outer scope.
  const childScope = { ...scope };

  emit(0, `${scope[name]}:`);

  params.forEach(function (param, i) {
    const register = PARAM_REGISTERS[i];
    // Store parameter mapped to associated register
    childScope[param] = register;
  });

  // Pass childScope in for reference when body is compiled.
  compile_expression(body[0], 'RAX', childScope);

  emit(1, 'RET\n');
}

And finally we need to modify the entrypoint compile:

module.exports.compile = function (ast) {
  emit_prefix();
  // Pass in new, empty scope mapping
  compile_call(ast[0], ast.slice(1), 'RAX', {});
  emit_postfix();
}

And scope-wise we're pretty good!

Function calling convention: callee

We currently have a problem: we're using parameter registers to store local variables, which conflicts with how we store parameters for function calls made within the function itself.

Ideally we could store function local variables (including the parameters when we get them) separately from how we store function call parameters within the function.

Thankfully, the calling convention we've followed gives us a set of registers that are callee-preserved. Of them we'll use RBX, RBP, and R12, in that order. These are ours to mess with so long as we save and restore them within the function.

Applying the same storing/restoring strategy to local variables as we did for parameters, we get:

const LOCAL_REGISTERS = [
  'RBX',
  'RBP',
  'R12',
];

function compile_define([name, params, ...body], destination, scope) {
  // Add this function to outer scope
  scope[name] = name.replace(/-/g, '_'); // hyphens aren't valid in assembly labels

  // Copy outer scope so parameter mappings aren't exposed in outer scope.
  const childScope = { ...scope };

  emit(0, `${scope[name]}:`);

  params.forEach(function (param, i) {
    const register = PARAM_REGISTERS[i];
    const local = LOCAL_REGISTERS[i];
    emit(1, `PUSH ${local}`);
    emit(1, `MOV ${local}, ${register}`);
    // Store parameter mapped to associated local
    childScope[param] = local;
  });

  // Pass childScope in for reference when body is compiled.
  compile_expression(body[0], 'RAX', childScope);

  params.forEach(function (param, i) {
    // Backwards first
    const local = LOCAL_REGISTERS[params.length - i - 1];
    emit(1, `POP ${local}`);
  });

  emit(1, 'RET\n');
}

And we're set.

Cleanup

We've still got a few messes going on:

  • emit_prefix wraps our entire body in _main, but we're requiring our own main now
  • we emit to stdout instead of to a file
  • multiple function definitions are treated as nonsense

Starting with the first, we rewrite emit_prefix and emit_postfix so that our _main just calls main.

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  emit(0, 'plus:');
  emit(1, 'ADD RDI, RSI');
  emit(1, 'MOV RAX, RDI');
  emit(1, 'RET\n');
}

function emit_postfix() {
  emit(0, '_main:');
  emit(1, 'CALL main');
  emit(1, 'MOV RDI, RAX'); // Set exit arg
  emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`);
  emit(1, 'SYSCALL');
}

Next to deal with writing to a file instead of stdout, we need our emit function to write to a buffer. We'll let ulisp.js write that buffer to a file. Because we're incredibly lazy, we'll do this all globally.

let OUT = '';

function emit(depth, args) {
  const indent = new Array(depth + 1).join('  ');
  OUT += `${indent}${args}\n`;
}

...

module.exports.compile = function (ast) {
  OUT = '';

  emit_prefix();
  compile_call(ast[0], ast.slice(1), 'RAX', {});
  emit_postfix();

  return OUT;
}

And modify ulisp.js:

const cp = require('child_process');
const fs = require('fs');

const { parse } = require('./parser');
const { compile } = require('./compiler');

function main (args) {
  const input = fs.readFileSync(args[2]).toString();
  const [ast] = parse(input);
  const program = compile(ast);

  try {
    fs.mkdirSync('build');
  } catch (e) {}
  fs.writeFileSync('build/prog.s', program);
  cp.execSync('gcc -mstackrealign -masm=intel -o build/a.out build/prog.s');
}

main(process.argv);

And we're finally ready to run a simple program.

A program!

$ cat test.lisp
(def main () (+ 1 2))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
3

Hurray! Now let's try defining and calling a second function to validate parameter behavior.

$ cat test.lisp
(def plus-two (a)
     (+ a 2))

(def main ()
     (plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
./compiler.js:106
    throw new Error('Attempt to call undefined function: ' + fun);
    ^

Error: Attempt to call undefined function: p2

...

We start getting some really weird errors. The reason is that our compiler doesn't know how to deal with sibling S-expressions.

So we'll introduce a new primitive function called begin that compiles all its sibling expressions and returns the value of the last one. Then we'll wrap the program in an implicit begin so we don't need to write it out ourselves.

function compile_begin(body, destination, scope) {
  body.forEach((expression) => compile_expression(expression, 'RAX', scope));
  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

const primitive_functions = {
  def: compile_define,
  begin: compile_begin,
};

...

module.exports.compile = function (ast) {
  OUT = '';

  emit_prefix();
  compile_call('begin', ast, 'RAX', {});
  emit_postfix();

  return OUT;
}

And we try our test program again. :)

$ cat test.lisp
(def plus-two (a)
     (+ a 2))

(def main ()
     (plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
5

And that's all there is to it! Stay tuned for the next post on conditionals and tail-call optimization.

December 27, 2018

Make small changes and solve the problems you have

Two frustrating things that can happen in an organization are 1) big changes and 2) changes that aren’t clearly associated with a known problem. It’s even worse in that order.

These situations tend to happen when a problem remains unaddressed for too long. These situations tend to happen when there is not a strong enough emphasis on respect for all employees -- their experience, ideas, and feelings.

I try to avoid these issues in teams I run by starting early with a problem statement. Specifically when there’s a problem I’d like to solve, I’ll mention it in our fortnightly team retro. If there’s general agreement a problem exists, we begin looking for the least invasive/least effort way to fix the problem. More on that later.

If the problem is not well understood or widely enough shared, I'll table the discussion until I can talk with more people to better articulate the problem. Or maybe there isn't a problem after all.

This process of clarifying and agreeing that a problem exists is the only appropriate first step when making a change. It is important to provide sufficient context to affected employees.

After the problem is understood I begin to suggest possible solutions -- soliciting feedback and alternatives. But making sure a problem is well understood is not the same thing as making sure that potential solutions could reasonably solve it. Throughout the discussion of solutions I try to repeatedly make sure that proposed solutions could actually address the problem.

From there I try to steer discussion of solutions to ones that are easiest to make and least invasive. Small changes are easier to make. There is little room for disagreement when there is little changing.

Making small changes among a small group of people is even easier. The few disagreements that you find when making small changes among a small group of people give you a chance to prove or improve the solution before introducing it to a larger group.

Communicating frequently and effectively should be a clear theme here.

At this point if there is a single most reasonable solution, I’ll pick it unless there is serious disagreement. Most of the time folks are amenable to the need for a solution to be chosen to solve a problem they agreed existed, even if they don’t love the solution.

If there is no clear solution or there is serious disagreement, go back a few paragraphs and start over to understand the problem and solicit feedback and alternative solutions. Or take the heat of serious disagreement.

This is a philosophy. It's difficult to prove its effectiveness one way or the other -- especially over the mid-to-long term. But the logic makes sense to me, it agrees with what I've read on management, and it has worked effectively in teams I've run so far.


December 10, 2018

Introducing Vitess 3.0

We are pleased to announce Vitess 3.0, a major upgrade from Vitess 2.2. Every major release of Vitess brings new features and improvements for our community, and also sometimes introduces new ways of doing things. Vitess 3.0 is built using “pure” Go 1.11 (we do not need cgo now) and supports MySQL 8.0 and MariaDB 10.3. The key features of Vitess 3.0 are:

  • Usability: simplified db parameters for vttablet, formal support for externally managed mysql, more tutorials including how to run Vitess with Minikube, and Prometheus monitoring
  • Performance: snappier InitShardMaster and PlannedReparentShard, improved coordination with Orchestrator, vtbench for benchmarking, MySQL protocol performance improvements, faster reparents, 4X faster parallelized backups, and many more performance improvements
  • VReplication: improvements to resharding

We first showcased VReplication at the first ever Vitess meetup, held at Slack in July 2018.

November 20, 2018

Writing a lisp compiler from scratch in JavaScript: 1. lisp to assembly

Next in compiler basics:
2. user-defined functions and variables
3. LLVM
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade

In this post we'll write a simple compiler in Javascript (on Node) without any third-party libraries. Our goal is to take an input program like (+ 1 (+ 2 3)) and produce an output assembly program that does these operations to produce 6 as the exit code. The resulting compiler can be found here.

We'll cover:

  • Parsing
  • Code generation
  • Assembly basics
  • Syscalls

And for now we'll omit:

  • Programmable function definitions
  • Non-symbol/-numeric data types
  • More than 3 function arguments
  • Lots of safety
  • Lots of error messages

Parsing

We pick the S-expression syntax mentioned earlier because it's very easy to parse. Furthermore, our input language is so limited that we won't even break our parser into separate lexing/parsing stages.

Once you need to support string literals, comments, decimal literals, and other more complex literals it becomes easier to use separate stages.

If you're curious about these separate stages of parsing, you may be interested in my post on writing a JSON parser.

Or, check out my BSDScheme project for a fully-featured lexer and parser for Scheme.

The parser should produce an Abstract Syntax Tree (AST), a data structure representing the input program. Specifically, we want (+ 1 (+ 2 3)) to produce ['+', 1, ['+', 2, 3]] in Javascript.

There are many different ways to go about parsing but the most intuitive to me is to have a function that accepts a program (a string) and returns a tuple containing the program parsed so far (an AST) and the rest of the program (a string) that hasn't been parsed.

That leaves us with a function skeleton that looks like this:

module.exports.parse = function parse(program) {
  const tokens = [];

  ... logic to be added ...

  return [tokens, ''];
};

The code that initially calls parse will thus have to deal with unwrapping the outermost tuple to get to the AST. For a more helpful compiler we could check that the entire program was actually parsed by failing if the second element of the return result is not the empty string.

Within the function we will iterate over each character and accumulate until we hit a space or a left or right parenthesis:

module.exports.parse = function parse(program) {
  const tokens = [];
  let currentToken = '';

  for (let i = 0; i < program.length; i++) {
    const char = program.charAt(i);

    switch (char) {
      case '(': // TODO
        break;
      case ')': // TODO
        break;
      case ' ':
        tokens.push(+currentToken || currentToken);
        currentToken = '';
        break;
      default:
        currentToken += char;
        break;
    }
  }

  return [tokens, ''];
};

The recursive parts are always the most challenging. The right paren is easiest. We must push the current token and return all tokens with the rest of the program:

module.exports.parse = function parse(program) {
  const tokens = [];
  let currentToken = '';

  for (let i = 0; i < program.length; i++) {
    const char = program.charAt(i);

    switch (char) {
      case '(': // TODO
        break;
      case ')':
        tokens.push(+currentToken || currentToken);
        return [tokens, program.substring(i + 1)];
      case ' ':
        tokens.push(+currentToken || currentToken);
        currentToken = '';
        break;
      default:
        currentToken += char;
        break;
    }
  }

  return [tokens, ''];
};

Finally the left paren should recurse, add the parsed tokens to the list of sibling tokens, and force the loop to start at the new unparsed point.

module.exports.parse = function parse(program) {
  const tokens = [];
  let currentToken = '';

  for (let i = 0; i < program.length; i++) {
    const char = program.charAt(i);

    switch (char) {
      case '(': {
        const [parsed, rest] = parse(program.substring(i + 1));
        tokens.push(parsed);
        program = rest;
        i = 0;
        break;
      }
      case ')':
        tokens.push(+currentToken || currentToken);
        return [tokens, program.substring(i + 1)];
      case ' ':
        tokens.push(+currentToken || currentToken);
        currentToken = '';
        break;
      default:
        currentToken += char;
        break;
    }
  }

  return [tokens, ''];
};

Assuming this is all in parser.js, let's try it out in the REPL:

$ node
> const { parse } = require('./parser');
undefined
> console.log(JSON.stringify(parse('(+ 3 (+ 1 2))')));
[[["+",3,["+",1,2]]],""]

Solid. We move on.

Assembly 101

Assembly is essentially the lowest-level programming language we can use. It is a human readable, 1:1 representation of the binary instructions the CPU can interpret. Conversion from assembly to binary is done with an assembler; the reverse step is done with a disassembler. We'll use gcc for assembling since it deals with some oddities of assembly programming on macOS.

The primary data structures in assembly are registers (temporary variables stored by the CPU) and the program stack. Every function in a program has access to the same registers, but convention cordons off sections of the stack for each function so it ends up being a slightly more durable store than registers. RAX, RDI, RDX, and RSI are a few registers available to us.

Now we only need to know a few instructions to compile our program (the rest of programming assembly is convention):

  • MOV: store one register's content into another, or store a literal number into a register
  • ADD: store the sum of two registers' contents in the first register
  • PUSH: store a register's content on the stack
  • POP: remove the top-most value from the stack and store in a register
  • CALL: enter a new section of the stack and start running the function
  • RET: enter the calling function's stack and resume evaluating from the next instruction after the call
  • SYSCALL: like CALL but where the function is handled by the kernel

Function calling convention

Assembly instructions are flexible enough that there is no language-defined way to make function calls. Therefore it is important to answer (at least) the following few questions:

  • Where are parameters stored by the caller so that the callee has access to them?
  • Where is the return value stored by the callee so the caller has access to it?
  • What registers are saved by whom?

Without getting too far into the specifics, we'll assume the following answers for development on x86_64 macOS and Linux systems:

  • Parameters are stored (in order) in the RDI, RSI, and RDX registers
    • We won't support passing more than three arguments
  • The return value is stored in the RAX register
  • The RDI, RSI, and RDX registers are saved by the caller
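As a sketch of how this convention will drive code generation, a hypothetical helper (ours, not part of the compiler) might pair each argument with the register it travels in:

```javascript
// Registers that carry the first three arguments, per the convention above.
const PARAM_REGISTERS = ['RDI', 'RSI', 'RDX'];

// Hypothetical helper: pair each argument with its register, in order.
function assign_registers(args) {
  if (args.length > PARAM_REGISTERS.length) {
    throw new Error('only up to three arguments are supported');
  }
  return args.map((arg, i) => [PARAM_REGISTERS[i], arg]);
}

console.log(assign_registers([3, 9])); // [ [ 'RDI', 3 ], [ 'RSI', 9 ] ]
```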

Code generation

With assembly basics and the function call convention in mind, we've got enough to generate code from the parsed program's AST.

The skeleton of our compile code will look like this:

function emit(depth, code) {
  const indent = '  '.repeat(depth);
  console.log(indent + code);
}

function compile_argument(arg, destination) {
  // If arg AST is a list, call compile_call on it

  // Else must be a literal number, store in destination register
}

function compile_call(fun, args, destination) {
  // Save param registers to the stack

  // Compile arguments and store in param registers

  // Call function

  // Restore param registers from the stack

  // Move result into destination if provided
}

function emit_prefix() {
  // Assembly prefix
}

function emit_postfix() {
  // Assembly postfix
}

module.exports.compile = function compile(ast) {
  emit_prefix();
  compile_call(ast[0], ast.slice(1));
  emit_postfix();
};

From the pseudo-code in the comments, this is simple enough to fill in. Let's fill in everything but the prefix and postfix code.

function compile_argument(arg, destination) {
  // If arg AST is a list, call compile_call on it
  if (Array.isArray(arg)) {
    compile_call(arg[0], arg.slice(1), destination);
    return;
  }

  // Else must be a literal number, store in destination register
  emit(1, `MOV ${destination}, ${arg}`);
}

const BUILTIN_FUNCTIONS = { '+': 'plus' };
const PARAM_REGISTERS = ['RDI', 'RSI', 'RDX'];

function compile_call(fun, args, destination) {
  // Save param registers to the stack
  args.forEach((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile arguments and store in param registers
  args.forEach((arg, i) => compile_argument(arg, PARAM_REGISTERS[i]));

  // Call function
  emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || fun}`);

  // Restore param registers from the stack
  args.forEach((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  // Move result into destination if provided
  if (destination) {
    emit(1, `MOV ${destination}, RAX`);
  }

  emit(0, ''); // For nice formatting
}
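To sanity-check what this emits, here is a self-contained run of the two functions above on the AST for (+ 1 2); the only change is that emit collects lines into an array here so the result is easy to inspect:

```javascript
// emit collects lines instead of printing, for easy inspection.
const lines = [];
function emit(depth, code) {
  lines.push('  '.repeat(depth) + code);
}

const BUILTIN_FUNCTIONS = { '+': 'plus' };
const PARAM_REGISTERS = ['RDI', 'RSI', 'RDX'];

function compile_argument(arg, destination) {
  // If arg AST is a list, call compile_call on it
  if (Array.isArray(arg)) {
    compile_call(arg[0], arg.slice(1), destination);
    return;
  }
  // Else must be a literal number, store in destination register
  emit(1, `MOV ${destination}, ${arg}`);
}

function compile_call(fun, args, destination) {
  args.forEach((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));
  args.forEach((arg, i) => compile_argument(arg, PARAM_REGISTERS[i]));
  emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || fun}`);
  args.forEach((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));
  if (destination) {
    emit(1, `MOV ${destination}, RAX`);
  }
  emit(0, '');
}

compile_call('+', [1, 2]);
console.log(lines.join('\n'));
// Prints:
//   PUSH RDI
//   PUSH RSI
//   MOV RDI, 1
//   MOV RSI, 2
//   CALL plus
//   POP RSI
//   POP RDI
```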

In a better compiler, we would not make plus a built-in function. We'd emit code for the assembly instruction ADD. However, making plus a function makes code generation simpler and also allows us to see what function calls look like.
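For illustration, a hypothetical compile_plus_inline (not part of the compiler above) that handles the two-literal case by emitting ADD directly might look like:

```javascript
// Hypothetical inline version: emit ADD directly rather than CALL plus.
// Only handles two literal-number arguments, for illustration.
function compile_plus_inline(args, destination, emit) {
  const [a, b] = args;
  emit(1, `MOV ${destination}, ${a}`);
  emit(1, `MOV RAX, ${b}`);          // use RAX as a scratch register
  emit(1, `ADD ${destination}, RAX`); // destination = a + b
}

const out = [];
compile_plus_inline([3, 4], 'RDI', (depth, code) => out.push(code));
console.log(out); // [ 'MOV RDI, 3', 'MOV RAX, 4', 'ADD RDI, RAX' ]
```

This avoids the call overhead entirely, but it would complicate compile_call, which currently treats every function uniformly.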

We'll define the plus built-in function in the prefix code.

The prefix

Assembly programs consist of a few "sections" in memory. The most important are the text and data sections. text is a read-only section where the program instructions themselves are stored. The CPU is instructed to start interpreting from some location in this text section and increments through instructions, evaluating each one until it reaches an instruction that tells it to jump to a different location (e.g. CALL, RET, or JMP).

To denote the text section we emit .text in our prefix before we emit our generated code.

The data section is for statically initialized values (e.g. global variables). We don't have any need for that right now so we'll ignore it.

Here is a good read with more detail on these (and other) sections.

Additionally, we need to emit an entrypoint (we'll use _main) and add a directive (.global _main) so that the location of this entrypoint is visible externally. This is important because we let gcc handle the hairier parts of generating an executable file and it needs access to the entrypoint.

So far, our prefix looks like this:

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  // TODO: add built-in functions

  emit(0, '_main:');
}

The last part of our prefix needs to include the plus built-in function. For this, we add the contents of the first two parameter registers we agreed on (RDI and RSI) and store the result in RAX.

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  emit(0, 'plus:');
  emit(1, 'ADD RDI, RSI');
  emit(1, 'MOV RAX, RDI');
  emit(1, 'RET\n');

  emit(0, '_main:');
}

And we're golden.

The postfix

The job of the postfix is simple: call exit with the value of RAX, since it holds the result of the last function called by the program.

exit is a syscall, so we'll use the SYSCALL instruction to call it. The x86_64 calling convention on macOS and Linux for SYSCALL defines parameters the same way CALL does. But we also need to tell SYSCALL what syscall to call. The convention is to set RAX to the integer representing the syscall on the current system. On Linux it will be 60; on macOS it is 0x2000001.

When I say "convention", I don't mean that you really have a choice as a programmer. It was arbitrary when the operating system and standard libraries chose it. But if you want to write a working program that uses syscalls or calls out to (say) glibc, you'll need to follow these conventions.

The postfix then looks like this:

const os = require('os');

const SYSCALL_MAP = os.platform() === 'darwin' ? {
  'exit': '0x2000001',
} : {
  'exit': 60,
};

function emit_postfix() {
  emit(1, 'MOV RDI, RAX'); // Set exit arg
  emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`); // Set syscall number
  emit(1, 'SYSCALL');
}

And we're set here too.

Putting it all together

We can finally write our Javascript entrypoint and run our compiler against a sample program.

The entrypoint might look like this:

const { parse } = require('./parser');
const { compile } = require('./compiler');

function main(args) {
  const script = args[2];
  const [ast] = parse(script);
  compile(ast[0]);
}

main(process.argv);

And we can call it like so:

$ node ulisp.js '(+ 3 (+ 2 1))'
  .global _main

  .text

plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

_main:
  PUSH RDI
  PUSH RSI
  MOV RDI, 3
  PUSH RDI
  PUSH RSI
  MOV RDI, 2
  MOV RSI, 1
  CALL plus
  POP RSI
  POP RDI
  MOV RSI, RAX

  CALL plus
  POP RSI
  POP RDI

  MOV RDI, RAX
  MOV RAX, 0x2000001
  SYSCALL

Generating an executable file from the output

If we redirect the previous output to an assembly file and call gcc on it, we can generate a program we can run. Then we can echo the $? variable to see the exit code of the previous process.

$ node ulisp.js '(+ 3 (+ 2 1))' > program.S
$ gcc -mstackrealign -masm=intel -o program program.S
$ ./program
$ echo $?
6
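One caveat: the shell only reports the low 8 bits of an exit status, so results of 256 or more wrap around. A quick check of that masking in Node:

```javascript
// Exit statuses are truncated to 8 bits by the OS, so a program
// exiting with 300 is reported to the shell as 300 & 0xff.
const reported = 300 & 0xff;
console.log(reported); // 44
```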

And we've got a working compiler! The full source of the compiler is available here.

Further reading