a curated list of database news from authoritative sources

January 20, 2019

Writing a lisp compiler from scratch in JavaScript: 2. user-defined functions and variables

Previously in compiler basics:
1. lisp to assembly

Next in compiler basics:
3. LLVM
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade

In this post we'll extend the compiler to support defining functions and variables. Additionally, we'll require the program's entrypoint to be within a main function.

The resulting code can be found here.

Function definition

The simplest function definition we need to support is our main function, which will look like this:

$ cat basic.lisp
(def main ()
     (+ 1 2))

Compiling and running it should produce a return code of 3:

$ node ulisp.js basic.lisp
$ ./build/a.out
$ echo $?
3

Parsing function definitions

The entire language is defined in S-expressions and we already parse S-expressions.

$ node
> const { parse } = require('./parser');
> JSON.stringify(parse('(def main () (+ 1 2))'));
'[[["def","main",[],["+",1,2]]],""]'

So we're done!

Code generation

There are a few tricky parts to code generation once function definitions are introduced:

  • Function definitions are not expressions (in assembly)
  • Function calling conventions for the callee
  • Variable scope

Function definitions

A function definition looks like a function call. So we'll need to keep a list of "primitive" functions that handle what looks like function calls differently.

function compile_define() {
  // TODO
}

const primitive_functions = {
  def: compile_define,
};

Then in our compile_call function we need to see if the function being "called" is in this list. If so, we allow the associated callback to handle compilation.

function compile_call(fun, args, destination) {
  if (primitive_functions[fun]) {
    primitive_functions[fun](args, destination);
    return;
  }

  // Save param registers
  args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile arguments and store in param registers
  args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i]));

  emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || fun}`);

  // Restore param registers
  args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

Now we can begin thinking about compile_define. It takes args which will be a list of three elements containing the function's:

  • name
  • parameters
  • and body

It does not use destination because we're treating function definitions as statements for now and not as expressions. If we were treating it as an expression, we might store the address of the function in the destination register. We keep destination around to keep the primitive function signatures consistent.
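If we ever did make that change, a hypothetical sketch might load the function's address into the destination register:

// Hypothetical: make a definition evaluate to the function's address
emit(1, `LEA ${destination}, [RIP + ${name}]`);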

Based on how we called functions before and how we defined the hard-coded plus function, we know what a function definition in assembly generally looks like. And we know the arguments to the function when called will be in RDI, RSI, and RDX.

function compile_define([name, parameters, body]) {
  // Function name becomes a label we can CALL
  emit(0, `${name}:`);

  // Something to do with RDI, RSI, RDX and the parameters variable?

  // We renamed compile_argument to compile_expression to be more general
  compile_expression(body[0], 'RAX');

  // Maybe some cleanup to do with RDI, RSI, RDX?

  emit(1, 'RET\n');
}

Not a bad first sketch. But how do we match up RDI, RSI, RDX with the user-defined parameter names? For example, in the following:

(def plus-two (a)
     (+ a 2))

It's clear to us that a must match up to RDI. In order to do this we need to track all variables in a scope dictionary mapping the variable name to the register where it's stored.

Additionally, keeping track of scope can help us fail quickly in the following scenario:

(def plus-two (a)
     (+ b 2))

The variable b is used but never defined. It has not been added to the scope dictionary. So our compiler can fail quickly saying there is an undefined variable being referenced.

Taking this a step further, what if we want to catch the following too:

(def plus-two (a)
     (plus a 2))

We're trying to call plus but it has not been defined. We should be able to fail quickly here too. But that means we'll need to track the scope of function names in addition to variables. We'll choose to track function names and variable names in the same scope dictionary.

This is the distinction between a lisp-1 and a lisp-2. We are a lisp-1 like Scheme because we have a single scope. Common Lisp is a lisp-2 because it stores function name scope separately from variable name scope.
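In terms of our compiler's data structures, the difference amounts to one dictionary versus two. A hypothetical sketch:

// lisp-1 (our approach, like Scheme): one dictionary for both
const scope = { 'plus-two': 'plus_two', a: 'RDI' };

// lisp-2 (like Common Lisp): function and variable names kept separately
const functionScope = { 'plus-two': 'plus_two' };
const variableScope = { a: 'RDI' };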

Implementing scope

We need to revise every compile function to accept a scope dictionary (specifically: compile, compile_expression, compile_call, and compile_define). If a variable is referenced, we need to look up its location in the scope dictionary. If a variable is defined (e.g. a function name or a function parameter) we need to add a mapping to the scope dictionary.

Modifying compile_expression is easiest:

function compile_expression(arg, destination, scope) {
  // Is a nested function call, compile it
  if (Array.isArray(arg)) {
    compile_call(arg[0], arg.slice(1), destination, scope);
    return;
  }

  if (scope[arg] || Number.isInteger(arg)) {
    emit(1, `MOV ${destination}, ${scope[arg] || arg}`);
  } else {
    throw new Error('Attempt to reference undefined variable or unsupported literal: ' + arg);
  }
}
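
To make the lookup concrete, here are a few hypothetical calls and the code they'd emit:

compile_expression('a', 'RAX', { a: 'RDI' }); // emits: MOV RAX, RDI
compile_expression(9, 'RDI', {});             // emits: MOV RDI, 9
compile_expression('b', 'RAX', {});           // throws: undefined variable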

Next we modify compile_call:

function compile_call(fun, args, destination, scope) {
  if (primitive_functions[fun]) {
    primitive_functions[fun](args, destination, scope);
    return;
  }

  // Save param registers
  args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile arguments and store in param registers
  args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i], scope));

  const validFunction = BUILTIN_FUNCTIONS[fun] || scope[fun];
  if (validFunction) {
    emit(1, `CALL ${validFunction}`);
  } else {
    throw new Error('Attempt to call undefined function: ' + fun);
  }

  // Restore param registers
  args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

And then compile_define where we modify scope for the first time:

function compile_define([name, params, ...body], destination, scope) {
  // Add this function to outer scope
  scope[name] = name.replace(/-/g, '_');

  // Copy outer scope so parameter mappings aren't exposed in outer scope.
  const childScope = { ...scope };

  emit(0, `${scope[name]}:`);

  params.forEach(function (param, i) {
    const register = PARAM_REGISTERS[i];
    // Store parameter mapped to associated register
    childScope[param] = register;
  });

  // Pass childScope in for reference when body is compiled.
  compile_expression(body[0], 'RAX', childScope);

  emit(1, 'RET\n');
}

And finally we need to modify the entrypoint compile:

module.exports.compile = function (ast) {
  emit_prefix();
  // Pass in new, empty scope mapping
  compile_call(ast[0], ast.slice(1), 'RAX', {});
  emit_postfix();
}

And scope-wise we're pretty good!

Function calling convention: callee

We currently have a problem: we're using the parameter registers to store local variables, which conflicts with how we store parameters for function calls made within the function itself.

Ideally we could store function local variables (including the parameters when we get them) separately from how we store function call parameters within the function.

Thankfully, the calling convention we've been following gives us a set of registers that are callee-preserved. Of them we'll use RBX, RBP, and R12, in that order. These are registers we're free to use so long as we save and restore their contents within the function.

Applying the same storing/restoring strategy to local variables as we did for parameters, we get:

const LOCAL_REGISTERS = [
  'RBX',
  'RBP',
  'R12',
];

function compile_define([name, params, ...body], destination, scope) {
  // Add this function to outer scope
  scope[name] = name.replace(/-/g, '_');

  // Copy outer scope so parameter mappings aren't exposed in outer scope.
  const childScope = { ...scope };

  emit(0, `${scope[name]}:`);

  params.forEach(function (param, i) {
    const register = PARAM_REGISTERS[i];
    const local = LOCAL_REGISTERS[i];
    emit(1, `PUSH ${local}`);
    emit(1, `MOV ${local}, ${register}`);
    // Store parameter mapped to associated local
    childScope[param] = local;
  });

  // Pass childScope in for reference when body is compiled.
  compile_expression(body[0], 'RAX', childScope);

  params.forEach(function (param, i) {
    // Pop in reverse order of how we pushed
    const local = LOCAL_REGISTERS[params.length - i - 1];
    emit(1, `POP ${local}`);
  });

  emit(1, 'RET\n');
}
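
To make this concrete: feeding the plus-two example from earlier through this version of compile_define should emit something like the following.

plus_two:
  PUSH RBX
  MOV RBX, RDI
  PUSH RDI
  PUSH RSI
  MOV RDI, RBX
  MOV RSI, 2
  CALL plus
  POP RSI
  POP RDI
  POP RBX
  RET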

And we're set.

Cleanup

We've still got a few messes going on:

  • emit_prefix wraps our entire body in _main, but we're requiring our own main now
  • we emit to stdout instead of to a file
  • multiple function definitions are treated as nonsense

Starting with the first: we rewrite emit_prefix and emit_postfix so that our _main just calls main.

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  emit(0, 'plus:');
  emit(1, 'ADD RDI, RSI');
  emit(1, 'MOV RAX, RDI');
  emit(1, 'RET\n');
}

function emit_postfix() {
  emit(0, '_main:');
  emit(1, 'CALL main');
  emit(1, 'MOV RDI, RAX'); // Set exit arg
  emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`);
  emit(1, 'SYSCALL');
}

Next, to deal with writing to a file instead of stdout, we need our emit function to write to a buffer. We'll let ulisp.js write that buffer to a file. Because we're incredibly lazy, we'll do this all globally.

let OUT = '';

function emit(depth, args) {
  const indent = new Array(depth + 1).join('  ');
  OUT += `${indent}${args}\n`;
}

...

module.exports.compile = function (ast) {
  OUT = '';

  emit_prefix();
  compile_call(ast[0], ast.slice(1), 'RAX', {});
  emit_postfix();

  return OUT;
}

And modify ulisp.js:

const cp = require('child_process');
const fs = require('fs');

const { parse } = require('./parser');
const { compile } = require('./compiler');

function main (args) {
  const input = fs.readFileSync(args[2]).toString();
  const [ast] = parse(input);
  const program = compile(ast);

  try {
    fs.mkdirSync('build');
  } catch (e) {}
  fs.writeFileSync('build/prog.s', program);
  cp.execSync('gcc -mstackrealign -masm=intel -o build/a.out build/prog.s');
}

main(process.argv);

And we're finally ready to run a simple program.

A program!

$ cat test.lisp
(def main () (+ 1 2))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
3

Hurray! Now let's try defining and calling a second function to validate parameter behavior.

$ cat test.lisp
(def plus-two (a)
     (+ a 2))

(def main ()
     (plus-two 3))
$ node ulisp.js test.lisp
./compiler.js:106
    throw new Error('Attempt to call undefined function: ' + fun);
    ^

Error: Attempt to call undefined function: plus-two

...

We start getting some really weird errors. The reason is that our compiler doesn't know how to deal with sibling S-expressions.

So we'll introduce a new primitive function called begin that compiles all of its sibling expressions and returns the value of the last one. Then we'll wrap the program in an implicit begin so the user doesn't need to.

function compile_begin(body, destination, scope) {
  body.forEach((expression) => compile_expression(expression, 'RAX', scope));
  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

const primitive_functions = {
  def: compile_define,
  begin: compile_begin,
};

...

module.exports.compile = function (ast) {
  OUT = '';

  emit_prefix();
  compile_call('begin', ast, 'RAX', {});
  emit_postfix();

  return OUT;
}

And we try our test program again. :)

$ cat test.lisp
(def plus-two (a)
     (+ a 2))

(def main ()
     (plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
5

And that's all there is to it! Stay tuned for the next post on conditionals and tail-call optimization.

December 27, 2018

Make small changes and solve the problems you have

Two frustrating things that can happen in an organization are 1) big changes and 2) changes that aren’t clearly associated with a known problem. It’s even worse in that order.

These situations tend to happen when a problem remains unaddressed for too long. These situations tend to happen when there is not a strong enough emphasis on respect for all employees -- their experience, ideas, and feelings.

I try to avoid these issues in teams I run by starting early with a problem statement. Specifically when there’s a problem I’d like to solve, I’ll mention it in our fortnightly team retro. If there’s general agreement a problem exists, we begin looking for the least invasive/least effort way to fix the problem. More on that later.

If the problem is not well understood or widely-enough shared, I’ll table the discussion until I can talk with more people to better articulate the problem. Or maybe there isn’t a problem after all.

This process of clarifying and agreeing that a problem exists is the only appropriate first step when making a change. It is important to provide sufficient context to affected employees.

After the problem is understood I begin to suggest possible solutions -- soliciting feedback and alternatives. But making sure a problem is well understood is not the same thing as making sure that potential solutions could reasonably solve the problem. Throughout the discussion of solutions I try to repeatedly make sure that proposed solutions could actually address the problem.

From there I try to steer discussion of solutions to ones that are easiest to make and least invasive. Small changes are easier to make. There is little room for disagreement when there is little changing.

Making small changes among a small group of people is even easier. The few disagreements that you find when making small changes among a small group of people give you a chance to prove or improve the solution before introducing it to a larger group.

Communicating frequently and effectively should be a clear theme here.

At this point if there is a single most reasonable solution, I’ll pick it unless there is serious disagreement. Most of the time folks are amenable to the need for a solution to be chosen to solve a problem they agreed existed, even if they don’t love the solution.

If there is no clear solution or there is serious disagreement, go back a few paragraphs and start over to understand the problem and solicit feedback and alternatives for solutions. Or take the heat of serious disagreement.

This is a philosophy. It’s difficult to prove the effectiveness one way or the other -- especially over the mid-to-long-term. But the logic makes sense to me, it agrees with what I’ve read on management, and has worked effectively in teams I’ve run so far.


December 10, 2018

Introducing Vitess 3.0

We are pleased to announce Vitess 3.0, a major upgrade from Vitess 2.2. Every major release of Vitess brings new features and improvements for our community, and also sometimes introduces new ways of doing things. Vitess 3.0 is built using “pure” go 1.11 (we no longer need cgo) and supports MySQL 8.0 and MariaDB 10.3. The key features of Vitess 3.0 are:

Usability
  • Simplified db parameters for vttablet
  • Formal support for externally managed mysql
  • More tutorials, including how to run Vitess with Minikube
  • Prometheus monitoring

Performance improvements
  • Snappier InitShardMaster and PlannedReparentShard
  • Improved coordination with Orchestrator
  • vtbench for benchmarking
  • MySQL protocol performance improvements
  • Faster reparents
  • 4X faster parallelized backups
  • Many more performance improvements

VReplication
  • Improvements to resharding

We first showcased VReplication at the first ever Vitess meetup held at Slack in July 2018.

November 20, 2018

Writing a lisp compiler from scratch in JavaScript: 1. lisp to assembly

Next in compiler basics:
2. user-defined functions and variables
3. LLVM
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade

In this post we'll write a simple compiler in Javascript (on Node) without any third-party libraries. Our goal is to take an input program like (+ 1 (+ 2 3)) and produce an output assembly program that does these operations to produce 6 as the exit code. The resulting compiler can be found here.

We'll cover:

  • Parsing
  • Code generation
  • Assembly basics
  • Syscalls

And for now we'll omit:

  • Programmable function definitions
  • Non-symbol/-numeric data types
  • More than 3 function arguments
  • Lots of safety
  • Lots of error messages

Parsing

We pick the S-expression syntax mentioned earlier because it's very easy to parse. Furthermore, our input language is so limited that we won't even break our parser into separate lexing/parsing stages.

Once you need to support string literals, comments, decimal literals, and other more complex literals it becomes easier to use separate stages.

If you're curious about these separate stages of parsing, you may be interested in my post on writing a JSON parser.

Or, check out my BSDScheme project for a fully-featured lexer and parser for Scheme.

The parser should produce an Abstract Syntax Tree (AST), a data structure representing the input program. Specifically, we want (+ 1 (+ 2 3)) to produce ['+', 1, ['+', 2, 3]] in Javascript.

There are many different ways to go about parsing but the most intuitive to me is to have a function that accepts a program (a string) and returns a tuple containing the program parsed so far (an AST) and the rest of the program (a string) that hasn't been parsed.

That leaves us with a function skeleton that looks like this:

module.exports.parse = function parse(program) {
  const tokens = [];

  ... logic to be added ...

  return [tokens, ''];
};

The code that initially calls parse will thus have to deal with unwrapping the outermost tuple to get to the AST. For a more helpful compiler we could check that the entire program was actually parsed by failing if the second element of the return result is not the empty string.
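For example, a stricter entrypoint might do something like this (a sketch; our actual entrypoint won't bother):

const [ast, rest] = parse(program);
if (rest !== '') {
  throw new Error('Expected end of program, got: ' + rest);
}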

Within the function we will iterate over each character and accumulate until we hit a space or a left or right parenthesis:

module.exports.parse = function parse(program) {
  const tokens = [];
  let currentToken = '';

  for (let i = 0; i < program.length; i++) {
    const char = program.charAt(i);

    switch (char) {
      case '(': // TODO
        break;
      case ')': // TODO
        break;
      case ' ':
        tokens.push(+currentToken || currentToken);
        currentToken = '';
        break;
      default:
        currentToken += char;
        break;
    }
  }

  return [tokens, ''];
};

The recursive parts are always the most challenging. The right paren is easiest. We must push the current token and return all tokens with the rest of the program:

module.exports.parse = function parse(program) {
  const tokens = [];
  let currentToken = '';

  for (let i = 0; i < program.length; i++) {
    const char = program.charAt(i);

    switch (char) {
      case '(': // TODO
        break;
      case ')':
        tokens.push(+currentToken || currentToken);
        return [tokens, program.substring(i + 1)];
      case ' ':
        tokens.push(+currentToken || currentToken);
        currentToken = '';
        break;
      default:
        currentToken += char;
        break;
    }
  }

  return [tokens, ''];
};

Finally the left paren should recurse, add the parsed tokens to the list of sibling tokens, and force the loop to start at the new unparsed point.

module.exports.parse = function parse(program) {
  const tokens = [];
  let currentToken = '';

  for (let i = 0; i < program.length; i++) {
    const char = program.charAt(i);

    switch (char) {
      case '(': {
        const [parsed, rest] = parse(program.substring(i + 1));
        tokens.push(parsed);
        program = rest;
        i = 0;
        break;
      }
      case ')':
        tokens.push(+currentToken || currentToken);
        return [tokens, program.substring(i + 1)];
      case ' ':
        tokens.push(+currentToken || currentToken);
        currentToken = '';
        break;
      default:
        currentToken += char;
        break;
    }
  }

  return [tokens, ''];
};

Assuming this is all in parser.js, let's try it out in the REPL:

$ node
> const { parse } = require('./parser');
undefined
> console.log(JSON.stringify(parse('(+ 3 (+ 1 2)')));
[[["+",3,["+",1,2]]],""]

Solid. We move on.

Assembly 101

Assembly is essentially the lowest-level programming language we can use. It is a human readable, 1:1 representation of the binary instructions the CPU can interpret. Conversion from assembly to binary is done with an assembler; the reverse step is done with a disassembler. We'll use gcc for assembling since it deals with some oddities of assembly programming on macOS.

The primary data structures in assembly are registers (temporary variables stored by the CPU) and the program stack. Every function in a program has access to the same registers, but convention cordons off sections of the stack for each function so it ends up being a slightly more durable store than registers. RAX, RDI, RDX, and RSI are a few registers available to us.

Now we only need to know a few instructions to compile our program (the rest of programming assembly is convention):

  • MOV: store one register's content into another, or store a literal number into a register
  • ADD: store the sum of two registers' contents in the first register
  • PUSH: store a register's content on the stack
  • POP: remove the top-most value from the stack and store in a register
  • CALL: enter a new section of the stack and start running the function
  • RET: enter the calling function's stack and resume evaluating from the next instruction after the call
  • SYSCALL: like CALL but where the function is handled by the kernel
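
As a contrived sketch (not code our compiler will emit), here is how a few of these instructions read together:

MOV RAX, 5      # store the literal 5 in RAX
MOV RDI, RAX    # copy the contents of RAX into RDI
ADD RDI, RAX    # RDI now holds 10
PUSH RDI        # save RDI's contents on the stack
POP RSI         # pop that value off the stack into RSI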

Function calling convention

Assembly instructions are flexible enough that there is no language-defined way to make function calls. Therefore it is important to answer (at least) the following few questions:

  • Where are parameters stored by the caller so that the callee has access to them?
  • Where is the return value stored by the callee so the caller has access to it?
  • What registers are saved by whom?

Without getting too far into the specifics, we'll assume the following answers for development on x86_64 macOS and Linux systems:

  • Parameters are stored (in order) in the RDI, RSI, and RDX registers
    • We won't support passing more than three arguments
  • The return value is stored in the RAX register
  • RDI, RSI, and RDX registers are stored by the caller

Code generation

With assembly basics and the function call convention in mind, we've got enough to generate code from the parsed program's AST.

The skeleton of our compile code will look like this:

function emit(depth, code) {
  const indent = new Array(depth + 1).map(() => '').join('  ');
  console.log(indent + code);
}

function compile_argument(arg, destination) {
  // If arg AST is a list, call compile_call on it

  // Else must be a literal number, store in destination register
}

function compile_call(fun, args, destination) {
  // Save param registers to the stack

  // Compile arguments and store in param registers

  // Call function

  // Restore param registers from the stack

  // Move result into destination if provided
}

function emit_prefix() {
  // Assembly prefix
}

function emit_postfix() {
  // Assembly postfix
}

module.exports.compile = function compile(ast) {
  emit_prefix();
  compile_call(ast[0], ast.slice(1));
  emit_postfix();
};

From our pseudo-code in comments it is simple enough to fill in. Let's fill in everything but the prefix and postfix code.

function compile_argument(arg, destination) {
  // If arg AST is a list, call compile_call on it
  if (Array.isArray(arg)) {
    compile_call(arg[0], arg.slice(1), destination);
    return;
  }

  // Else must be a literal number, store in destination register
  emit(1, `MOV ${destination}, ${arg}`);
}

const BUILTIN_FUNCTIONS = { '+': 'plus' };
const PARAM_REGISTERS = ['RDI', 'RSI', 'RDX'];

function compile_call(fun, args, destination) {
  // Save param registers to the stack
  args.forEach((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile arguments and store in param registers
  args.forEach((arg, i) => compile_argument(arg, PARAM_REGISTERS[i]));

  // Call function
  emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || fun}`);

  // Restore param registers from the stack
  args.forEach((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  // Move result into destination if provided
  if (destination) {
    emit(1, `MOV ${destination}, RAX`);
  }

  emit(0, ''); // For nice formatting
}

In a better compiler, we would not make plus a built-in function. We'd emit code for the assembly instruction ADD. However, making plus a function makes code generation simpler and also allows us to see what function calls look like.
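Here is a sketch of what that alternative could look like (hypothetical; compile_plus_inline is not part of our compiler and assumes exactly two arguments):

function compile_plus_inline(args, destination) {
  // Compile both arguments into the first two param registers
  compile_argument(args[0], PARAM_REGISTERS[0]);
  compile_argument(args[1], PARAM_REGISTERS[1]);

  // Emit ADD directly instead of CALLing a plus function
  emit(1, `ADD ${PARAM_REGISTERS[0]}, ${PARAM_REGISTERS[1]}`);

  if (destination) {
    emit(1, `MOV ${destination}, ${PARAM_REGISTERS[0]}`);
  }
}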

We'll define the plus built-in function in the prefix code.

The prefix

Assembly programs consist of a few "sections" in memory, the most important of which are the text and data sections. text is a read-only section where the program instructions themselves are stored. The CPU is instructed to start interpreting from some location in this text section and it will increment through instructions, evaluating each instruction until it reaches an instruction that tells it to jump to a different location to evaluate instructions (e.g. with CALL, RET, or JMP).

To denote the text section we emit .text in our prefix before we emit our generated code.

The data section is for statically initialized values (e.g. global variables). We don't have any need for that right now so we'll ignore it.

Here is a good read with more detail on these (and other) sections.

Additionally, we need to emit an entrypoint (we'll use _main) and add a notice (.global _main) so that the location of this entrypoint is visible externally. This is important because we let gcc handle the hairier parts of generating an executable file and it needs access to the entrypoint.

So far, our prefix looks like this:

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  // TODO: add built-in functions

  emit(0, '_main:');
}

The last part of our prefix needs to include the plus built-in function. For this, we add the first two parameter registers we agreed on (RDI and RSI) and store the result in RAX.

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  emit(0, 'plus:');
  emit(1, 'ADD RDI, RSI');
  emit(1, 'MOV RAX, RDI');
  emit(1, 'RET\n');

  emit(0, '_main:');
}

And we're golden.

The postfix

The job of the postfix is simple: call exit with the value of RAX, since this will be the result of the last function called by the program.

exit is a syscall, so we'll use the SYSCALL instruction to call it. The x86_64 calling convention on macOS and Linux for SYSCALL defines parameters the same way CALL does. But we also need to tell SYSCALL what syscall to call. The convention is to set RAX to the integer representing the syscall on the current system. On Linux it will be 60; on macOS it is 0x2000001.

When I say "convention", I don't mean that you really have a choice as a programmer. It was arbitrary when the operating system and standard libraries chose it. But if you want to write a working program that uses syscalls or calls out to (say) glibc, you'll need to follow these conventions.

The postfix then looks like this:

const os = require('os');

const SYSCALL_MAP = os.platform() === 'darwin' ? {
    'exit': '0x2000001',
} : {
    'exit': 60,
};

function emit_postfix() {
  emit(1, 'MOV RDI, RAX'); // Set exit arg
  emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`); // Set syscall number
  emit(1, 'SYSCALL');
}

And we're set here too.

Putting it all together

We can finally write our Javascript entrypoint and run our compiler against a sample program.

The entrypoint might look like this:

const { parse } = require('./parser');
const { compile } = require('./compiler');

function main(args) {
  const script = args[2];
  const [ast] = parse(script);
  compile(ast[0]);
}

main(process.argv);

And we can call it like so:

$ node ulisp.js '(+ 3 (+ 2 1))'
  .global _main

  .text

plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

_main:
  PUSH RDI
  PUSH RSI
  MOV RDI, 3
  PUSH RDI
  PUSH RSI
  MOV RDI, 2
  MOV RSI, 1
  CALL plus
  POP RSI
  POP RDI
  MOV RSI, RAX

  CALL plus
  POP RSI
  POP RDI

  MOV RDI, RAX
  MOV RAX, 0x2000001
  SYSCALL

Generating an executable file from the output

If we redirect the previous output to an assembly file and call gcc on it, we can generate a program we can run. Then we can echo the $? variable to see the exit code of the previous process.

$ node ulisp.js '(+ 3 (+ 2 1))' > program.s
$ gcc -mstackrealign -masm=intel -o program program.s
$ ./program
$ echo $?
6

And we've got a working compiler! The full source of the compiler is available here.


November 12, 2018

Vitess Weekly Digest - Nov 12 2018

We continue the digest from the Slack discussions for Sep 22 2018 to Oct 5 2018. We've fallen slightly behind on this, but will catch up again soon.

Enable VtGateExecute in vtctld

Arsalan [Sep 22nd]: Hi, I want to query vitess on vtgate but I have below error. How can i fix this problem?

vitess@vtctldv3-hrl74:/$ vtctlclient -server 10.5.61.20:16999 VtGateExecute -server 10.5.61.21:16991 "show tables"
E0923 05:14:55.169771 1102 main.go:61] Remote error: rpc error: code = Unknown desc = query commands are disabled (set the -enable_queries flag to enable)

October 20, 2018

On NYC, Tokyo and Seoul

I’ve lived in NYC for the past year — moved here after years in Philly and after growing up in a rural community a few hours west of there. My wife is South Korean and last week concluded my second trip to the suburbs of Seoul to visit her family. We finished up that trip with a week in Tokyo.

Long a mecha and Godzilla fan, I was struck to find a city not significantly more modern, or significantly more “Eastern”, than NYC. In contrast, the lesser-known Seoul is more modern than both cities and shares as much of an “Eastern” vibe as Tokyo.

I’d go so far as to say that Seoul is the most livable of the three for anyone of a similar background. There are a few concrete areas that led me to this including transportation, apartments, WiFi/cafes, food, and language.

I'll conclude with a few tourist recommendations and a list of books to read on South Korea and Japan if you share my enthusiasm for comparing life in different cities.

Transportation

NYC is one of the few cities in the world with a subway that runs 24/7. Tokyo and Seoul do not share this trait despite being many decades newer. (Tokyo and Seoul were heavily damaged during World War II and the Korean War, respectively.) And despite being built later, Tokyo subway cars are even narrower than NYC subway cars (~8.2ft vs. ~8.5ft).

In contrast, Seoul subway cars are ~10.2ft wide. The difference may seem slight but it is noticeable during rush hour when in Seoul there is space for four people to stand in the aisle versus room for perhaps two in a Tokyo or NYC subway car.

Seoul subway car, source: Travel Blog

The Seoul subway system is also the most advanced in terms of safety. All stations have a floor-to-ceiling barrier with doors that only open when a train arrives. Most stations in Tokyo have a ~3ft tall barrier that does the same, though some stations have no barrier. In NYC there are no barriers anywhere.

Concerning innovation, Seoul and Tokyo both have multiple driverless subway lines whereas NYC has none. But in terms of complexity the NYC subway is the simplest because you pay only once. Seoul and Tokyo subways are slightly more complex in that you swipe your card when you enter and exit (or transfer).

Taxis

It was jarring to be greeted by the very 90s, vaguely British Toyota Crown taxi cabs that dominate the streets of Tokyo.

Source: Phil Eaton

These cabs have no integrated navigation unit but a modern unit was typically mechanically attached. We saw a few of the recently approved Toyota JPN Taxi, but they only account for around 10 percent of cabs. (The integrated navigation is massive, perhaps 10-inch screens.) In contrast, Seoul has a variety of modern cabs all with integrated navigation — the most common of which is the Hyundai Sonata.

Source: The Seoul Guide

Although Japanese car companies pioneered integrated navigation in the 90s, it appears to have been the standard for South Korean car companies for the past 10-20 years.

And then there’s NYC with its primary mix of Crown Victorias and Priuses with multiple 4-inch smartphones mechanically attached for navigation.

Source: New York Post

Living

South Korea has no concept of the suburb oriented around single-family houses. Drive an hour or two out from Seoul or Busan and see the same massive, modern apartment complexes that are found in the city center. After that it's the stark farms of Kansas. Japan appears more like the US in that the city graduates steadily to suburb and farm.

Apartments in Seoul, source: Japan Times

In general, buildings in South Korea are fairly homogeneous. Even the downtown areas of Seoul have little architectural creativity. Tokyo and NYC are both diverse in building styles and sizes. However, NYC takes the cake for ubiquity of massive towers. In fact, the first time my South Korean father-in-law visited Manhattan he was blown away by this mass.

New York City, source: Wikipedia

The most popular neighborhoods in Tokyo seem more developed than their Seoul counterparts, the mass of stores and crowds extends further. And while the average age of buildings in Tokyo seems younger than the average age of buildings throughout Seoul (including less desirable areas), the developed areas (including buildings and streets) of Seoul are significantly cleaner and more modern. In contrast, and on average, Tokyo buildings seem as old as NYC buildings.

Tokyo, source: Fodors

Air quality

Air quality in NYC and Tokyo is high, pollution is low. But in recent times, air quality in Seoul has deteriorated with dangerous levels of fine dust from factories in South Korea and China. It is not clear when or how the South Korean government will address this.

WiFi/Cafes

My idea of a good cafe is a decent ratio of seats to traffic, available electrical outlets, and decent WiFi. NYC and Tokyo have some similarities: chain coffee shops are larger and non-chains are often pretty small. Tokyo differs from NYC in that there are few electrical outlets and in the existence of interior smoking sections. (Tokyo bans smoking while walking but designates areas like parks or inner rooms in restaurants or cafes.)

But the WiFi in Tokyo is abysmal. Many cafes do not have it (though the trend is toward providing it) and even the chains that do provide it have terrible speeds, reaching peaks of 5Mbps down. In NYC WiFi is available at ~20Mbps down at most chains and ~5Mbps at smaller non-chains.

In contrast, South Korea is the jewel of cafe culture. Whereas in the US coffee shop size decreases as population increases, coffee shops in South Korea are oddly enormous everywhere. South Korea is rich with local shops, domestic chains (including the exported Paris Baguette and Tous Les Jours), and foreign chains (South Korea has the highest number of Starbucks Reserve stores per capita of any country).

Starbucks Reserve in Seoul, source: Pulse News

From Jeju Island to Seoul we never worried about a seat or an outlet at a cafe. Furthermore, the WiFi in South Korea is incredible. My tech-hopeless in-laws’ basic internet plan got 80Mbps down and the small cafes near their apartment got at least 40Mbps down.

NYC falls closer to Seoul in terms of ubiquity and speed of WiFi, and has the added benefit of city-provided outdoor WiFi that is surprisingly fast and available throughout the city. NYC is much worse in terms of hours: most cafes close between 8-10pm whereas cafes in Seoul and Tokyo easily stay open past 11pm.

Caveat

It’s not exactly fair to exclude internet cafes, prevalent in both Seoul and Tokyo (oddly even NYC has a few). At an internet cafe in Tokyo you can expect abundant outlets and excellent WiFi (I saw peaks of 40Mbps down). I did not visit an internet cafe in Seoul but I expect it to be similar. In both Seoul and Tokyo you can easily find 24/7 service (with showers!?).

I did not include internet cafes above because I find them slightly less convenient for tourists. Though credit is due: unlike American Chinatown internet cafes, the ones we visited in Tokyo were very clean, spacious and warm.

Internet cafe in Shinjuku, source: Rakutama

Food

Dining out in NYC is similar in cost to other major US cities. The quality is usually pretty good. Tokyo was about as expensive as NYC and generally as high quality. For instance, most dinners in NYC and Tokyo cost about $40-60 for two people. In contrast, most entrees in Seoul are portioned for two, and dinner in total was often about $20-40. Restaurants on average seemed to be lower quality in Seoul compared to Tokyo and New York, but there are still more than enough high quality options.

Language

I am biased having a better knowledge of Korean than Japanese and a South Korean partner to fall back on. But I believe South Korea is the more friendly place for an English speaker in that it is more dedicated to providing English translations and that the written language is simpler. In both cities the penetration of English-speaking natives (and quality of speech and comprehension) is indistinguishable and decent.

To the first point, even the oddest locations and obscure signage had English translations in South Korea (not just Seoul) — not so even within Tokyo.

To the second point, Japanese has three writing systems (kanji, hiragana, and katakana). Kanji (characters originating from Chinese) cannot be replaced in writing by phonetic counterparts in hiragana or katakana. So you have little choice but to memorize all important characters, disregarding the fact that many characters can be broken down. Then you must also memorize the alphabetic systems of hiragana and katakana.

In contrast, Korean has two writing systems (hangul and hanja) where hanja (characters originating from Chinese) is primarily used in formal settings (government forms, academic books, etc.) and can be replaced with the phonetic equivalent in hangul.

This makes it much simpler to memorize and read Korean compared to Japanese.

Assorted recommendations

For New Yorkers, don’t stay in the recommended areas of Shinjuku/Shibuya/Roppongi unless you’re the type who’d enjoy staying around Times Square. These three areas of Tokyo are just as obnoxious albeit much safer. I also don’t recommend the Harajuku area; it is extra. There’s no real equivalent level of crazy in Seoul although Hongdae comes close.

In a future Tokyo trip I’d stick to the Meguro Station area including Ebisu and Daikanyama. They are beautiful, quiet neighborhoods with lots of restaurants and cafes beside the Meguro river. Areas along the Sumida River are also beautiful and quiet. Ginza/Tokyo Station is also a fun-but-not-obnoxious area to visit.

Ebisu, source: Homeaway

I cannot recommend the Edo-Tokyo Museum enough; it is the best city museum I've visited. Tsukiji is also a must-see, reminding me how much I miss going to Reading Terminal Market each weekend in Philly.

In Seoul I’d recommend Yeonnam-Dong, Itaewon (which is much nicer than it’s made out to be), and Gwanghwamun. Mapo-Gu in general is a great region of Gangbuk as is the area below it (near Yeouido) in Gangnam.

Yeonnam-Dong, source: Phil Eaton

I recommend visiting the National Museum of Korea in Seoul as well as Hangang Park and Gyeongui Line Forest Park. The areas around the Tancheon stream flowing South to Bundang are also beautiful.

Tancheon near Bundang, source: Misadventures of an Awkward American

Conclusion

I came to Tokyo with the expectation of a highly modern city fused with Eastern culture. But it is difficult to see many ways it is ahead of NYC technically and it is very similar to NYC culturally. In some ways Tokyo even seems a little stuck in the past or just... off. Why are all vending machines [e.g. for tickets, ordering food, etc.] mechanical and not touch screens? The National Museum of Science is awfully old and ugly, the National Diet Building the same.

So on the one hand I’d like to let the next person down lightly on the excitement of Tokyo. It is a world-class city with great restaurants, live music, and refined culture, but all-in-all very similar to NYC. On the other hand I recommend Seoul for a cheaper, cleaner, more English-speaker-friendly, and genuinely novel city with splashes of "Eastern" romantic elements like Tokyo.

Cherry blossoms in Seoul, source: English Spectrum

Further reading

MITI and the Japanese Miracle: The Growth of Industrial Policy, 1925-1975 is an excellent, albeit somewhat disputed introduction to the modern Japanese economy.

Asia’s Next Giant: South Korea and Late Industrialization is a similar high-quality introduction to the South Korean economy.

If you’re only familiar with US/Canadian companies or other “pure” market economies these two books are a great read on different, challenging styles of government policy, corporate structure, and life.

P.S. I’m looking for book recommendations on the last 20 years of economic/political history in Japan and South Korea and on the last 100 years of economic/political history in the US and NYC.