Writing a minimal Lua implementation with a virtual machine from scratch in Rust
By the end of this guide we'll have a minimal, working implementation of a small part of Lua from scratch. It will be able to run the following program (among others):
function fib(n)
  if n < 2 then
    return n;
  end

  local n1 = fib(n-1);
  local n2 = fib(n-2);
  return n1 + n2;
end

print(fib(30));
This is my second project in Rust and only the third time I've invented an instruction set so don't take my style as gospel. However, I have found some Rust parsing tutorials overly complex so I'm hoping you'll find this one simpler.
All source code is available on Github.
Entrypoint
Running cargo init will give us the necessary boilerplate. In
src/main.rs we'll accept a file name from the command line, perform
lexical analysis to retrieve all tokens from the file, perform grammar
analysis on the tokens to retrieve a tree structure, compile the tree
to a linear set of virtual machine instructions, and finally interpret
the virtual machine instructions.
mod eval;
mod lex;
mod parse;

use std::env;
use std::fs;

fn main() {
    let args: Vec<String> = env::args().collect();
    let contents = fs::read_to_string(&args[1]).expect("Could not read file");
    let raw: Vec<char> = contents.chars().collect();

    let tokens = match lex::lex(&raw) {
        Ok(tokens) => tokens,
        Err(msg) => panic!("{}", msg),
    };

    let ast = match parse::parse(&raw, tokens) {
        Ok(ast) => ast,
        Err(msg) => panic!("{}", msg),
    };

    let pgrm = eval::compile(&raw, ast);
    eval::eval(pgrm);
}
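Assuming the fib example above is saved in a file like test.lua (any filename works), the finished interpreter will be invoked as cargo run -- test.lua, since args[1] is the first argument after the binary's own name.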
Easy peasy. Now let's implement lex.
Lexical analysis
Lexical analysis drops whitespace (Lua is not whitespace sensitive) and chunks all source code characters into their smallest possible meaningful pieces like commas, numbers, identifiers, keywords, etc.
In order to have useful error messages, we'll keep track of state in
the file with a Location struct that implements increment and
debug.
This goes in src/lex.rs.
#[derive(Copy, Clone, Debug)]
pub struct Location {
    col: i32,
    line: i32,
    index: usize,
}
The increment function will update line and column numbers as well
as the current index in the file.
impl Location {
    fn increment(&self, newline: bool) -> Location {
        if newline {
            Location {
                index: self.index + 1,
                col: 0,
                line: self.line + 1,
            }
        } else {
            Location {
                index: self.index + 1,
                col: self.col + 1,
                line: self.line,
            }
        }
    }
And the debug function will dump the current line with a pointer in
text to the current column along with a message.
    pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
        let mut line = 0;
        let mut line_str = String::new();
        // Find the whole line of original source
        for c in raw {
            if *c == '\n' {
                line += 1;
                // Done discovering line in question
                if !line_str.is_empty() {
                    break;
                }
                continue;
            }
            if self.line == line {
                line_str.push_str(&c.to_string());
            }
        }

        let space = " ".repeat(self.col as usize);
        format!("{}\n\n{}\n{}^ Near here", msg.into(), line_str, space)
    }
}
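To make that output format concrete, here's a quick test of debug. It's not part of the original code, just a sketch that could sit in a #[cfg(test)] module at the bottom of src/lex.rs, where the private Location fields are still visible.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn debug_points_at_the_column() {
        // `$` is the character we'll point at: column 10 of line 0.
        let raw: Vec<char> = "local x = $;\n".chars().collect();
        let loc = Location { col: 10, line: 0, index: 10 };

        let out = loc.debug(&raw, "Unrecognized character while lexing:");
        // The message, then the offending line, then a caret under column 10.
        assert!(out.contains("local x = $;"));
        assert!(out.ends_with("^ Near here"));
    }
}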
The smallest individual unit after lexical analysis is a token which is either a keyword, number, identifier, operator, or syntax. (This implementation is clearly skipping lots of real Lua syntax like strings.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
    Identifier,
    Syntax,
    Keyword,
    Number,
    Operator,
}

#[derive(Debug, Clone)]
pub struct Token {
    pub value: String,
    pub kind: TokenKind,
    pub loc: Location,
}
The top-level lex function will iterate over the file and call a lex
helper for each kind of token, returning an array of all tokens on
success. In between lexing it will "eat whitespace".
pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
    let mut loc = Location {
        col: 0,
        index: 0,
        line: 0,
    };
    let size = s.len();
    let mut tokens: Vec<Token> = vec![];

    let lexers = [
        lex_keyword,
        lex_identifier,
        lex_number,
        lex_syntax,
        lex_operator,
    ];

    'outer: while loc.index < size {
        loc = eat_whitespace(s, loc);
        if loc.index == size {
            break;
        }

        for lexer in lexers {
            let res = lexer(s, loc);
            if let Some((t, next_loc)) = res {
                loc = next_loc;
                tokens.push(t);
                continue 'outer;
            }
        }

        return Err(loc.debug(s, "Unrecognized character while lexing:"));
    }

    Ok(tokens)
}
Whitespace
Eating whitespace is just incrementing the location while we see a space, tab, newline, etc.
fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
    let mut c = raw[initial_loc.index];
    let mut next_loc = initial_loc;
    while [' ', '\n', '\r', '\t'].contains(&c) {
        next_loc = next_loc.increment(c == '\n');
        if next_loc.index == raw.len() {
            break;
        }
        c = raw[next_loc.index];
    }
    next_loc
}
Numbers
Lexing numbers iterates through the source starting at a position until it stops seeing decimal digits (this implementation only supports integers).
fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_digit(10) {
        ident.push_str(&c.to_string());
        next_loc = next_loc.increment(false);
        c = raw[next_loc.index];
    }
If there are no digits in the string then this is not a number.
    if !ident.is_empty() {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Number,
            },
            next_loc,
        ))
    } else {
        None
    }
}
Identifiers
Identifiers are any collection of alphabet characters, numbers, and underscores.
fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_alphanumeric() || c == '_' {
        ident.push_str(&c.to_string());
        next_loc = next_loc.increment(false);
        c = raw[next_loc.index];
    }
But they cannot start with a number.
    // First character must not be a digit
    if ident.len() > 0 && !ident.chars().next().unwrap().is_digit(10) {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Identifier,
            },
            next_loc,
        ))
    } else {
        None
    }
}
Keywords
Keywords are alphabetical like identifiers, but they cannot be reused as variable names by the user.
fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = ["function", "end", "if", "then", "local", "return"];

    let mut next_loc = initial_loc;
    let mut value = String::new();
    'outer: for possible_syntax in syntax {
        let mut c = raw[initial_loc.index];
        next_loc = initial_loc;
        while c.is_alphanumeric() || c == '_' {
            value.push_str(&c.to_string());
            next_loc = next_loc.increment(false);
            c = raw[next_loc.index];

            let n = next_loc.index - initial_loc.index;
            // Stop if the value is no longer a prefix of this keyword,
            // including when it has grown longer than the keyword (as in
            // `function1`), which would otherwise panic on the slice below.
            if n > possible_syntax.len() || value != possible_syntax[..n] {
                value = String::new();
                continue 'outer;
            }
        }

        // Not a complete match
        if value.len() < possible_syntax.len() {
            value = String::new();
            continue;
        }

        // If it got to this point it found a match, so exit early.
        // We don't need a longest match.
        break;
    }

    if value.is_empty() {
        return None;
    }
Aside from matching against a list of strings, we have to make sure there is a complete match. For example, function1 is not a keyword; it's a valid identifier. Whereas function 1 is a valid pair of tokens (the keyword function and the number 1), even if it isn't valid Lua grammar.
    // If the next character would be part of a valid identifier, then
    // this is not a keyword.
    if next_loc.index < raw.len() - 1 {
        let next_c = raw[next_loc.index];
        if next_c.is_alphanumeric() || next_c == '_' {
            return None;
        }
    }

    Some((
        Token {
            value: value,
            loc: initial_loc,
            kind: TokenKind::Keyword,
        },
        next_loc,
    ))
}
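To pin that rule down, here's a sketch of a test (again, not part of the original code) that could be added to the same #[cfg(test)] module sketched earlier in src/lex.rs:
    #[test]
    fn keyword_requires_complete_match() {
        let raw: Vec<char> = "function1 = 1;\n".chars().collect();
        let tokens = lex(&raw).expect("lex failed");
        // `function1` extends the keyword `function`, so it lexes as a
        // single identifier rather than a keyword.
        assert_eq!(tokens[0].kind, TokenKind::Identifier);
        assert_eq!(tokens[0].value, "function1");
    }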
Syntax
Syntax (in this context) is just language junk that isn't operators: things like commas, parentheses, etc.
fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = [";", "=", "(", ")", ","];

    for possible_syntax in syntax {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);
        // TODO: this won't work with multiple-character syntax bits like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Syntax,
                },
                next_loc,
            ));
        }
    }

    None
}
Operators
Operators are things like plus, minus, and less-than symbols. Operators are syntax, but it helps us later on to break these out into a separate type of token.
fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let operators = ["+", "-", "<"];

    for possible_syntax in operators {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);
        // TODO: this won't work with multiple-character operators like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Operator,
                },
                next_loc,
            ));
        }
    }

    None
}
And now we're all done lexing!
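If you want to eyeball the lexer's output before the parser exists, one option (a sketch, not part of the final program) is to temporarily replace src/main.rs with something like this:
mod lex;

fn main() {
    let raw: Vec<char> = "local a = 1 + 2;".chars().collect();
    let tokens = lex::lex(&raw).expect("lex failed");
    for t in &tokens {
        // Prints lines like `Keyword local`, `Identifier a`, `Syntax =`, ...
        println!("{:?} {}", t.kind, t.value);
    }
}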
Grammar analysis
Parsing finds grammatical (tree) patterns in a flat list of tokens. This is called a syntax tree or abstract syntax tree (AST).
The boring part is defining the tree. Generally speaking (and specifically for this project), the syntax tree is a list of statements. Statements can be function definitions or expression statements or if statements or return statements or local declarations.
This goes in src/parse.rs.
#[derive(Debug)]
pub enum Statement {
    Expression(Expression),
    If(If),
    FunctionDeclaration(FunctionDeclaration),
    Return(Return),
    Local(Local),
}
pub type Ast = Vec<Statement>;
There's almost nothing special at all about the rest of the tree definitions.
#[derive(Debug)]
pub enum Literal {
    Identifier(Token),
    Number(Token),
}

#[derive(Debug)]
pub struct FunctionCall {
    pub name: Token,
    pub arguments: Vec<Expression>,
}

#[derive(Debug)]
pub struct BinaryOperation {
    pub operator: Token,
    pub left: Box<Expression>,
    pub right: Box<Expression>,
}

#[derive(Debug)]
pub enum Expression {
    FunctionCall(FunctionCall),
    BinaryOperation(BinaryOperation),
    Literal(Literal),
}

#[derive(Debug)]
pub struct FunctionDeclaration {
    pub name: Token,
    pub parameters: Vec<Token>,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct If {
    pub test: Expression,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct Local {
    pub name: Token,
    pub expression: Expression,
}

#[derive(Debug)]
pub struct Return {
    pub expression: Expression,
}
And that's it for the AST!
Some helpers
Lastly before the fun part, we'll define a few helpers for validating each kind of token.
fn expect_keyword(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }
    let t = tokens[index].clone();
    t.kind == TokenKind::Keyword && t.value == value
}

fn expect_syntax(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }
    let t = tokens[index].clone();
    t.kind == TokenKind::Syntax && t.value == value
}

fn expect_identifier(tokens: &[Token], index: usize) -> bool {
    if index >= tokens.len() {
        return false;
    }
    let t = tokens[index].clone();
    t.kind == TokenKind::Identifier
}
Now on to the fun part, actually detecting these trees!
Top-level parse
The top-level parse function and its major helper,
parse_statement, dispatch very similarly to the top-level lex
function. For each statement in the file we look for function
declarations, if statements, return statements, local declarations,
and expression statements.
fn parse_statement(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    let parsers = [
        parse_if,
        parse_expression_statement,
        parse_return,
        parse_function,
        parse_local,
    ];

    for parser in parsers {
        let res = parser(raw, tokens, index);
        if res.is_some() {
            return res;
        }
    }

    None
}
pub fn parse(raw: &[char], tokens: Vec<Token>) -> Result<Ast, String> {
    let mut ast = vec![];
    let mut index = 0;
    let ntokens = tokens.len();
    while index < ntokens {
        let res = parse_statement(raw, &tokens, index);
        if let Some((stmt, next_index)) = res {
            index = next_index;
            ast.push(stmt);
            continue;
        }

        return Err(tokens[index].loc.debug(raw, "Invalid token while parsing:"));
    }

    Ok(ast)
}
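Once the individual statement parsers below are written, the lexer and parser can be exercised end to end with a throwaway main like this (a sketch that leans on the derived Debug implementations; the statement being parsed is arbitrary):
mod lex;
mod parse;

fn main() {
    let raw: Vec<char> = "print(1 + 2);".chars().collect();
    let tokens = lex::lex(&raw).expect("lex failed");
    let ast = parse::parse(&raw, tokens).expect("parse failed");
    // Pretty-prints a one-element Ast: an Expression statement wrapping a
    // FunctionCall whose single argument is a BinaryOperation.
    println!("{:#?}", ast);
}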
Expression statements
Expression statements are just a wrapper for the Rust type
system. They call parse_expression (which we'll define shortly),
expect a semicolon afterward, and wrap the expression in a statement.
fn parse_expression_statement(
    raw: &[char],
    tokens: &[Token],
    index: usize,
) -> Option<(Statement, usize)> {
    let mut next_index = index;
    let res = parse_expression(raw, tokens, next_index)?;
    let (expr, next_next_index) = res;
    next_index = next_next_index;

    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon after expression:")
        );
        return None;
    }
    next_index += 1; // Skip past semicolon

    Some((Statement::Expression(expr), next_index))
}
Expressions
Expressions in this minimal Lua are only one of function calls,
literals (numbers, identifiers), or binary operations. To keep things
very simple, binary operations cannot be combined. So instead of 1 +
2 + 3 we'd need to do local tmp1 = 1 + 2; local tmp2 = tmp1 + 3;
and so on.
fn parse_expression(raw: &[char], tokens: &[Token], index: usize) -> Option<(Expression, usize)> {
    if index >= tokens.len() {
        return None;
    }

    let t = tokens[index].clone();
    let left = match t.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(t)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(t)),
        _ => {
            return None;
        }
    };
If what follows the first literal is an open parenthesis then we try to parse a function call.
    let mut next_index = index + 1;
    if expect_syntax(tokens, next_index, "(") {
        next_index += 1; // Skip past open paren

        // Function call
        let mut arguments: Vec<Expression> = vec![];
We need to call parse_expression recursively for every possible
argument passed to the function.
        while !expect_syntax(tokens, next_index, ")") {
            // Every argument after the first must be preceded by a comma.
            if !arguments.is_empty() {
                if !expect_syntax(tokens, next_index, ",") {
                    println!(
                        "{}",
                        tokens[next_index]
                            .loc
                            .debug(raw, "Expected comma between function call arguments:")
                    );
                    return None;
                }
                next_index += 1; // Skip past comma
            }

            let res = parse_expression(raw, tokens, next_index);
            if let Some((arg, next_next_index)) = res {
                next_index = next_next_index;
                arguments.push(arg);
            } else {
                println!(
                    "{}",
                    tokens[next_index]
                        .loc
                        .debug(raw, "Expected valid expression in function call arguments:")
                );
                return None;
            }
        }
        next_index += 1; // Skip past closing paren

        return Some((
            Expression::FunctionCall(FunctionCall {
                name: tokens[index].clone(),
                arguments,
            }),
            next_index,
        ));
    }
Otherwise if there isn't an opening parenthesis then we could be parsing either a literal expression or a binary operation. If the token that follows is an operator token then we know it's a binary operation.
    // Might be a literal expression
    if next_index >= tokens.len() || tokens[next_index].clone().kind != TokenKind::Operator {
        return Some((left, next_index));
    }

    // Otherwise is a binary operation
    let op = tokens[next_index].clone();
    next_index += 1; // Skip past op

    if next_index >= tokens.len() {
        // There is no token here, so report the error at the operator.
        println!("{}", op.loc.debug(raw, "Expected valid right hand side binary operand:"));
        return None;
    }

    let rtoken = tokens[next_index].clone();
It is at this point that we could (but won't) call
parse_expression recursively. I don't want to deal with operator
precedence right now so we'll just require that the right hand side is
another literal.
    let right = match rtoken.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(rtoken)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(rtoken)),
        _ => {
            println!(
                "{}",
                rtoken
                    .loc
                    .debug(raw, "Expected valid right hand side binary operand:")
            );
            return None;
        }
    };
    next_index += 1; // Skip past right hand operand
    Some((
        Expression::BinaryOperation(BinaryOperation {
            left: Box::new(left),
            right: Box::new(right),
            operator: op,
        }),
        next_index,
    ))
}
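Since the no-nested-operators restriction is easy to forget, here's a sketch of a test (not part of the original code) that could live at the bottom of src/parse.rs, where the private parse_expression is visible:
#[cfg(test)]
mod tests {
    use super::*;
    use crate::lex;

    #[test]
    fn binary_operations_do_not_nest() {
        let raw: Vec<char> = "1 + 2 + 3;".chars().collect();
        let tokens = lex::lex(&raw).expect("lex failed");

        let (expr, next_index) =
            parse_expression(&raw, &tokens, 0).expect("expected an expression");
        // Only `1 + 2` is consumed; next_index is left pointing at the second `+`.
        assert_eq!(next_index, 3);
        assert!(matches!(expr, Expression::BinaryOperation(_)));
    }
}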