Writing a minimal Lua implementation with a virtual machine from scratch in Rust
By the end of this guide we'll have a minimal, working implementation of a small part of Lua from scratch. It will be able to run the following program (among others):
function fib(n)
  if n < 2 then
    return n;
  end

  local n1 = fib(n-1);
  local n2 = fib(n-2);
  return n1 + n2;
end

print(fib(30));
This is my second project in Rust and only the third time I've invented an instruction set so don't take my style as gospel. However, I have found some Rust parsing tutorials overly complex so I'm hoping you'll find this one simpler.
All source code is available on Github.
Entrypoint
Running cargo init will give us the necessary boilerplate. In
src/main.rs we'll accept a file name from the command line, perform
lexical analysis to retrieve all tokens from the file, perform grammar
analysis on the tokens to retrieve a tree structure, compile the tree
to a linear set of virtual machine instructions, and finally interpret
the virtual machine instructions.
mod eval;
mod lex;
mod parse;

use std::env;
use std::fs;

fn main() {
    let args: Vec<String> = env::args().collect();
    let contents = fs::read_to_string(&args[1]).expect("Could not read file");
    let raw: Vec<char> = contents.chars().collect();

    let tokens = match lex::lex(&raw) {
        Ok(tokens) => tokens,
        Err(msg) => panic!("{}", msg),
    };

    let ast = match parse::parse(&raw, tokens) {
        Ok(ast) => ast,
        Err(msg) => panic!("{}", msg),
    };

    let pgrm = eval::compile(&raw, ast);
    eval::eval(pgrm);
}
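Assuming the fib example above is saved in a file like test.lua (any filename works), the finished interpreter will be invoked as cargo run -- test.lua, since args[1] is the first argument after the binary's own name.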
Easy peasy. Now let's implement lex.
Lexical analysis
Lexical analysis drops whitespace (Lua is not whitespace sensitive) and chunks all source code characters into their smallest possible meaningful pieces like commas, numbers, identifiers, keywords, etc.
In order to have useful error messages, we'll keep track of state in
the file with a Location struct that implements increment and
debug.
This goes in src/lex.rs.
#[derive(Copy, Clone, Debug)]
pub struct Location {
    col: i32,
    line: i32,
    index: usize,
}
The increment function will update line and column numbers as well
as the current index in the file.
impl Location {
    fn increment(&self, newline: bool) -> Location {
        if newline {
            Location {
                index: self.index + 1,
                col: 0,
                line: self.line + 1,
            }
        } else {
            Location {
                index: self.index + 1,
                col: self.col + 1,
                line: self.line,
            }
        }
    }
And the debug function will dump the current line with a pointer in
text to the current column along with a message.
    pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
        let mut line = 0;
        let mut line_str = String::new();
        // Find the whole line of original source
        for c in raw {
            if *c == '\n' {
                line += 1;
                // Done discovering line in question
                if !line_str.is_empty() {
                    break;
                }
                continue;
            }
            if self.line == line {
                line_str.push_str(&c.to_string());
            }
        }

        let space = " ".repeat(self.col as usize);
        format!("{}\n\n{}\n{}^ Near here", msg.into(), line_str, space)
    }
}
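To make that output format concrete, here's a quick test of debug. It's not part of the original code, just a sketch that could sit in a #[cfg(test)] module at the bottom of src/lex.rs, where the private Location fields are still visible.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn debug_points_at_the_column() {
        // `$` is the character we'll point at: column 10 of line 0.
        let raw: Vec<char> = "local x = $;\n".chars().collect();
        let loc = Location { col: 10, line: 0, index: 10 };

        let out = loc.debug(&raw, "Unrecognized character while lexing:");
        // The message, then the offending line, then a caret under column 10.
        assert!(out.contains("local x = $;"));
        assert!(out.ends_with("^ Near here"));
    }
}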
The smallest individual unit after lexical analysis is a token which is either a keyword, number, identifier, operator, or syntax. (This implementation is clearly skipping lots of real Lua syntax like strings.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
    Identifier,
    Syntax,
    Keyword,
    Number,
    Operator,
}

#[derive(Debug, Clone)]
pub struct Token {
    pub value: String,
    pub kind: TokenKind,
    pub loc: Location,
}
The top-level lex function will iterate over the file and call a lex
helper for each kind of token, returning an array of all tokens on
success. In between lexing it will "eat whitespace".
pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
    let mut loc = Location {
        col: 0,
        index: 0,
        line: 0,
    };
    let size = s.len();
    let mut tokens: Vec<Token> = vec![];

    let lexers = [
        lex_keyword,
        lex_identifier,
        lex_number,
        lex_syntax,
        lex_operator,
    ];

    'outer: while loc.index < size {
        loc = eat_whitespace(s, loc);
        if loc.index == size {
            break;
        }

        for lexer in lexers {
            let res = lexer(s, loc);
            if let Some((t, next_loc)) = res {
                loc = next_loc;
                tokens.push(t);
                continue 'outer;
            }
        }

        return Err(loc.debug(s, "Unrecognized character while lexing:"));
    }

    Ok(tokens)
}
Whitespace
Eating whitespace is just incrementing the location while we see a space, tab, newline, etc.
fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
    let mut c = raw[initial_loc.index];
    let mut next_loc = initial_loc;
    while [' ', '\n', '\r', '\t'].contains(&c) {
        next_loc = next_loc.increment(c == '\n');
        if next_loc.index == raw.len() {
            break;
        }
        c = raw[next_loc.index];
    }
    next_loc
}
Numbers
Lexing numbers iterates through the source starting at a position until it stops seeing decimal digits (this implementation only supports integers).
fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_digit(10) {
        ident.push_str(&c.to_string());
        next_loc = next_loc.increment(false);
        c = raw[next_loc.index];
    }
If there are no digits in the string then this is not a number.
    if !ident.is_empty() {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Number,
            },
            next_loc,
        ))
    } else {
        None
    }
}
Identifiers
Identifiers are any collection of alphabet characters, numbers, and underscores.
fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_alphanumeric() || c == '_' {
        ident.push_str(&c.to_string());
        next_loc = next_loc.increment(false);
        c = raw[next_loc.index];
    }
But they cannot start with a number.
    // First character must not be a digit
    if ident.len() > 0 && !ident.chars().next().unwrap().is_digit(10) {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Identifier,
            },
            next_loc,
        ))
    } else {
        None
    }
}
Keywords
Keywords are alphabetical like identifiers, but they cannot be reused as variable names by the user.
fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = ["function", "end", "if", "then", "local", "return"];

    let mut next_loc = initial_loc;
    let mut value = String::new();
    'outer: for possible_syntax in syntax {
        let mut c = raw[initial_loc.index];
        next_loc = initial_loc;
        while c.is_alphanumeric() || c == '_' {
            value.push_str(&c.to_string());
            next_loc = next_loc.increment(false);
            c = raw[next_loc.index];

            let n = next_loc.index - initial_loc.index;
            // Stop if the value is no longer a prefix of this keyword,
            // including when it has grown longer than the keyword (as in
            // `function1`), which would otherwise panic on the slice below.
            if n > possible_syntax.len() || value != possible_syntax[..n] {
                value = String::new();
                continue 'outer;
            }
        }

        // Not a complete match
        if value.len() < possible_syntax.len() {
            value = String::new();
            continue;
        }

        // If it got to this point it found a match, so exit early.
        // We don't need a longest match.
        break;
    }

    if value.is_empty() {
        return None;
    }
Aside from matching against a list of strings, we have to make sure there is a complete match. For example, function1 is not a keyword; it's a valid identifier. Whereas function 1 is a valid pair of tokens (the keyword function and the number 1), even if it isn't valid Lua grammar.
    // If the next character would be part of a valid identifier, then
    // this is not a keyword.
    if next_loc.index < raw.len() - 1 {
        let next_c = raw[next_loc.index];
        if next_c.is_alphanumeric() || next_c == '_' {
            return None;
        }
    }

    Some((
        Token {
            value: value,
            loc: initial_loc,
            kind: TokenKind::Keyword,
        },
        next_loc,
    ))
}
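To pin that rule down, here's a sketch of a test (again, not part of the original code) that could be added to the same #[cfg(test)] module sketched earlier in src/lex.rs:
    #[test]
    fn keyword_requires_complete_match() {
        let raw: Vec<char> = "function1 = 1;\n".chars().collect();
        let tokens = lex(&raw).expect("lex failed");
        // `function1` extends the keyword `function`, so it lexes as a
        // single identifier rather than a keyword.
        assert_eq!(tokens[0].kind, TokenKind::Identifier);
        assert_eq!(tokens[0].value, "function1");
    }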
Syntax
Syntax (in this context) is just language junk that isn't operators: things like commas, parentheses, etc.
fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = [";", "=", "(", ")", ","];

    for possible_syntax in syntax {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);
        // TODO: this won't work with multiple-character syntax bits like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Syntax,
                },
                next_loc,
            ));
        }
    }

    None
}
Operators
Operators are things like plus, minus, and less-than symbols. Operators are syntax, but it helps us later on to break these out into a separate type of token.
fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let operators = ["+", "-", "<"];

    for possible_syntax in operators {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);
        // TODO: this won't work with multiple-character operators like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Operator,
                },
                next_loc,
            ));
        }
    }

    None
}
And now we're all done lexing!
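If you want to eyeball the lexer's output before the parser exists, one option (a sketch, not part of the final program) is to temporarily replace src/main.rs with something like this:
mod lex;

fn main() {
    let raw: Vec<char> = "local a = 1 + 2;".chars().collect();
    let tokens = lex::lex(&raw).expect("lex failed");
    for t in &tokens {
        // Prints lines like `Keyword local`, `Identifier a`, `Syntax =`, ...
        println!("{:?} {}", t.kind, t.value);
    }
}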
Grammar analysis
Parsing finds grammatical (tree) patterns in a flat list of tokens. This is called a syntax tree or abstract syntax tree (AST).
The boring part is defining the tree. Generally speaking (and specifically for this project), the syntax tree is a list of statements. Statements can be function definitions or expression statements or if statements or return statements or local declarations.
This goes in src/parse.rs.
#[derive(Debug)]
pub enum Statement {
    Expression(Expression),
    If(If),
    FunctionDeclaration(FunctionDeclaration),
    Return(Return),
    Local(Local),
}
pub type Ast = Vec<Statement>;
There's almost nothing special at all about the rest of the tree definitions.
#[derive(Debug)]
pub enum Literal {
    Identifier(Token),
    Number(Token),
}

#[derive(Debug)]
pub struct FunctionCall {
    pub name: Token,
    pub arguments: Vec<Expression>,
}

#[derive(Debug)]
pub struct BinaryOperation {
    pub operator: Token,
    pub left: Box<Expression>,
    pub right: Box<Expression>,
}

#[derive(Debug)]
pub enum Expression {
    FunctionCall(FunctionCall),
    BinaryOperation(BinaryOperation),
    Literal(Literal),
}

#[derive(Debug)]
pub struct FunctionDeclaration {
    pub name: Token,
    pub parameters: Vec<Token>,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct If {
    pub test: Expression,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct Local {
    pub name: Token,
    pub expression: Expression,
}

#[derive(Debug)]
pub struct Return {
    pub expression: Expression,
}
And that's it for the AST!
Some helpers
Lastly before the fun part, we'll define a few helpers for validating each kind of token.
fn expect_keyword(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }
    let t = tokens[index].clone();
    t.kind == TokenKind::Keyword && t.value == value
}

fn expect_syntax(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }
    let t = tokens[index].clone();
    t.kind == TokenKind::Syntax && t.value == value
}

fn expect_identifier(tokens: &[Token], index: usize) -> bool {
    if index >= tokens.len() {
        return false;
    }
    let t = tokens[index].clone();
    t.kind == TokenKind::Identifier
}
Now on to the fun part, actually detecting these trees!
Top-level parse
The top-level parse function and its major helper,
parse_statement, dispatch very similarly to the top-level lex
function. For each statement in the file we look for function
declarations, if statements, return statements, local declarations,
and expression statements.
fn parse_statement(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    let parsers = [
        parse_if,
        parse_expression_statement,
        parse_return,
        parse_function,
        parse_local,
    ];

    for parser in parsers {
        let res = parser(raw, tokens, index);
        if res.is_some() {
            return res;
        }
    }

    None
}
pub fn parse(raw: &[char], tokens: Vec<Token>) -> Result<Ast, String> {
    let mut ast = vec![];
    let mut index = 0;
    let ntokens = tokens.len();
    while index < ntokens {
        let res = parse_statement(raw, &tokens, index);
        if let Some((stmt, next_index)) = res {
            index = next_index;
            ast.push(stmt);
            continue;
        }

        return Err(tokens[index].loc.debug(raw, "Invalid token while parsing:"));
    }

    Ok(ast)
}
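Once the individual statement parsers below are written, the lexer and parser can be exercised end to end with a throwaway main like this (a sketch that leans on the derived Debug implementations; the statement being parsed is arbitrary):
mod lex;
mod parse;

fn main() {
    let raw: Vec<char> = "print(1 + 2);".chars().collect();
    let tokens = lex::lex(&raw).expect("lex failed");
    let ast = parse::parse(&raw, tokens).expect("parse failed");
    // Pretty-prints a one-element Ast: an Expression statement wrapping a
    // FunctionCall whose single argument is a BinaryOperation.
    println!("{:#?}", ast);
}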
Expression statements
Expression statements are just a wrapper for the Rust type
system. They call parse_expression (which we'll define shortly),
expect a semicolon afterward, and wrap the expression in a statement.
fn parse_expression_statement(
    raw: &[char],
    tokens: &[Token],
    index: usize,
) -> Option<(Statement, usize)> {
    let mut next_index = index;
    let res = parse_expression(raw, tokens, next_index)?;
    let (expr, next_next_index) = res;
    next_index = next_next_index;

    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon after expression:")
        );
        return None;
    }
    next_index += 1; // Skip past semicolon

    Some((Statement::Expression(expr), next_index))
}
Expressions
Expressions in this minimal Lua are only one of function calls,
literals (numbers, identifiers), or binary operations. To keep things
very simple, binary operations cannot be combined. So instead of 1 +
2 + 3 we'd need to do local tmp1 = 1 + 2; local tmp2 = tmp1 + 3;
and so on.
fn parse_expression(raw: &[char], tokens: &[Token], index: usize) -> Option<(Expression, usize)> {
    if index >= tokens.len() {
        return None;
    }

    let t = tokens[index].clone();
    let left = match t.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(t)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(t)),
        _ => {
            return None;
        }
    };
If what follows the first literal is an open parenthesis then we try to parse a function call.
    let mut next_index = index + 1;
    if expect_syntax(tokens, next_index, "(") {
        next_index += 1; // Skip past open paren

        // Function call
        let mut arguments: Vec<Expression> = vec![];
We need to call parse_expression recursively for every possible
argument passed to the function.
        while !expect_syntax(tokens, next_index, ")") {
            // Every argument after the first must be preceded by a comma.
            if !arguments.is_empty() {
                if !expect_syntax(tokens, next_index, ",") {
                    println!(
                        "{}",
                        tokens[next_index]
                            .loc
                            .debug(raw, "Expected comma between function call arguments:")
                    );
                    return None;
                }
                next_index += 1; // Skip past comma
            }

            let res = parse_expression(raw, tokens, next_index);
            if let Some((arg, next_next_index)) = res {
                next_index = next_next_index;
                arguments.push(arg);
            } else {
                println!(
                    "{}",
                    tokens[next_index]
                        .loc
                        .debug(raw, "Expected valid expression in function call arguments:")
                );
                return None;
            }
        }
        next_index += 1; // Skip past closing paren

        return Some((
            Expression::FunctionCall(FunctionCall {
                name: tokens[index].clone(),
                arguments,
            }),
            next_index,
        ));
    }
Otherwise if there isn't an opening parenthesis then we could be parsing either a literal expression or a binary operation. If the token that follows is an operator token then we know it's a binary operation.
    // Might be a literal expression
    if next_index >= tokens.len() || tokens[next_index].clone().kind != TokenKind::Operator {
        return Some((left, next_index));
    }

    // Otherwise is a binary operation
    let op = tokens[next_index].clone();
    next_index += 1; // Skip past op

    if next_index >= tokens.len() {
        // There is no token here, so report the error at the operator.
        println!("{}", op.loc.debug(raw, "Expected valid right hand side binary operand:"));
        return None;
    }

    let rtoken = tokens[next_index].clone();
It is at this point that we could (but won't) call
parse_expression recursively. I don't want to deal with operator
precedence right now so we'll just require that the right hand side is
another literal.
    let right = match rtoken.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(rtoken)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(rtoken)),
        _ => {
            println!(
                "{}",
                rtoken
                    .loc
                    .debug(raw, "Expected valid right hand side binary operand:")
            );
            return None;
        }
    };
    next_index += 1; // Skip past right hand operand
    Some((
        Expression::BinaryOperation(BinaryOperation {
            left: Box::new(left),
            right: Box::new(right),
            operator: op,
        }),
        next_index,
    ))
}
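Since the no-nested-operators restriction is easy to forget, here's a sketch of a test (not part of the original code) that could live at the bottom of src/parse.rs, where the private parse_expression is visible:
#[cfg(test)]
mod tests {
    use super::*;
    use crate::lex;

    #[test]
    fn binary_operations_do_not_nest() {
        let raw: Vec<char> = "1 + 2 + 3;".chars().collect();
        let tokens = lex::lex(&raw).expect("lex failed");

        let (expr, next_index) =
            parse_expression(&raw, &tokens, 0).expect("expected an expression");
        // Only `1 + 2` is consumed; next_index is left pointing at the second `+`.
        assert_eq!(next_index, 3);
        assert!(matches!(expr, Expression::BinaryOperation(_)));
    }
}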