a curated list of database news from authoritative sources

April 12, 2020

Writing a SQL database from scratch in Go: 2. binary expressions and WHERE filters

Previously in database basics: <! forgive me, for I have sinned >
1. SELECT, INSERT, CREATE and a REPL

Next in database basics:
3. indexes
4. a database/sql driver

In this post, we'll extend gosql to support binary expressions and very simple filtering on SELECT results via WHERE. We'll introduce a general mechanism for interpreting an expression on a row in a table. The expression may be an identifier (where the result is the value of the cell corresponding to that column in the row), a numeric literal, a combination via a binary expression, etc.

The following interactions will be possible:

# CREATE TABLE users (name TEXT, age INT);
ok
#  INSERT INTO users VALUES ('Stephen', 16);
ok
# SELECT name, age FROM users;
name   | age
----------+------
Stephen |  16
(1 result)
ok
# INSERT INTO users VALUES ('Adrienne', 23);
ok
# SELECT age + 2, name FROM users WHERE age = 23;
age |   name
------+-----------
25 | Adrienne
(1 result)
ok
# SELECT name FROM users;
name
------------
Stephen
Adrienne
(2 results)
ok

The changes we'll make in this post are roughly a walk through of this commit.

Boilerplate updates

There are a few updates to pick up that I won't go into in this post. Grab the following files from the main repo:

  • lexer.go
    • The big change here is to use the same keyword matching algorithm for symbols. This allows us to support symbols that are longer than one character.
    • This file also now includes the following keywords and symbols: and, or, true, false, =, , ||, and +.
  • cmd/main.go

Parsing boilerplate

We'll redefine three helper functions in parser.go before going further: parseToken, parseTokenKind, and helpMessage.

The parseToken helper will consume a token if it matches the one provided as an argument (ignoring location).

func parseToken(tokens []*token, initialCursor uint, t token) (*token, uint, bool) {
    cursor := initialCursor

    if cursor >= uint(len(tokens)) {
        return nil, initialCursor, false
    }

    if p := tokens[cursor]; t.equals(p) {
        return p, cursor + 1, true
    }

    return nil, initialCursor, false
}

The parseTokenKind helper will consume a token if it is the same kind as an argument provided.

func parseTokenKind(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
    cursor := initialCursor

    if cursor >= uint(len(tokens)) {
        return nil, initialCursor, false
    }

    current := tokens[cursor]
    if current.kind == kind {
        return current, cursor + 1, true
    }

    return nil, initialCursor, false
}

And the helpMessage helper will give an indication of where in a program something happened.

func helpMessage(tokens []*token, cursor uint, msg string) {
    var c *token
    if cursor+1 < uint(len(tokens)) {
        c = tokens[cursor+1]
    } else {
        c = tokens[cursor]
    }

    fmt.Printf("[%d,%d]: %s, near: %s\n", c.loc.line, c.loc.col, msg, c.value)
}

Parsing binary expressions

Next we'll extend the AST structure in ast.go to support a "binary kind" of expression. The binary expression will have two sub-expressions and an operator.

const (
    literalKind expressionKind
    binaryKind
)

type binaryExpression struct {
    a  expression
    b  expression
    op token
}

type expression struct {
    literal *token
    binary  *binaryExpression
    kind    expressionKind
}

We'll use Pratt parsing to handle operator precedence. There is an excellent introduction to this technique here.

If at the beginning of parsing we see a left parenthesis, we'll consume it and parse an expression within it. Then we'll look for a right parenthesis. Otherwise we'll look for a non-binary expression first (e.g. symbol, number).

func parseExpression(tokens []*token, initialCursor uint, delimiters []token, minBp uint) (*expression, uint, bool) {
    cursor := initialCursor

    var exp *expression
    _, newCursor, ok := parseToken(tokens, cursor, tokenFromSymbol(leftParenSymbol))
    if ok {
        cursor = newCursor
        rightParenToken := tokenFromSymbol(rightParenSymbol)

        exp, cursor, ok = parseExpression(tokens, cursor, append(delimiters, rightParenToken), minBp)
        if !ok {
            helpMessage(tokens, cursor, "Expected expression after opening paren")
            return nil, initialCursor, false
        }

        _, cursor, ok = parseToken(tokens, cursor, rightParenToken)
        if !ok {
            helpMessage(tokens, cursor, "Expected closing paren")
            return nil, initialCursor, false
        }
    } else {
        exp, cursor, ok = parseLiteralExpression(tokens, cursor)
        if !ok {
            return nil, initialCursor, false
        }
    }

    ...

    return exp, cursor, true
}

Then we'll look for a binary operator (e.g. =, and) or delimiter. If we find an operator and it of lesser "binding power" than the current minimum (minBp passed as an argument to the parse function with a default value of 0), we'll return the current expression.

    ...

    lastCursor := cursor
outer:
    for cursor < uint(len(tokens)) {
        for _, d := range delimiters {
            _, _, ok = parseToken(tokens, cursor, d)
            if ok {
                break outer
            }
        }

        binOps := []token{
            tokenFromKeyword(andKeyword),
            tokenFromKeyword(orKeyword),
            tokenFromSymbol(eqSymbol),
            tokenFromSymbol(neqSymbol),
            tokenFromSymbol(concatSymbol),
            tokenFromSymbol(plusSymbol),
        }

        var op *token = nil
        for _, bo := range binOps {
            var t *token
            t, cursor, ok = parseToken(tokens, cursor, bo)
            if ok {
                op = t
                break
            }
        }

        if op == nil {
            helpMessage(tokens, cursor, "Expected binary operator")
            return nil, initialCursor, false
        }

        bp := op.bindingPower()
        if bp < minBp {
            cursor = lastCursor
            break
        }

        ...
    }

    return exp, cursor, true

The bindingPower function on tokens can be defined for now such that sum and concatenation have the highest binding power, followed by equality operations, then boolean operators, and then everything else at zero.

func (t token) bindingPower() uint {
    switch t.kind {
    case keywordKind:
        switch keyword(t.value) {
        case andKeyword:
            fallthrough
        case orKeyword:
            return 1
        }
    case symbolKind:
        switch symbol(t.value) {
        case eqSymbol:
            fallthrough
        case neqSymbol:
            fallthrough
        case concatSymbol:
            fallthrough
        case plusSymbol:
            return 3
        }
    }

    return 0
}

Back in parseExpression, if the new operator has greater binding power we'll parse the next operand expression (a recursive call, passing the binding power of the new operator as the new minBp).

Upon completion, the current expression (the return value of the recursive call) is set to a new binary expression containing the previously current expression on the left and the just-parsed expression on the right.

        ...

        b, newCursor, ok := parseExpression(tokens, cursor, delimiters, bp)
        if !ok {
            helpMessage(tokens, cursor, "Expected right operand")
            return nil, initialCursor, false
        }
        exp = &expression{
            binary: &binaryExpression{
                *exp,
                *b,
                *op,
            },
            kind: binaryKind,
        }
        cursor = newCursor
        lastCursor = cursor
    }

    return exp, cursor, true
}

All together:

func parseExpression(tokens []*token, initialCursor uint, delimiters []token, minBp uint) (*expression, uint, bool) {
    cursor := initialCursor

    var exp *expression
    _, newCursor, ok := parseToken(tokens, cursor, tokenFromSymbol(leftParenSymbol))
    if ok {
        cursor = newCursor
        rightParenToken := tokenFromSymbol(rightParenSymbol)

        exp, cursor, ok = parseExpression(tokens, cursor, append(delimiters, rightParenToken), minBp)
        if !ok {
            helpMessage(tokens, cursor, "Expected expression after opening paren")
            return nil, initialCursor, false
        }

        _, cursor, ok = parseToken(tokens, cursor, rightParenToken)
        if !ok {
            helpMessage(tokens, cursor, "Expected closing paren")
            return nil, initialCursor, false
        }
    } else {
        exp, cursor, ok = parseLiteralExpression(tokens, cursor)
        if !ok {
            return nil, initialCursor, false
        }
    }

    lastCursor := cursor
outer:
    for cursor < uint(len(tokens)) {
        for _, d := range delimiters {
            _, _, ok = parseToken(tokens, cursor, d)
            if ok {
                break outer
            }
        }

        binOps := []token{
            tokenFromKeyword(andKeyword),
            tokenFromKeyword(orKeyword),
            tokenFromSymbol(eqSymbol),
            tokenFromSymbol(neqSymbol),
            tokenFromSymbol(concatSymbol),
            tokenFromSymbol(plusSymbol),
        }

        var op *token = nil
        for _, bo := range binOps {
            var t *token
            t, cursor, ok = parseToken(tokens, cursor, bo)
            if ok {
                op = t
                break
            }
        }

        if op == nil {
            helpMessage(tokens, cursor, "Expected binary operator")
            return nil, initialCursor, false
        }

        bp := op.bindingPower()
        if bp < minBp {
            cursor = lastCursor
            break
        }

        b, newCursor, ok := parseExpression(tokens, cursor, delimiters, bp)
        if !ok {
            helpMessage(tokens, cursor, "Expected right operand")
            return nil, initialCursor, false
        }
        exp = &expression{
            binary: &binaryExpression{
                *exp,
                *b,
                *op,
            },
            kind: binaryKind,
        }
        cursor = newCursor
        lastCursor = cursor
    }

    return exp, cursor, true
}

Now that we have this general parse expression helper in place, we can add support for parsing WHERE in SELECT statements.

Parsing WHERE

This part's pretty easy. We modify the existing parseSelectStatement to search for an optional WHERE token followed by an expression.

func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
    var ok bool
    cursor := initialCursor
    _, cursor, ok = parseToken(tokens, cursor, tokenFromKeyword(selectKeyword))
    if !ok {
        return nil, initialCursor, false
    }

    slct := SelectStatement{}

    fromToken := tokenFromKeyword(fromKeyword)
    item, newCursor, ok := parseSelectItem(tokens, cursor, []token{fromToken, delimiter})
    if !ok {
        return nil, initialCursor, false
    }

    slct.item = item
    cursor = newCursor

    whereToken := tokenFromKeyword(whereKeyword)
    delimiters := []token{delimiter, whereToken}

    _, cursor, ok = parseToken(tokens, cursor, fromToken)
    if ok {
        from, newCursor, ok := parseFromItem(tokens, cursor, delimiters)
        if !ok {
            helpMessage(tokens, cursor, "Expected FROM item")
            return nil, initialCursor, false
        }

        slct.from = from
        cursor = newCursor
    }

    _, cursor, ok = parseToken(tokens, cursor, whereToken)
    if ok {
        where, newCursor, ok := parseExpression(tokens, cursor, []token{delimiter}, 0)
        if !ok {
            helpMessage(tokens, cursor, "Expected WHERE conditionals")
            return nil, initialCursor, false
        }

        slct.where = where
        cursor = newCursor
    }

    return &slct, cursor, true
}

Now we're all done with parsing binary expressions and WHERE filters! If in doubt, refer to parser.go in the project.

Re-thinking query execution

In the first post in this series, we didn't establish any standard way for interpreting an expression in any kind of statement. In SQL though, every expression is always run in the context of a row in a table. We'll handle cases like SELECT 1 and INSERT INTO users VALUES (1) by creating a table with a single empty row to act as the context.

This requires a bit of re-architecting. So we'll rewrite the memory.go implementation in this post from scratch.

We'll also stop panic-ing when things go wrong. Instead we'll print a message. This allows the REPL loop to keep going.

Memory cells

Again the fundamental blocks of memory in the table will be an untyped array of bytes. We'll provide conversion methods from this memory cell into integers, strings, and boolean Go values.

type MemoryCell []byte

func (mc MemoryCell) AsInt() int32 {
    var i int32
    err := binary.Read(bytes.NewBuffer(mc), binary.BigEndian, &i)
    if err != nil {
        fmt.Printf("Corrupted data [%s]: %s\n", mc, err)
        return 0
    }

    return i
}

func (mc MemoryCell) AsText() string {
    return string(mc)
}

func (mc MemoryCell) AsBool() bool {
    return len(mc) != 0
}

func (mc MemoryCell) equals(b MemoryCell) bool {
    // Seems verbose but need to make sure if one is nil, the
    // comparison still fails quickly
    if mc == nil || b == nil {
        return mc == nil && b == nil
    }

    return bytes.Compare(mc, b) == 0
}

We'll also extend the Cell interface in backend.go to support the new boolean type.

package gosql

type ColumnType uint

const (
    TextType ColumnType = iota
    IntType
    BoolType
)

type Cell interface {
    AsText() string
    AsInt() int32
    AsBool() bool
}

...

Finally, we need a way for mapping a Go value into a memory cell.

func literalToMemoryCell(t *token) MemoryCell {
    if t.kind == numericKind {
        buf := new(bytes.Buffer)
        i, err := strconv.Atoi(t.value)
        if err != nil {
            fmt.Printf("Corrupted data [%s]: %s\n", t.value, err)
            return MemoryCell(nil)
        }

        // TODO: handle bigint
        err = binary.Write(buf, binary.BigEndian, int32(i))
        if err != nil {
            fmt.Printf("Corrupted data [%s]: %s\n", string(buf.Bytes()), err)
            return MemoryCell(nil)
        }
        return MemoryCell(buf.Bytes())
    }

    if t.kind == stringKind {
        return MemoryCell(t.value)
    }

    if t.kind == boolKind {
        if t.value == "true" {
            return MemoryCell([]byte{1})
        } else {
            return MemoryCell(nil)
        }
    }

    return nil
}

And we'll provide global true and false values:

var (
    trueToken  = token{kind: boolKind, value: "true"}
    falseToken = token{kind: boolKind, value: "false"}

    trueMemoryCell  = literalToMemoryCell(&trueToken)
    falseMemoryCell = literalToMemoryCell(&falseToken)
)

Tables

A table has a list of rows (an array of memory cells) and a list of column names and types.

type table struct {
    columns     []string
    columnTypes []ColumnType
    rows        [][]MemoryCell
}

Finally we'll add a series of methods on table that, given a row index, interprets an expression AST against that row in the table.

Interpreting literals

First we'll implement evaluateLiteralCell that will look up an identifier or return the value of integers, strings, and booleans.

func (t *table) evaluateLiteralCell(rowIndex uint, exp expression) (MemoryCell, string, ColumnType, error) {
    if exp.kind != literalKind {
        return nil, "", 0, ErrInvalidCell
    }

    lit := exp.literal
    if lit.kind == identifierKind {
        for i, tableCol := range t.columns {
            if tableCol == lit.value {
                return t.rows[rowIndex][i], tableCol, t.columnTypes[i], nil
            }
        }

        return nil, "", 0, ErrColumnDoesNotExist
    }

    columnType := IntType
    if lit.kind == stringKind {
        columnType = TextType
    } else if lit.kind == boolKind {
        columnType = BoolType
    }

    return literalToMemoryCell(lit), "?column?", columnType, nil
}

Interpreting binary expressions

Now we can implement evaluateBinaryCell that will evaluate it's two sub-expressions and combine them together according to the operator. The SQL operators we have defined so far do no coercion. So we'll fail immediately if the two sides of the operation are not of the same type. Additionally, the concatenation and addition operators require that their arguments are strings and numbers, respectively.

func (t *table) evaluateBinaryCell(rowIndex uint, exp expression) (MemoryCell, string, ColumnType, error) {
    if exp.kind != binaryKind {
        return nil, "", 0, ErrInvalidCell
    }

    bexp := exp.binary

    l, _, lt, err := t.evaluateCell(rowIndex, bexp.a)
    if err != nil {
        return nil, "", 0, err
    }

    r, _, rt, err := t.evaluateCell(rowIndex, bexp.b)
    if err != nil {
        return nil, "", 0, err
    }

    switch bexp.op.kind {
    case symbolKind:
        switch symbol(bexp.op.value) {
        case eqSymbol:
            eq := l.equals(r)
            if lt == TextType && rt == TextType && eq {
                return trueMemoryCell, "?column?", BoolType, nil
            }

            if lt == IntType && rt == IntType && eq {
                return trueMemoryCell, "?column?", BoolType, nil
            }

            if lt == BoolType && rt == BoolType && eq {
                return trueMemoryCell, "?column?"
                                    
                                    
                                

April 10, 2020

April 04, 2020

Studying foreign languages with inbox zero

The only time I've been able to seriously, rapidly improve my ability to speak a foreign language was through intensive language courses in college. I was forced to actively speak, read, and write Chinese for 6-8 hours a week (1-2 hours every day). Then study another 5-10 hours a week in preparation for the active sessions. I went three semesters like this before I left school.

I've been trying to recreate that intensity since and mostly failed. After marrying a Korean, I've redirected the little effort I can muster to learning Korean. Aside from stints over the years (mostly for a month or two before or after a trip to Korea), I haven't been able to keep up any practice.

One thing I've tried over the years to commit myself to learning a number of different topics is to set up recurring calendar invites: "Study Linux", "Study TCP/IP", "Study Korean", etc.

This has mostly failed too. However, I do always look at the invites as I get notified.

I keep inbox zero and I check my email many times a day, marking each email read dilligently when I no longer need to think about it.

Tools like Quizlet, Anki, or even Duolingo let you self-learn vocabulary when you feel like it. But basically no service will try to keep giving you exposure to some set of topics whether you spend time on it or not.

The most important thing I can think of is forced exposure to vocabulary. So I've been planning for some time to hook up a list of the one thousand most common Korean words to scheduled emails.

This weekend I finally got around to scripting the Google Calendar API against the words list. I have an event for each word for the next 1000 days. Each day I receive a summary email including all events of the day and the new word is part of it.

This is a pretty indirect approach but it's pretty simple to set up. It's not very easy to reconfigure.

The code for doing this is available on Github if you're interested. And if you know a service that can build and manage scheduled notifications against a spreadsheet or database I'd rather be looking at that.

We'll see how this works out.

April 03, 2020

March 25, 2020

March 24, 2020

March 20, 2020