Writing a SQL database from scratch in Go: 2. binary expressions and WHERE filters
Previously in database basics:
<! forgive me, for I have sinned >
1. SELECT, INSERT, CREATE and a REPL
Next in database basics:
3. indexes
4. a database/sql driver
In this post, we'll extend gosql to support binary expressions and very simple filtering on SELECT results via WHERE. We'll introduce a general mechanism for interpreting an expression on a row in a table. The expression may be an identifier (where the result is the value of the cell corresponding to that column in the row), a numeric literal, a combination via a binary expression, etc.
The following interactions will be possible:
# CREATE TABLE users (name TEXT, age INT);
ok
# INSERT INTO users VALUES ('Stephen', 16);
ok
# SELECT name, age FROM users;
name | age
----------+------
Stephen | 16
(1 result)
ok
# INSERT INTO users VALUES ('Adrienne', 23);
ok
# SELECT age + 2, name FROM users WHERE age = 23;
age | name
------+-----------
25 | Adrienne
(1 result)
ok
# SELECT name FROM users;
name
------------
Stephen
Adrienne
(2 results)
ok
The changes we'll make in this post are roughly a walk through of this commit.
Boilerplate updates
There are a few updates to pick up that I won't go into in this post. Grab the following files from the main repo:
- lexer.go
- The big change here is to use the same keyword matching algorithm for symbols. This allows us to support symbols that are longer than one character.
- This file also now includes the following keywords and symbols:
and
,or
,true
,false
,=
,,
||
, and+
.
- cmd/main.go
- This file now uses a third-party table-rendering library instead of the hacky, handwritten original one.
- This also uses a third-party readline implementation so you get history and useful cursor movement in the REPL.
Parsing boilerplate
We'll redefine three helper functions in parser.go
before
going further:
parseToken
, parseTokenKind
, and
helpMessage
.
The parseToken
helper will consume a token if it matches
the one provided as an argument (ignoring location).
func parseToken(tokens []*token, initialCursor uint, t token) (*token, uint, bool) {
cursor := initialCursor
if cursor >= uint(len(tokens)) {
return nil, initialCursor, false
}
if p := tokens[cursor]; t.equals(p) {
return p, cursor + 1, true
}
return nil, initialCursor, false
}
The parseTokenKind
helper will consume a token if it is
the same kind as an argument provided.
func parseTokenKind(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
cursor := initialCursor
if cursor >= uint(len(tokens)) {
return nil, initialCursor, false
}
current := tokens[cursor]
if current.kind == kind {
return current, cursor + 1, true
}
return nil, initialCursor, false
}
And the helpMessage
helper will give an indication of
where in a program something happened.
func helpMessage(tokens []*token, cursor uint, msg string) {
var c *token
if cursor+1 < uint(len(tokens)) {
c = tokens[cursor+1]
} else {
c = tokens[cursor]
}
fmt.Printf("[%d,%d]: %s, near: %s\n", c.loc.line, c.loc.col, msg, c.value)
}
Parsing binary expressions
Next we'll extend the AST structure in ast.go
to
support a "binary kind" of expression. The binary expression will have
two sub-expressions and an operator.
const (
literalKind expressionKind
binaryKind
)
type binaryExpression struct {
a expression
b expression
op token
}
type expression struct {
literal *token
binary *binaryExpression
kind expressionKind
}
We'll use Pratt parsing to handle operator precedence. There is an excellent introduction to this technique here.
If at the beginning of parsing we see a left parenthesis, we'll consume it and parse an expression within it. Then we'll look for a right parenthesis. Otherwise we'll look for a non-binary expression first (e.g. symbol, number).
func parseExpression(tokens []*token, initialCursor uint, delimiters []token, minBp uint) (*expression, uint, bool) {
cursor := initialCursor
var exp *expression
_, newCursor, ok := parseToken(tokens, cursor, tokenFromSymbol(leftParenSymbol))
if ok {
cursor = newCursor
rightParenToken := tokenFromSymbol(rightParenSymbol)
exp, cursor, ok = parseExpression(tokens, cursor, append(delimiters, rightParenToken), minBp)
if !ok {
helpMessage(tokens, cursor, "Expected expression after opening paren")
return nil, initialCursor, false
}
_, cursor, ok = parseToken(tokens, cursor, rightParenToken)
if !ok {
helpMessage(tokens, cursor, "Expected closing paren")
return nil, initialCursor, false
}
} else {
exp, cursor, ok = parseLiteralExpression(tokens, cursor)
if !ok {
return nil, initialCursor, false
}
}
...
return exp, cursor, true
}
Then we'll look for a binary operator (e.g. =
,
and
) or delimiter. If we find an operator and it of
lesser "binding power" than the current minimum (minBp
passed as an argument to the parse function with a default value of
0
), we'll return the current expression.
...
lastCursor := cursor
outer:
for cursor < uint(len(tokens)) {
for _, d := range delimiters {
_, _, ok = parseToken(tokens, cursor, d)
if ok {
break outer
}
}
binOps := []token{
tokenFromKeyword(andKeyword),
tokenFromKeyword(orKeyword),
tokenFromSymbol(eqSymbol),
tokenFromSymbol(neqSymbol),
tokenFromSymbol(concatSymbol),
tokenFromSymbol(plusSymbol),
}
var op *token = nil
for _, bo := range binOps {
var t *token
t, cursor, ok = parseToken(tokens, cursor, bo)
if ok {
op = t
break
}
}
if op == nil {
helpMessage(tokens, cursor, "Expected binary operator")
return nil, initialCursor, false
}
bp := op.bindingPower()
if bp < minBp {
cursor = lastCursor
break
}
...
}
return exp, cursor, true
The bindingPower
function on tokens can be defined for
now such that sum and concatenation have the highest binding power,
followed by equality operations, then boolean operators, and then
everything else at zero.
func (t token) bindingPower() uint {
switch t.kind {
case keywordKind:
switch keyword(t.value) {
case andKeyword:
fallthrough
case orKeyword:
return 1
}
case symbolKind:
switch symbol(t.value) {
case eqSymbol:
fallthrough
case neqSymbol:
fallthrough
case concatSymbol:
fallthrough
case plusSymbol:
return 3
}
}
return 0
}
Back in parseExpression
, if the new operator has greater
binding power we'll parse the next operand expression (a recursive
call, passing the binding power of the new operator as the new
minBp
).
Upon completion, the current expression (the return value of the recursive call) is set to a new binary expression containing the previously current expression on the left and the just-parsed expression on the right.
...
b, newCursor, ok := parseExpression(tokens, cursor, delimiters, bp)
if !ok {
helpMessage(tokens, cursor, "Expected right operand")
return nil, initialCursor, false
}
exp = &expression{
binary: &binaryExpression{
*exp,
*b,
*op,
},
kind: binaryKind,
}
cursor = newCursor
lastCursor = cursor
}
return exp, cursor, true
}
All together:
func parseExpression(tokens []*token, initialCursor uint, delimiters []token, minBp uint) (*expression, uint, bool) {
cursor := initialCursor
var exp *expression
_, newCursor, ok := parseToken(tokens, cursor, tokenFromSymbol(leftParenSymbol))
if ok {
cursor = newCursor
rightParenToken := tokenFromSymbol(rightParenSymbol)
exp, cursor, ok = parseExpression(tokens, cursor, append(delimiters, rightParenToken), minBp)
if !ok {
helpMessage(tokens, cursor, "Expected expression after opening paren")
return nil, initialCursor, false
}
_, cursor, ok = parseToken(tokens, cursor, rightParenToken)
if !ok {
helpMessage(tokens, cursor, "Expected closing paren")
return nil, initialCursor, false
}
} else {
exp, cursor, ok = parseLiteralExpression(tokens, cursor)
if !ok {
return nil, initialCursor, false
}
}
lastCursor := cursor
outer:
for cursor < uint(len(tokens)) {
for _, d := range delimiters {
_, _, ok = parseToken(tokens, cursor, d)
if ok {
break outer
}
}
binOps := []token{
tokenFromKeyword(andKeyword),
tokenFromKeyword(orKeyword),
tokenFromSymbol(eqSymbol),
tokenFromSymbol(neqSymbol),
tokenFromSymbol(concatSymbol),
tokenFromSymbol(plusSymbol),
}
var op *token = nil
for _, bo := range binOps {
var t *token
t, cursor, ok = parseToken(tokens, cursor, bo)
if ok {
op = t
break
}
}
if op == nil {
helpMessage(tokens, cursor, "Expected binary operator")
return nil, initialCursor, false
}
bp := op.bindingPower()
if bp < minBp {
cursor = lastCursor
break
}
b, newCursor, ok := parseExpression(tokens, cursor, delimiters, bp)
if !ok {
helpMessage(tokens, cursor, "Expected right operand")
return nil, initialCursor, false
}
exp = &expression{
binary: &binaryExpression{
*exp,
*b,
*op,
},
kind: binaryKind,
}
cursor = newCursor
lastCursor = cursor
}
return exp, cursor, true
}
Now that we have this general parse expression helper in place, we can
add support for parsing WHERE
in SELECT
statements.
Parsing WHERE
This part's pretty easy. We modify the existing
parseSelectStatement
to search for an optional
WHERE
token followed by an expression.
func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
var ok bool
cursor := initialCursor
_, cursor, ok = parseToken(tokens, cursor, tokenFromKeyword(selectKeyword))
if !ok {
return nil, initialCursor, false
}
slct := SelectStatement{}
fromToken := tokenFromKeyword(fromKeyword)
item, newCursor, ok := parseSelectItem(tokens, cursor, []token{fromToken, delimiter})
if !ok {
return nil, initialCursor, false
}
slct.item = item
cursor = newCursor
whereToken := tokenFromKeyword(whereKeyword)
delimiters := []token{delimiter, whereToken}
_, cursor, ok = parseToken(tokens, cursor, fromToken)
if ok {
from, newCursor, ok := parseFromItem(tokens, cursor, delimiters)
if !ok {
helpMessage(tokens, cursor, "Expected FROM item")
return nil, initialCursor, false
}
slct.from = from
cursor = newCursor
}
_, cursor, ok = parseToken(tokens, cursor, whereToken)
if ok {
where, newCursor, ok := parseExpression(tokens, cursor, []token{delimiter}, 0)
if !ok {
helpMessage(tokens, cursor, "Expected WHERE conditionals")
return nil, initialCursor, false
}
slct.where = where
cursor = newCursor
}
return &slct, cursor, true
}
Now we're all done with parsing binary expressions and
WHERE
filters! If in doubt, refer to
parser.go
in the project.
Re-thinking query execution
In the first post in this series, we didn't establish any standard way
for interpreting an expression in any kind of statement. In SQL
though, every expression is always run in the context of a row in a
table. We'll handle cases like SELECT 1
and INSERT INTO
users VALUES (1)
by creating a table with a single empty row to act
as the context.
This requires a bit of re-architecting. So we'll rewrite the
memory.go
implementation in this post from scratch.
We'll also stop panic
-ing when things go wrong. Instead
we'll print a message. This allows the REPL loop to keep going.
Memory cells
Again the fundamental blocks of memory in the table will be an untyped array of bytes. We'll provide conversion methods from this memory cell into integers, strings, and boolean Go values.
type MemoryCell []byte
func (mc MemoryCell) AsInt() int32 {
var i int32
err := binary.Read(bytes.NewBuffer(mc), binary.BigEndian, &i)
if err != nil {
fmt.Printf("Corrupted data [%s]: %s\n", mc, err)
return 0
}
return i
}
func (mc MemoryCell) AsText() string {
return string(mc)
}
func (mc MemoryCell) AsBool() bool {
return len(mc) != 0
}
func (mc MemoryCell) equals(b MemoryCell) bool {
// Seems verbose but need to make sure if one is nil, the
// comparison still fails quickly
if mc == nil || b == nil {
return mc == nil && b == nil
}
return bytes.Compare(mc, b) == 0
}
We'll also extend the Cell
interface in
backend.go
to support the new boolean type.
package gosql
type ColumnType uint
const (
TextType ColumnType = iota
IntType
BoolType
)
type Cell interface {
AsText() string
AsInt() int32
AsBool() bool
}
...
Finally, we need a way for mapping a Go value into a memory cell.
func literalToMemoryCell(t *token) MemoryCell {
if t.kind == numericKind {
buf := new(bytes.Buffer)
i, err := strconv.Atoi(t.value)
if err != nil {
fmt.Printf("Corrupted data [%s]: %s\n", t.value, err)
return MemoryCell(nil)
}
// TODO: handle bigint
err = binary.Write(buf, binary.BigEndian, int32(i))
if err != nil {
fmt.Printf("Corrupted data [%s]: %s\n", string(buf.Bytes()), err)
return MemoryCell(nil)
}
return MemoryCell(buf.Bytes())
}
if t.kind == stringKind {
return MemoryCell(t.value)
}
if t.kind == boolKind {
if t.value == "true" {
return MemoryCell([]byte{1})
} else {
return MemoryCell(nil)
}
}
return nil
}
And we'll provide global true
and false
values:
var (
trueToken = token{kind: boolKind, value: "true"}
falseToken = token{kind: boolKind, value: "false"}
trueMemoryCell = literalToMemoryCell(&trueToken)
falseMemoryCell = literalToMemoryCell(&falseToken)
)
Tables
A table has a list of rows (an array of memory cells) and a list of column names and types.
type table struct {
columns []string
columnTypes []ColumnType
rows [][]MemoryCell
}
Finally we'll add a series of methods on table
that,
given a row index, interprets an expression AST against that row in
the table.
Interpreting literals
First we'll implement evaluateLiteralCell
that will look
up an identifier or return the value of integers, strings, and
booleans.
func (t *table) evaluateLiteralCell(rowIndex uint, exp expression) (MemoryCell, string, ColumnType, error) {
if exp.kind != literalKind {
return nil, "", 0, ErrInvalidCell
}
lit := exp.literal
if lit.kind == identifierKind {
for i, tableCol := range t.columns {
if tableCol == lit.value {
return t.rows[rowIndex][i], tableCol, t.columnTypes[i], nil
}
}
return nil, "", 0, ErrColumnDoesNotExist
}
columnType := IntType
if lit.kind == stringKind {
columnType = TextType
} else if lit.kind == boolKind {
columnType = BoolType
}
return literalToMemoryCell(lit), "?column?", columnType, nil
}
Interpreting binary expressions
Now we can implement evaluateBinaryCell
that will
evaluate it's two sub-expressions and combine them together according
to the operator. The SQL operators we have defined so far do no
coercion. So we'll fail immediately if the two sides of the operation
are not of the same type. Additionally, the concatenation and addition
operators require that their arguments are strings and numbers,
respectively.
func (t *table) evaluateBinaryCell(rowIndex uint, exp expression) (MemoryCell, string, ColumnType, error) {
if exp.kind != binaryKind {
return nil, "", 0, ErrInvalidCell
}
bexp := exp.binary
l, _, lt, err := t.evaluateCell(rowIndex, bexp.a)
if err != nil {
return nil, "", 0, err
}
r, _, rt, err := t.evaluateCell(rowIndex, bexp.b)
if err != nil {
return nil, "", 0, err
}
switch bexp.op.kind {
case symbolKind:
switch symbol(bexp.op.value) {
case eqSymbol:
eq := l.equals(r)
if lt == TextType && rt == TextType && eq {
return trueMemoryCell, "?column?", BoolType, nil
}
if lt == IntType && rt == IntType && eq {
return trueMemoryCell, "?column?", BoolType, nil
}
if lt == BoolType && rt == BoolType && eq {
return trueMemoryCell, "?column?"
April 10, 2020
New ideas often emerge or are developed in response to extreme needs arising during a social crisis – What our team is reading
April 04, 2020
Studying foreign languages with inbox zero
The only time I've been able to seriously, rapidly improve my ability to speak a foreign language was through intensive language courses in college. I was forced to actively speak, read, and write Chinese for 6-8 hours a week (1-2 hours every day). Then study another 5-10 hours a week in preparation for the active sessions. I went three semesters like this before I left school.
I've been trying to recreate that intensity since and mostly failed. After marrying a Korean, I've redirected the little effort I can muster to learning Korean. Aside from stints over the years (mostly for a month or two before or after a trip to Korea), I haven't been able to keep up any practice.
One thing I've tried over the years to commit myself to learning a number of different topics is to set up recurring calendar invites: "Study Linux", "Study TCP/IP", "Study Korean", etc.
This has mostly failed too. However, I do always look at the invites as I get notified.
I keep inbox zero and I check my email many times a day, marking each email read dilligently when I no longer need to think about it.
Tools like Quizlet, Anki, or even Duolingo let you self-learn vocabulary when you feel like it. But basically no service will try to keep giving you exposure to some set of topics whether you spend time on it or not.
The most important thing I can think of is forced exposure to vocabulary. So I've been planning for some time to hook up a list of the one thousand most common Korean words to scheduled emails.
This weekend I finally got around to scripting the Google Calendar API against the words list. I have an event for each word for the next 1000 days. Each day I receive a summary email including all events of the day and the new word is part of it.
This is a pretty indirect approach but it's pretty simple to set up. It's not very easy to reconfigure.
The code for doing this is available on Github if you're interested. And if you know a service that can build and manage scheduled notifications against a spreadsheet or database I'd rather be looking at that.
We'll see how this works out.
Daily new words in my inbox feels like the only way I can "force" myself to get exposed to new vocabulary. Wish there were a service for scheduling notifications from a spreadsheet. Finally got to scripting GCal's API populating daily events from 1000 most common Korean words
— Phil Eaton (@phil_eaton) April 4, 2020