March 13, 2020
March 11, 2020
Create a Static Application to Analyze +50M Github Events
March 06, 2020
Writing a SQL database from scratch in Go: 1. SELECT, INSERT, CREATE and a REPL
Next in database basics:
<! forgive me, for I have sinned >
2. binary expressions and WHERE filters
3. indexes
4. a database/sql driver
In this series we'll write a rudimentary database from scratch in Go. Project source code is available on Github.
In this first post we'll build enough of a parser to run some simple
CREATE
, INSERT
, and SELECT
queries. Then we'll build an in-memory backend
supporting TEXT
and INT
types and write a
basic REPL.
We'll be able to support the following interaction:
$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 | Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil | 1 |
| Kate | 2 |
ok
The first stage will be to map a SQL source into a list of tokens
(lexing). Then we'll call parse functions to find individual SQL
statements (such as SELECT
). These parse functions will
in turn call their own helper functions to find patterns of
recursively parseable chunks, keywords, symbols (like parenthesis),
identifiers (like a table name), and numeric or string literals.
Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.
This post assumes a basic understanding of parsing concepts. We
won't skip any code, but also won't go into great detail on why we
structure the way we do.
For a simpler introduction to parsing and parsing concepts,
see this post on
parsing JSON.
Lexing
The lexer is responsible for finding every distinct group of characters in source code: tokens. This will consist primarily of identifiers, numbers, strings, and symbols.
What follows is a second, more orthodox pass at lexing. The first
pass took a number of shortcuts and couldn't handle spaces in
strings, for example.
Here is the
relevant pull request in gosql if you are curious.
The gist of the logic will be to pass control to a helper function for each kind of token. If the helper function succeeds in finding a token, it will return true and the location for the lexer to start at next. It will continue doing this until it reaches the end of the source.
First off, we'll define a few types and constants for use
in lexer.go
:
package gosql
import (
"fmt"
"strings"
)
type location struct {
line uint
col uint
}
type keyword string
const (
selectKeyword keyword = "select"
fromKeyword keyword = "from"
asKeyword keyword = "as"
tableKeyword keyword = "table"
createKeyword keyword = "create"
insertKeyword keyword = "insert"
intoKeyword keyword = "into"
valuesKeyword keyword = "values"
intKeyword keyword = "int"
textKeyword keyword = "text"
)
type symbol string
const (
semicolonSymbol symbol = ";"
asteriskSymbol symbol = "*"
commaSymbol symbol = ","
leftparenSymbol symbol = "("
rightparenSymbol symbol = ")"
)
type tokenKind uint
const (
keywordKind tokenKind = iota
symbolKind
identifierKind
stringKind
numericKind
)
type token struct {
value string
kind tokenKind
loc location
}
type cursor struct {
pointer uint
loc location
}
func (t *token) equals(other *token) bool {
return t.value == other.value && t.kind == other.kind
}
type lexer func(string, cursor) (*token, cursor, bool)
Next we'll write out the main loop:
func lex(source string) ([]*token, error) {
tokens := []*token{}
cur := cursor{}
lex:
for cur.pointer < uint(len(source)) {
lexers := []lexer{lexKeyword, lexSymbol, lexString, lexNumeric, lexIdentifier}
for _, l := range lexers {
if token, newCursor, ok := l(source, cur); ok {
cur = newCursor
// Omit nil tokens for valid, but empty syntax like newlines
if token != nil {
tokens = append(tokens, token)
}
continue lex
}
}
hint := ""
if len(tokens) > 0 {
hint = " after " + tokens[len(tokens)-1].value
}
return nil, fmt.Errorf("Unable to lex token%s, at %d:%d", hint, cur.loc.line, cur.loc.col)
}
return tokens, nil
}
Then we'll write a helper for each kind of fundemental token.
Analyzing numbers
Numbers are the most complex. So we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.
func lexNumeric(source string, ic cursor) (*token, cursor, bool) {
cur := ic
periodFound := false
expMarkerFound := false
for ; cur.pointer < uint(len(source)); cur.pointer++ {
c := source[cur.pointer]
cur.loc.col++
isDigit := c >= '0' && c <= '9'
isPeriod := c == '.'
isExpMarker := c == 'e'
// Must start with a digit or period
if cur.pointer == ic.pointer {
if !isDigit && !isPeriod {
return nil, ic, false
}
periodFound = isPeriod
continue
}
if isPeriod {
if periodFound {
return nil, ic, false
}
periodFound = true
continue
}
if isExpMarker {
if expMarkerFound {
return nil, ic, false
}
// No periods allowed after expMarker
periodFound = true
expMarkerFound = true
// expMarker must be followed by digits
if cur.pointer == uint(len(source)-1) {
return nil, ic, false
}
cNext := source[cur.pointer+1]
if cNext == '-' || cNext == '+' {
cur.pointer++
cur.loc.col++
}
continue
}
if !isDigit {
break
}
}
// No characters accumulated
if cur.pointer == ic.pointer {
return nil, ic, false
}
return &token{
value: source[ic.pointer:cur.pointer],
loc: ic.loc,
kind: numericKind,
}, cur, true
}
Analyzing strings
Strings must start and end with a single apostrophe. They can contain a single apostophe if it is followed by another single apostrophe. We'll put this kind of character delimited lexing logic into a helper function so we can use it again when analyzing identifiers.
func lexCharacterDelimited(source string, ic cursor, delimiter byte) (*token, cursor, bool) {
cur := ic
if len(source[cur.pointer:]) == 0 {
return nil, ic, false
}
if source[cur.pointer] != delimiter {
return nil, ic, false
}
cur.loc.col++
cur.pointer++
var value []byte
for ; cur.pointer < uint(len(source)); cur.pointer++ {
c := source[cur.pointer]
if c == delimiter {
// SQL escapes are via double characters, not backslash.
if cur.pointer+1 >= uint(len(source)) || source[cur.pointer+1] != delimiter {
return &token{
value: string(value),
loc: ic.loc,
kind: stringKind,
}, cur, true
} else {
value = append(value, delimiter)
cur.pointer++
cur.loc.col++
}
}
value = append(value, c)
cur.loc.col++
}
return nil, ic, false
}
func lexString(source string, ic cursor) (*token, cursor, bool) {
return lexCharacterDelimited(source, ic, '\'')
}
Analyzing symbols and keywords
Symbols come from a fixed set of strings, so they're easy to compare against. Whitespace should be thrown away.
func lexSymbol(source string, ic cursor) (*token, cursor, bool) {
c := source[ic.pointer]
cur := ic
// Will get overwritten later if not an ignored syntax
cur.pointer++
cur.loc.col++
switch c {
// Syntax that should be thrown away
case '\n':
cur.loc.line++
cur.loc.col = 0
fallthrough
case '\t':
fallthrough
case ' ':
return nil, cur, true
}
// Syntax that should be kept
symbols := []symbol{
commaSymbol,
leftParenSymbol,
rightParenSymbol,
semicolonSymbol,
asteriskSymbol,
}
var options []string
for _, s := range symbols {
options = append(options, string(s))
}
// Use `ic`, not `cur`
match := longestMatch(source, ic, options)
// Unknown character
if match == "" {
return nil, ic, false
}
cur.pointer = ic.pointer + uint(len(match))
cur.loc.col = ic.loc.col + uint(len(match))
return &token{
value: match,
loc: ic.loc,
kind: symbolKind,
}, cur, true
}
Keywords are even simpler, and use the same longestMatch
helper.
func lexKeyword(source string, ic cursor) (*token, cursor, bool) {
cur := ic
keywords := []keyword{
selectKeyword,
insertKeyword,
valuesKeyword,
tableKeyword,
createKeyword,
whereKeyword,
fromKeyword,
intoKeyword,
textKeyword,
}
var options []string
for _, k := range keywords {
options = append(options, string(k))
}
match := longestMatch(source, ic, options)
if match == "" {
return nil, ic, false
}
cur.pointer = ic.pointer + uint(len(match))
cur.loc.col = ic.loc.col + uint(len(match))
return &token{
value: match,
kind: kind,
loc: ic.loc,
}, cur, true
}
And finally we implement the longestMatch
helper:
// longestMatch iterates through a source string starting at the given
// cursor to find the longest matching substring among the provided
// options
func longestMatch(source string, ic cursor, options []string) string {
var value []byte
var skipList []int
var match string
cur := ic
for cur.pointer < uint(len(source)) {
value = append(value, strings.ToLower(string(source[cur.pointer]))...)
cur.pointer++
match:
for i, option := range options {
for _, skip := range skipList {
if i == skip {
continue match
}
}
// Deal with cases like INT vs INTO
if option == string(value) {
skipList = append(skipList, i)
if len(option) > len(match) {
match = option
}
continue
}
sharesPrefix := string(value) == option[:cur.pointer-ic.pointer]
tooLong := len(value) > len(option)
if tooLong || !sharesPrefix {
skipList = append(skipList, i)
}
}
if len(skipList) == len(options) {
break
}
}
return match
}
Analyzing identifiers
An identifier is either a double-quoted string or a group of characters starting with an alphabetical character and possibly containing numbers and underscores.
func lexIdentifier(source string, ic cursor) (*token, cursor, bool) {
// Handle separately if is a double-quoted identifier
if token, newCursor, ok := lexCharacterDelimited(source, ic, '"'); ok {
return token, newCursor, true
}
cur := ic
c := source[cur.pointer]
// Other characters count too, big ignoring non-ascii for now
isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
if !isAlphabetical {
return nil, ic, false
}
cur.pointer++
cur.loc.col++
value := []byte{c}
for ; cur.pointer < uint(len(source)); cur.pointer++ {
c = source[cur.pointer]
// Other characters count too, big ignoring non-ascii for now
isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
isNumeric := c >= '0' && c <= '9'
if isAlphabetical || isNumeric || c == '$' || c == '_' {
value = append(value, c)
cur.loc.col++
continue
}
break
}
if len(value) == 0 {
return nil, ic, false
}
return &token{
// Unquoted dentifiers are case-insensitive
value: strings.ToLower(string(value)),
loc: ic.loc,
kind: identifierKind,
}, cur, true
}
And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.
AST model
At the highest level, an AST is a collection of statements:
package main
type Ast struct {
Statements []*Statement
}
A statement, for now, is one of INSERT
,
CREATE
, or SELECT
:
type AstKind uint
const (
SelectKind AstKind = iota
CreateTableKind
InsertKind
)
type Statement struct {
SelectStatement *SelectStatement
CreateTableStatement *CreateTableStatement
InsertStatement *InsertStatement
Kind AstKind
}
INSERT
An insert statement, for now, has a table name and a list of values to insert:
type InsertStatement struct {
table token
values *[]*expression
}
An expression is a literal token or (in the future) a function call or inline operation:
type expressionKind uint
const (
literalKind expressionKind = iota
)
type expression struct {
literal *token
kind expressionKind
}
CREATE
A create statement, for now, has a table name and a list of column names and types:
type columnDefinition struct {
name token
datatype token
}
type CreateTableStatement struct {
name token
cols *[]*columnDefinition
}
SELECT
A select statement, for now, has a table name and a list of column names:
type SelectStatement struct {
item []*expression
from token
}
And that's it for the AST.
Parsing
The Parse
entrypoint will take a list of tokens and
attempt to parse statements, separated by a semi-colon, until it
reaches the last token.
In general our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.
package main
import (
"errors"
"fmt"
)
func tokenFromKeyword(k keyword) token {
return token{
kind: keywordKind,
value: string(k),
}
}
func tokenFromSymbol(s symbol) token {
return token{
kind: symbolKind,
value: string(s),
}
}
func expectToken(tokens []*token, cursor uint, t token) bool {
if cursor >= uint(len(tokens)) {
return false
}
return t.equals(tokens[cursor])
}
func helpMessage(tokens []*token, cursor uint, msg string) {
var c *token
if cursor < uint(len(tokens)) {
c = tokens[cursor]
} else {
c = tokens[cursor-1]
}
fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}
func Parse(source string) (*Ast, error) {
tokens, err := lex(source)
if err != nil {
return nil, err
}
a := Ast{}
cursor := uint(0)
for cursor < uint(len(tokens)) {
stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
if !ok {
helpMessage(tokens, cursor, "Expected statement")
return nil, errors.New("Failed to parse, expected statement")
}
cursor = newCursor
a.Statements = append(a.Statements, stmt)
atLeastOneSemicolon := false
for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
cursor++
atLeastOneSemicolon = true
}
if !atLeastOneSemicolon {
helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
return nil, errors.New("Missing semi-colon between statements")
}
}
return &a, nil
}
Parsing statements
Each statement will be one of INSERT
,
CREATE
, or SELECT
. The
parseStatement
helper will call a helper on each of these
statement types and return true
if one of them succeeds
in parsing.
func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
cursor := initialCursor
// Look for a SELECT statement
semicolonToken := tokenFromSymbol(semicolonSymbol)
slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
if ok {
return &Statement{
Kind: SelectKind,
SelectStatement: slct,
}, newCursor, true
}
// Look for a INSERT statement
inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
if ok {
return &Statement{
Kind: InsertKind,
InsertStatement: inst,
}, newCursor, true
}
// Look for a CREATE statement
crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
if ok {
return &Statement{