Previously in database basics:
<! forgive me, for I have sinned >
1. SELECT, INSERT, CREATE and a REPL
Next in database basics:
3. indexes
4. a database/sql driver
In this post, we'll extend gosql
to support binary expressions and very simple filtering on SELECT
results via WHERE. We'll introduce a general mechanism for
interpreting an expression on a row in a table. The expression may be
an identifier (where the result is the value of the cell corresponding
to that column in the row), a numeric literal, a combination via a
binary expression, etc.
The following interactions will be possible:
# CREATE TABLE users (name TEXT, age INT);
ok
# INSERT INTO users VALUES ('Stephen', 16);
ok
# SELECT name, age FROM users;
name | age
----------+------
Stephen | 16
(1 result)
ok
# INSERT INTO users VALUES ('Adrienne', 23);
ok
# SELECT age + 2, name FROM users WHERE age = 23;
age | name
------+-----------
25 | Adrienne
(1 result)
ok
# SELECT name FROM users;
name
------------
Stephen
Adrienne
(2 results)
ok
The changes we'll make in this post are roughly a walk through of
this
commit.
Boilerplate updates
There are a few updates to pick up that I won't go into in this
post. Grab the following files from the main repo:
- lexer.go
- The big change here is to use the same keyword matching algorithm
for symbols. This allows us to support symbols that are longer
than one character.
- This file also now includes the following keywords and symbols:
and
, or
, true
,
false
, =
,
,
||
, and +
.
- cmd/main.go
Parsing boilerplate
We'll redefine three helper functions in parser.go
before
going further:
parseToken
, parseTokenKind
, and
helpMessage
.
The parseToken
helper will consume a token if it matches
the one provided as an argument (ignoring location).
func parseToken(tokens []*token, initialCursor uint, t token) (*token, uint, bool) {
cursor := initialCursor
if cursor >= uint(len(tokens)) {
return nil, initialCursor, false
}
if p := tokens[cursor]; t.equals(p) {
return p, cursor + 1, true
}
return nil, initialCursor, false
}
The parseTokenKind
helper will consume a token if it is
the same kind as an argument provided.
func parseTokenKind(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
cursor := initialCursor
if cursor >= uint(len(tokens)) {
return nil, initialCursor, false
}
current := tokens[cursor]
if current.kind == kind {
return current, cursor + 1, true
}
return nil, initialCursor, false
}
And the helpMessage
helper will give an indication of
where in a program something happened.
func helpMessage(tokens []*token, cursor uint, msg string) {
var c *token
if cursor+1 < uint(len(tokens)) {
c = tokens[cursor+1]
} else {
c = tokens[cursor]
}
fmt.Printf("[%d,%d]: %s, near: %s\n", c.loc.line, c.loc.col, msg, c.value)
}
Parsing binary expressions
Next we'll extend the AST structure in ast.go
to
support a "binary kind" of expression. The binary expression will have
two sub-expressions and an operator.
const (
literalKind expressionKind
binaryKind
)
type binaryExpression struct {
a expression
b expression
op token
}
type expression struct {
literal *token
binary *binaryExpression
kind expressionKind
}
We'll use Pratt parsing to handle operator precedence. There is an
excellent introduction to this technique
here.
If at the beginning of parsing we see a left parenthesis, we'll
consume it and parse an expression within it. Then we'll look for a
right parenthesis. Otherwise we'll look for a non-binary expression
first (e.g. symbol, number).
func parseExpression(tokens []*token, initialCursor uint, delimiters []token, minBp uint) (*expression, uint, bool) {
cursor := initialCursor
var exp *expression
_, newCursor, ok := parseToken(tokens, cursor, tokenFromSymbol(leftParenSymbol))
if ok {
cursor = newCursor
rightParenToken := tokenFromSymbol(rightParenSymbol)
exp, cursor, ok = parseExpression(tokens, cursor, append(delimiters, rightParenToken), minBp)
if !ok {
helpMessage(tokens, cursor, "Expected expression after opening paren")
return nil, initialCursor, false
}
_, cursor, ok = parseToken(tokens, cursor, rightParenToken)
if !ok {
helpMessage(tokens, cursor, "Expected closing paren")
return nil, initialCursor, false
}
} else {
exp, cursor, ok = parseLiteralExpression(tokens, cursor)
if !ok {
return nil, initialCursor, false
}
}
...
return exp, cursor, true
}
Then we'll look for a binary operator (e.g. =
,
and
) or delimiter. If we find an operator and it of
lesser "binding power" than the current minimum (minBp
passed as an argument to the parse function with a default value of
0
), we'll return the current expression.
...
lastCursor := cursor
outer:
for cursor < uint(len(tokens)) {
for _, d := range delimiters {
_, _, ok = parseToken(tokens, cursor, d)
if ok {
break outer
}
}
binOps := []token{
tokenFromKeyword(andKeyword),
tokenFromKeyword(orKeyword),
tokenFromSymbol(eqSymbol),
tokenFromSymbol(neqSymbol),
tokenFromSymbol(concatSymbol),
tokenFromSymbol(plusSymbol),
}
var op *token = nil
for _, bo := range binOps {
var t *token
t, cursor, ok = parseToken(tokens, cursor, bo)
if ok {
op = t
break
}
}
if op == nil {
helpMessage(tokens, cursor, "Expected binary operator")
return nil, initialCursor, false
}
bp := op.bindingPower()
if bp < minBp {
cursor = lastCursor
break
}
...
}
return exp, cursor, true
The bindingPower
function on tokens can be defined for
now such that sum and concatenation have the highest binding power,
followed by equality operations, then boolean operators, and then
everything else at zero.
func (t token) bindingPower() uint {
switch t.kind {
case keywordKind:
switch keyword(t.value) {
case andKeyword:
fallthrough
case orKeyword:
return 1
}
case symbolKind:
switch symbol(t.value) {
case eqSymbol:
fallthrough
case neqSymbol:
fallthrough
case concatSymbol:
fallthrough
case plusSymbol:
return 3
}
}
return 0
}
Back in parseExpression
, if the new operator has greater
binding power we'll parse the next operand expression (a recursive
call, passing the binding power of the new operator as the new
minBp
).
Upon completion, the current expression (the return value of the
recursive call) is set to a new binary expression containing the
previously current expression on the left and the just-parsed
expression on the right.
...
b, newCursor, ok := parseExpression(tokens, cursor, delimiters, bp)
if !ok {
helpMessage(tokens, cursor, "Expected right operand")
return nil, initialCursor, false
}
exp = &expression{
binary: &binaryExpression{
*exp,
*b,
*op,
},
kind: binaryKind,
}
cursor = newCursor
lastCursor = cursor
}
return exp, cursor, true
}
All together:
func parseExpression(tokens []*token, initialCursor uint, delimiters []token, minBp uint) (*expression, uint, bool) {
cursor := initialCursor
var exp *expression
_, newCursor, ok := parseToken(tokens, cursor, tokenFromSymbol(leftParenSymbol))
if ok {
cursor = newCursor
rightParenToken := tokenFromSymbol(rightParenSymbol)
exp, cursor, ok = parseExpression(tokens, cursor, append(delimiters, rightParenToken), minBp)
if !ok {
helpMessage(tokens, cursor, "Expected expression after opening paren")
return nil, initialCursor, false
}
_, cursor, ok = parseToken(tokens, cursor, rightParenToken)
if !ok {
helpMessage(tokens, cursor, "Expected closing paren")
return nil, initialCursor, false
}
} else {
exp, cursor, ok = parseLiteralExpression(tokens, cursor)
if !ok {
return nil, initialCursor, false
}
}
lastCursor := cursor
outer:
for cursor < uint(len(tokens)) {
for _, d := range delimiters {
_, _, ok = parseToken(tokens, cursor, d)
if ok {
break outer
}
}
binOps := []token{
tokenFromKeyword(andKeyword),
tokenFromKeyword(orKeyword),
tokenFromSymbol(eqSymbol),
tokenFromSymbol(neqSymbol),
tokenFromSymbol(concatSymbol),
tokenFromSymbol(plusSymbol),
}
var op *token = nil
for _, bo := range binOps {
var t *token
t, cursor, ok = parseToken(tokens, cursor, bo)
if ok {
op = t
break
}
}
if op == nil {
helpMessage(tokens, cursor, "Expected binary operator")
return nil, initialCursor, false
}
bp := op.bindingPower()
if bp < minBp {
cursor = lastCursor
break
}
b, newCursor, ok := parseExpression(tokens, cursor, delimiters, bp)
if !ok {
helpMessage(tokens, cursor, "Expected right operand")
return nil, initialCursor, false
}
exp = &expression{
binary: &binaryExpression{
*exp,
*b,
*op,
},
kind: binaryKind,
}
cursor = newCursor
lastCursor = cursor
}
return exp, cursor, true
}
Now that we have this general parse expression helper in place, we can
add support for parsing WHERE
in SELECT
statements.
Parsing WHERE
This part's pretty easy. We modify the existing
parseSelectStatement
to search for an optional
WHERE
token followed by an expression.
func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
var ok bool
cursor := initialCursor
_, cursor, ok = parseToken(tokens, cursor, tokenFromKeyword(selectKeyword))
if !ok {
return nil, initialCursor, false
}
slct := SelectStatement{}
fromToken := tokenFromKeyword(fromKeyword)
item, newCursor, ok := parseSelectItem(tokens, cursor, []token{fromToken, delimiter})
if !ok {
return nil, initialCursor, false
}
slct.item = item
cursor = newCursor
whereToken := tokenFromKeyword(whereKeyword)
delimiters := []token{delimiter, whereToken}
_, cursor, ok = parseToken(tokens, cursor, fromToken)
if ok {
from, newCursor, ok := parseFromItem(tokens, cursor, delimiters)
if !ok {
helpMessage(tokens, cursor, "Expected FROM item")
return nil, initialCursor, false
}
slct.from = from
cursor = newCursor
}
_, cursor, ok = parseToken(tokens, cursor, whereToken)
if ok {
where, newCursor, ok := parseExpression(tokens, cursor, []token{delimiter}, 0)
if !ok {
helpMessage(tokens, cursor, "Expected WHERE conditionals")
return nil, initialCursor, false
}
slct.where = where
cursor = newCursor
}
return &slct, cursor, true
}
Now we're all done with parsing binary expressions and
WHERE
filters! If in doubt, refer to
parser.go
in the project.
Re-thinking query execution
In the first post in this series, we didn't establish any standard way
for interpreting an expression in any kind of statement. In SQL
though, every expression is always run in the context of a row in a
table. We'll handle cases like SELECT 1
and INSERT INTO
users VALUES (1)
by creating a table with a single empty row to act
as the context.
This requires a bit of re-architecting. So we'll rewrite the
memory.go
implementation in this post from scratch.
We'll also stop panic
-ing when things go wrong. Instead
we'll print a message. This allows the REPL loop to keep going.
Memory cells
Again the fundamental blocks of memory in the table will be an untyped
array of bytes. We'll provide conversion methods from this memory cell
into integers, strings, and boolean Go values.
type MemoryCell []byte
func (mc MemoryCell) AsInt() int32 {
var i int32
err := binary.Read(bytes.NewBuffer(mc), binary.BigEndian, &i)
if err != nil {
fmt.Printf("Corrupted data [%s]: %s\n", mc, err)
return 0
}
return i
}
func (mc MemoryCell) AsText() string {
return string(mc)
}
func (mc MemoryCell) AsBool() bool {
return len(mc) != 0
}
func (mc MemoryCell) equals(b MemoryCell) bool {
// Seems verbose but need to make sure if one is nil, the
// comparison still fails quickly
if mc == nil || b == nil {
return mc == nil && b == nil
}
return bytes.Compare(mc, b) == 0
}
We'll also extend the Cell
interface in
backend.go
to support the new boolean type.
package gosql
type ColumnType uint
const (
TextType ColumnType = iota
IntType
BoolType
)
type Cell interface {
AsText() string
AsInt() int32
AsBool() bool
}
...
Finally, we need a way for mapping a Go value into a memory
cell.
func literalToMemoryCell(t *token) MemoryCell {
if t.kind == numericKind {
buf := new(bytes.Buffer)
i, err := strconv.Atoi(t.value)
if err != nil {
fmt.Printf("Corrupted data [%s]: %s\n", t.value, err)
return MemoryCell(nil)
}
// TODO: handle bigint
err = binary.Write(buf, binary.BigEndian, int32(i))
if err != nil {
fmt.Printf("Corrupted data [%s]: %s\n", string(buf.Bytes()), err)
return MemoryCell(nil)
}
return MemoryCell(buf.Bytes())
}
if t.kind == stringKind {
return MemoryCell(t.value)
}
if t.kind == boolKind {
if t.value == "true" {
return MemoryCell([]byte{1})
} else {
return MemoryCell(nil)
}
}
return nil
}
And we'll provide global true
and false
values:
var (
trueToken = token{kind: boolKind, value: "true"}
falseToken = token{kind: boolKind, value: "false"}
trueMemoryCell = literalToMemoryCell(&trueToken)
falseMemoryCell = literalToMemoryCell(&falseToken)
)
Tables
A table has a list of rows (an array of memory cells) and a list of
column names and types.
type table struct {
columns []string
columnTypes []ColumnType
rows [][]MemoryCell
}
Finally we'll add a series of methods on table
that,
given a row index, interprets an expression AST against that row in
the table.
Interpreting literals
First we'll implement evaluateLiteralCell
that will look
up an identifier or return the value of integers, strings, and
booleans.
func (t *table) evaluateLiteralCell(rowIndex uint, exp expression) (MemoryCell, string, ColumnType, error) {
if exp.kind != literalKind {
return nil, "", 0, ErrInvalidCell
}
lit := exp.literal
if lit.kind == identifierKind {
for i, tableCol := range t.columns {
if tableCol == lit.value {
return t.rows[rowIndex][i], tableCol, t.columnTypes[i], nil
}
}
return nil, "", 0, ErrColumnDoesNotExist
}
columnType := IntType
if lit.kind == stringKind {
columnType = TextType
} else if lit.kind == boolKind {
columnType = BoolType
}
return literalToMemoryCell(lit), "?column?", columnType, nil
}
Interpreting binary expressions
Now we can implement evaluateBinaryCell
that will
evaluate it's two sub-expressions and combine them together according
to the operator. The SQL operators we have defined so far do no
coercion. So we'll fail immediately if the two sides of the operation
are not of the same type. Additionally, the concatenation and addition
operators require that their arguments are strings and numbers,
respectively.
func (t *table) evaluateBinaryCell(rowIndex uint, exp expression) (MemoryCell, string, ColumnType, error) {
if exp.kind != binaryKind {
return nil, "", 0, ErrInvalidCell
}
bexp := exp.binary
l, _, lt, err := t.evaluateCell(rowIndex, bexp.a)
if err != nil {
return nil, "", 0, err
}
r, _, rt, err := t.evaluateCell(rowIndex, bexp.b)
if err != nil {
return nil, "", 0, err
}
switch bexp.op.kind {
case symbolKind:
switch symbol(bexp.op.value) {
case eqSymbol:
eq := l.equals(r)
if lt == TextType && rt == TextType && eq {
return trueMemoryCell, "?column?", BoolType, nil
}
if lt == IntType && rt == IntType && eq {
return trueMemoryCell, "?column?", BoolType, nil
}
if lt == BoolType && rt == BoolType && eq {
return trueMemoryCell, "?column?"
April 10, 2020
Tinybird Engineering Blog
World War II, for example, forced innovation or accelerated development and commercialization of the jet engine, pressurized aircraft cabins, helicopters, at...
April 04, 2020
Phil Eaton
The only time I've been able to seriously, rapidly improve my ability
to speak a foreign language was through intensive language courses in
college. I was forced to actively speak, read, and write Chinese for
6-8 hours a week (1-2 hours every day). Then study another 5-10 hours
a week in preparation for the active sessions. I went three semesters
like this before I left school.
I've been trying to recreate that intensity since and mostly
failed. After marrying a Korean, I've redirected the little effort I
can muster to learning Korean. Aside from stints over the years
(mostly for a month or two before or after a trip to Korea), I haven't
been able to keep up any practice.
One thing I've tried over the years to commit myself to learning a
number of different topics is to set up recurring calendar invites:
"Study Linux", "Study TCP/IP", "Study Korean", etc.
This has mostly failed too. However, I do always look at the
invites as I get notified.
I keep inbox zero and I check my email many times a day, marking each
email read dilligently when I no longer need to think about it.
Tools like Quizlet, Anki, or even Duolingo let you self-learn
vocabulary when you feel like it. But basically no service
will try to keep giving you exposure to some set of topics whether you
spend time on it or not.
The most important thing I can think of is forced exposure to
vocabulary. So I've been planning for some time to hook up a list of
the one thousand most common Korean words to scheduled emails.
This weekend I finally got around to scripting the Google Calendar API
against the words list. I have an event for each word for the next
1000 days. Each day I receive a summary email including all events of
the day and the new word is part of it.
This is a pretty indirect approach but it's pretty simple to set
up. It's not very easy to reconfigure.
The code for doing this is available on
Github if you're
interested. And if you know a service that can build and manage
scheduled notifications against a spreadsheet or database I'd rather
be looking at that.
We'll see how this works out.