a curated list of database news from authoritative sources

March 19, 2020

Enriching Kafka streams for real-time queries

If you are using Kafka to capture large quantities of events or transactional data, you are probably also looking for ways to enrich that data in real-time. Here is how to do it with Tinybird.

March 18, 2020

Reviewing the Surface Book 2

The first few paragraphs cover what I was looking for and what I considered. Then the review.

Why the Surface Book 2

I used a Macbook throughout my professional career until I had the choice a few years ago when I started my current job. Here, I ran Gentoo, then FreeBSD, then Arch, and now Windows 10 on the Dell XPS 15.

I enjoy Windows and I think Microsoft is doing a better job on hardware and software these days. At least, compared to Apple, they appear to be trying. So when my personal 2015 Macbook Pro died this year I decided to buy and run Windows at home.

On my Mac, I dealt with bad battery life for a while: running VMs, running Docker, compiling Go, running Node.js kills any battery. So I moved my development into the cloud and gained on battery life and network speeds at the cost of memory (I am paying for 4GB of RAM).

My ideal replacement was a cheaper machine that felt as good as a 2015 Macbook Pro. (The build quality has not been good since.) I was hoping not to pay more than $1000. My shortlist included the Surface Book 2, the Surface Pro X, the Surface Laptop 3, the Lenovo Yoga 14, and the LG Gram. So I went to Best Buy to try them out.

I was impressed by every Surface device. At first sight, I mistook the Surface Book and Surface Laptop for an old Macbook Pro. They both have a brushed aluminum body with a large trackpad and great keyboards. Even the Surface Pro X, which is a tablet, has an addon keyboard that is easy to type (that is, program) on.

I tried out the Lenovo Yoga 14 and it was solid, but I preferred the brushed aluminum body of the Surface devices. I did not get a chance to feel out the LG Gram.

I eliminated the Surface Laptop 3 because I like tablet mode. While the Surface Laptop 3 is a touchscreen, it is not a 2-in-1 device and does not have tablet mode.

And I eliminated the Surface Pro X because it is one of the first mainstream Windows ARM devices. While Windows on ARM is now the same operating system as Windows on a desktop, most consumer software ships x86_64 (not ARM) binaries. Windows on ARM can emulate x86 but not yet x86_64. I didn't feel like working around this on my primary personal device.

I bought the 13.5", 7th generation i5 Surface Book 2 for $999. It comes with 8GB DDR4 RAM and a 128GB SSD. I have had the device for two weeks now and I use it at least 10 hours a day.

Keyboard

The keyboard layout is standard, easy to use. The control, shift, caps, function, and alt keys are big enough that it is easy to program without staring at the keyboard. The up and down arrow keys are smaller than would be nice. But they are easier for me to find than on a 2019 Macbook Pro.

The function key is modal by default (like a Caps key) and indicates if function is enabled with a small LED. I have never seen a function key like this. I find it annoying when I turn it on.

And while there is builtin volume controls and a play/pause button, there is no media forward/back button. I assigned Ctrl+Windows+Alt+Left/Right to be media forward/back.

There is also no right Ctrl key. Instead there is a "media key" which is the equivalent of right-clicking... I guess. This is useless so I mapped it back to another Ctrl key.

Unlike macOS, which needs an app like Spectacle, Windows default window control shortcuts are great. Windows+Left to send to the left half, Windows+Right to send to the right half, Windows+Up to make full screen.

But macOS default swipe gestures are more intuitive: swipe left to go backwards, swipe right to go forwards. So I mapped this back myself.

Here is my autohotkey script.

Screen

The 13.5" screen feels top-heavy but may not actually weigh more than the keyboard/body. The bevel is larger than it feels like it should be. But the camera is in the right location: top and center.

Additionally, the default behavior when attaching/detaching the screen is to prompt you to enter/exit tablet mode rather than doing it for you. This prompt is easy to click out of and after doing so the option to switch between disappears until you reattach and detach again.

The screen isn't flush with the body when you close it. Few marketing pictures show you this, but here's one. This makes me worry something may snap if the laptop is ever slammed against a wall for some reason.

And fully open, it only goes back 120 degrees. This makes it hard to look at if it is on your legs and your legs are up higher than 90 degrees.

Finally, the headphone jack is not on the body but on the screen. This makes sense since the screen is detachable. But the jack is on the top-right corner, further away than usual. This requires me to be closer to the screen to feel like I am not pulling the screen when I am wearing headphones.

Pen

The Surface Pen is awesome and the screen's palm detection is too. I have had a lot of fun drawing on it in Paint 3. And it has been useful in annotating mockups for work too.

It costs $100 and comes with a AAAA battery. It is magnetized and sticks to the left side of the screen.

Body

As mentioned, the body is a brushed aluminum. It feels great. The power input is magnetic, which is helpful. But it uses a novel Surface-specific input rather than USB-C, so that sucks. A new charger from Microsoft costs $100.

The speakers are as good as Macbook speakers were 5 years ago. They don't have much bass. Additionally, these speakers get a little distorted at top volume.

The battery lasts 7-8 hours without charging. While this is as advertised, it is still disappointing of a new laptop in 2020 that is only running Chrome, Spotify, and Windows Terminal.

Tablet

To release the screen from the body, there is a key on the function row. However, it is not a hardware release. So when I accidentally killed the battery while the screen was flipped, I couldn't detach the screen after booting (to turn it back into a laptop) until after 10-20 minutes of charging.

The screen isn't easy to detach. It requires both hands lifting up from the base of the screen to get enough leverage. You cannot pull up from the top of the screen.

Aside from drawing apps, tablet mode apps on Windows aren't great. Kindle for Windows on tablet is terrible. I got stuck in Kindle's full screen mode and couldn't adjust the page size or exit full screen mode without reverting back to laptop mode first.

Tablet mode also throws away the standard Windows menu and shortcuts to give you a desktop of application cards. However, these cards don't adapt to recent or frequent applications. After I deleted Candy Crush and other built in apps I will never use, this desktop is blank except for Edge and Groove Music. It is incredible how bad the tablet desktop is. You have to use the full application list view every time you want to open a new program.

In summary

It's not a bad Windows machine for $1000. The body is great quality and the pen/screen interaction is solid. But I'd like to see Windows invest more in a useful tablet experience. And the detachable screen comes at the cost of being a awkward. So I'd go with the Surface Pro X or Surface Laptop 3 next time.

But above all I can't shake the expectation that a laptop built in 2020 running GMail and Slack in Chrome, Spotify, and a terminal application should last at least 10 hours.

March 13, 2020

March 11, 2020

Create a Static Application to Analyze +50M Github Events

Creating applications just using HTML, javascript and CSS is easier than ever with new frameworks and platforms like Zeit. Let see how to create a simple analytics application that queries 50M rows in real time without backend.

March 06, 2020

Writing a SQL database from scratch in Go: 1. SELECT, INSERT, CREATE and a REPL

Next in database basics: <! forgive me, for I have sinned >
2. binary expressions and WHERE filters
3. indexes
4. a database/sql driver

In this series we'll write a rudimentary database from scratch in Go. Project source code is available on Github.

In this first post we'll build enough of a parser to run some simple CREATE, INSERT, and SELECT queries. Then we'll build an in-memory backend supporting TEXT and INT types and write a basic REPL.

We'll be able to support the following interaction:

$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 |  Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil |  1 |
| Kate |  2 |
ok

The first stage will be to map a SQL source into a list of tokens (lexing). Then we'll call parse functions to find individual SQL statements (such as SELECT). These parse functions will in turn call their own helper functions to find patterns of recursively parseable chunks, keywords, symbols (like parenthesis), identifiers (like a table name), and numeric or string literals.

Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.

This post assumes a basic understanding of parsing concepts. We won't skip any code, but also won't go into great detail on why we structure the way we do.

For a simpler introduction to parsing and parsing concepts, see this post on parsing JSON.

Lexing

The lexer is responsible for finding every distinct group of characters in source code: tokens. This will consist primarily of identifiers, numbers, strings, and symbols.

What follows is a second, more orthodox pass at lexing. The first pass took a number of shortcuts and couldn't handle spaces in strings, for example.

Here is the relevant pull request in gosql if you are curious.

The gist of the logic will be to pass control to a helper function for each kind of token. If the helper function succeeds in finding a token, it will return true and the location for the lexer to start at next. It will continue doing this until it reaches the end of the source.

First off, we'll define a few types and constants for use in lexer.go:

package gosql

import (
    "fmt"
    "strings"
)

type location struct {
    line uint
    col  uint
}

type keyword string

const (
    selectKeyword keyword = "select"
    fromKeyword   keyword = "from"
    asKeyword     keyword = "as"
    tableKeyword  keyword = "table"
    createKeyword keyword = "create"
    insertKeyword keyword = "insert"
    intoKeyword   keyword = "into"
    valuesKeyword keyword = "values"
    intKeyword    keyword = "int"
    textKeyword   keyword = "text"
)

type symbol string

const (
    semicolonSymbol  symbol = ";"
    asteriskSymbol   symbol = "*"
    commaSymbol      symbol = ","
    leftparenSymbol  symbol = "("
    rightparenSymbol symbol = ")"
)

type tokenKind uint

const (
    keywordKind tokenKind = iota
    symbolKind
    identifierKind
    stringKind
    numericKind
)

type token struct {
    value string
    kind  tokenKind
    loc   location
}

type cursor struct {
    pointer uint
    loc     location
}

func (t *token) equals(other *token) bool {
    return t.value == other.value && t.kind == other.kind
}

type lexer func(string, cursor) (*token, cursor, bool)

Next we'll write out the main loop:

func lex(source string) ([]*token, error) {
    tokens := []*token{}
    cur := cursor{}

lex:
    for cur.pointer < uint(len(source)) {
        lexers := []lexer{lexKeyword, lexSymbol, lexString, lexNumeric, lexIdentifier}
        for _, l := range lexers {
            if token, newCursor, ok := l(source, cur); ok {
                cur = newCursor

                // Omit nil tokens for valid, but empty syntax like newlines
                if token != nil {
                    tokens = append(tokens, token)
                }

                continue lex
            }
        }

        hint := ""
        if len(tokens) > 0 {
            hint = " after " + tokens[len(tokens)-1].value
        }
        return nil, fmt.Errorf("Unable to lex token%s, at %d:%d", hint, cur.loc.line, cur.loc.col)
    }

    return tokens, nil
}

Then we'll write a helper for each kind of fundemental token.

Analyzing numbers

Numbers are the most complex. So we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.

func lexNumeric(source string, ic cursor) (*token, cursor, bool) {
    cur := ic

    periodFound := false
    expMarkerFound := false

    for ; cur.pointer < uint(len(source)); cur.pointer++ {
        c := source[cur.pointer]
        cur.loc.col++

        isDigit := c >= '0' && c <= '9'
        isPeriod := c == '.'
        isExpMarker := c == 'e'

        // Must start with a digit or period
        if cur.pointer == ic.pointer {
            if !isDigit && !isPeriod {
                return nil, ic, false
            }

            periodFound = isPeriod
            continue
        }

        if isPeriod {
            if periodFound {
                return nil, ic, false
            }

            periodFound = true
            continue
        }

        if isExpMarker {
            if expMarkerFound {
                return nil, ic, false
            }

            // No periods allowed after expMarker
            periodFound = true
            expMarkerFound = true

            // expMarker must be followed by digits
            if cur.pointer == uint(len(source)-1) {
                return nil, ic, false
            }

            cNext := source[cur.pointer+1]
            if cNext == '-' || cNext == '+' {
                cur.pointer++
                cur.loc.col++
            }

            continue
        }

        if !isDigit {
            break
        }
    }

    // No characters accumulated
    if cur.pointer == ic.pointer {
        return nil, ic, false
    }

    return &token{
        value: source[ic.pointer:cur.pointer],
        loc:   ic.loc,
        kind:  numericKind,
    }, cur, true
}

Analyzing strings

Strings must start and end with a single apostrophe. They can contain a single apostophe if it is followed by another single apostrophe. We'll put this kind of character delimited lexing logic into a helper function so we can use it again when analyzing identifiers.

func lexCharacterDelimited(source string, ic cursor, delimiter byte) (*token, cursor, bool) {
    cur := ic

    if len(source[cur.pointer:]) == 0 {
        return nil, ic, false
    }

    if source[cur.pointer] != delimiter {
        return nil, ic, false
    }

    cur.loc.col++
    cur.pointer++

    var value []byte
    for ; cur.pointer < uint(len(source)); cur.pointer++ {
        c := source[cur.pointer]

        if c == delimiter {
            // SQL escapes are via double characters, not backslash.
            if cur.pointer+1 >= uint(len(source)) || source[cur.pointer+1] != delimiter {
                return &token{
                    value: string(value),
                    loc:   ic.loc,
                    kind:  stringKind,
                }, cur, true
            } else {
                value = append(value, delimiter)
                cur.pointer++
                cur.loc.col++
            }
        }

        value = append(value, c)
        cur.loc.col++
    }

    return nil, ic, false
}

func lexString(source string, ic cursor) (*token, cursor, bool) {
    return lexCharacterDelimited(source, ic, '\'')
}

Analyzing symbols and keywords

Symbols come from a fixed set of strings, so they're easy to compare against. Whitespace should be thrown away.

func lexSymbol(source string, ic cursor) (*token, cursor, bool) {
    c := source[ic.pointer]
    cur := ic
    // Will get overwritten later if not an ignored syntax
    cur.pointer++
    cur.loc.col++

    switch c {
    // Syntax that should be thrown away
    case '\n':
        cur.loc.line++
        cur.loc.col = 0
        fallthrough
    case '\t':
        fallthrough
    case ' ':
        return nil, cur, true
    }

    // Syntax that should be kept
    symbols := []symbol{
        commaSymbol,
        leftParenSymbol,
        rightParenSymbol,
        semicolonSymbol,
        asteriskSymbol,
    }

    var options []string
    for _, s := range symbols {
        options = append(options, string(s))
    }

    // Use `ic`, not `cur`
    match := longestMatch(source, ic, options)
    // Unknown character
    if match == "" {
        return nil, ic, false
    }

    cur.pointer = ic.pointer + uint(len(match))
    cur.loc.col = ic.loc.col + uint(len(match))

    return &token{
        value: match,
        loc:   ic.loc,
        kind:  symbolKind,
    }, cur, true
}

Keywords are even simpler, and use the same longestMatch helper.

func lexKeyword(source string, ic cursor) (*token, cursor, bool) {
    cur := ic
    keywords := []keyword{
        selectKeyword,
        insertKeyword,
        valuesKeyword,
        tableKeyword,
        createKeyword,
        whereKeyword,
        fromKeyword,
        intoKeyword,
        textKeyword,
    }

    var options []string
    for _, k := range keywords {
        options = append(options, string(k))
    }

    match := longestMatch(source, ic, options)
    if match == "" {
        return nil, ic, false
    }

    cur.pointer = ic.pointer + uint(len(match))
    cur.loc.col = ic.loc.col + uint(len(match))

    return &token{
        value: match,
        kind:  kind,
        loc:   ic.loc,
    }, cur, true
}

And finally we implement the longestMatch helper:

// longestMatch iterates through a source string starting at the given
// cursor to find the longest matching substring among the provided
// options
func longestMatch(source string, ic cursor, options []string) string {
    var value []byte
    var skipList []int
    var match string

    cur := ic

    for cur.pointer < uint(len(source)) {

        value = append(value, strings.ToLower(string(source[cur.pointer]))...)
        cur.pointer++

    match:
        for i, option := range options {
            for _, skip := range skipList {
                if i == skip {
                    continue match
                }
            }

            // Deal with cases like INT vs INTO
            if option == string(value) {
                skipList = append(skipList, i)
                if len(option) > len(match) {
                    match = option
                }

                continue
            }

            sharesPrefix := string(value) == option[:cur.pointer-ic.pointer]
            tooLong := len(value) > len(option)
            if tooLong || !sharesPrefix {
                skipList = append(skipList, i)
            }
        }

        if len(skipList) == len(options) {
            break
        }
    }

    return match
}

Analyzing identifiers

An identifier is either a double-quoted string or a group of characters starting with an alphabetical character and possibly containing numbers and underscores.

func lexIdentifier(source string, ic cursor) (*token, cursor, bool) {
    // Handle separately if is a double-quoted identifier
    if token, newCursor, ok := lexCharacterDelimited(source, ic, '"'); ok {
        return token, newCursor, true
    }

    cur := ic

    c := source[cur.pointer]
    // Other characters count too, big ignoring non-ascii for now
    isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
    if !isAlphabetical {
        return nil, ic, false
    }
    cur.pointer++
    cur.loc.col++

    value := []byte{c}
    for ; cur.pointer < uint(len(source)); cur.pointer++ {
        c = source[cur.pointer]

        // Other characters count too, big ignoring non-ascii for now
        isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        isNumeric := c >= '0' && c <= '9'
        if isAlphabetical || isNumeric || c == '$' || c == '_' {
            value = append(value, c)
            cur.loc.col++
            continue
        }

        break
    }

    if len(value) == 0 {
        return nil, ic, false
    }

    return &token{
        // Unquoted dentifiers are case-insensitive
        value: strings.ToLower(string(value)),
        loc:   ic.loc,
        kind:  identifierKind,
    }, cur, true
}

And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.

AST model

At the highest level, an AST is a collection of statements:

package main

type Ast struct {
    Statements []*Statement
}

A statement, for now, is one of INSERT, CREATE, or SELECT:

type AstKind uint

const (
    SelectKind AstKind = iota
    CreateTableKind
    InsertKind
)

type Statement struct {
    SelectStatement      *SelectStatement
    CreateTableStatement *CreateTableStatement
    InsertStatement      *InsertStatement
    Kind                 AstKind
}

INSERT

An insert statement, for now, has a table name and a list of values to insert:

type InsertStatement struct {
    table  token
    values *[]*expression
}

An expression is a literal token or (in the future) a function call or inline operation:

type expressionKind uint

const (
    literalKind expressionKind = iota
)

type expression struct {
    literal *token
    kind    expressionKind
}

CREATE

A create statement, for now, has a table name and a list of column names and types:

type columnDefinition struct {
    name     token
    datatype token
}

type CreateTableStatement struct {
    name token
    cols *[]*columnDefinition
}

SELECT

A select statement, for now, has a table name and a list of column names:

type SelectStatement struct {
    item []*expression
    from token
}

And that's it for the AST.

Parsing

The Parse entrypoint will take a list of tokens and attempt to parse statements, separated by a semi-colon, until it reaches the last token.

In general our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.

package main

import (
    "errors"
    "fmt"
)

func tokenFromKeyword(k keyword) token {
    return token{
        kind:  keywordKind,
        value: string(k),
    }
}

func tokenFromSymbol(s symbol) token {
    return token{
        kind:  symbolKind,
        value: string(s),
    }
}

func expectToken(tokens []*token, cursor uint, t token) bool {
    if cursor >= uint(len(tokens)) {
        return false
    }

    return t.equals(tokens[cursor])
}

func helpMessage(tokens []*token, cursor uint, msg string) {
    var c *token
    if cursor < uint(len(tokens)) {
        c = tokens[cursor]
    } else {
        c = tokens[cursor-1]
    }

    fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}

func Parse(source string) (*Ast, error) {
    tokens, err := lex(source)
    if err != nil {
        return nil, err
    }

    a := Ast{}
    cursor := uint(0)
    for cursor < uint(len(tokens)) {
        stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected statement")
            return nil, errors.New("Failed to parse, expected statement")
        }
        cursor = newCursor

        a.Statements = append(a.Statements, stmt)

        atLeastOneSemicolon := false
        for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
            cursor++
            atLeastOneSemicolon = true
        }

        if !atLeastOneSemicolon {
            helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
            return nil, errors.New("Missing semi-colon between statements")
        }
    }

    return &a, nil
}

Parsing statements

Each statement will be one of INSERT, CREATE, or SELECT. The parseStatement helper will call a helper on each of these statement types and return true if one of them succeeds in parsing.

func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
    cursor := initialCursor

    // Look for a SELECT statement
    semicolonToken := tokenFromSymbol(semicolonSymbol)
    slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            SelectKind,
            SelectStatement: slct,
        }, newCursor, true
    }

    // Look for a INSERT statement
    inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            InsertKind,
            InsertStatement: inst,
        }, newCursor, true
    }

    // Look for a CREATE statement
    crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            
                                    
                                    
                                

February 28, 2020

February 21, 2020