a curated list of database news from authoritative sources

March 06, 2020

Writing a SQL database from scratch in Go: 1. SELECT, INSERT, CREATE and a REPL

Next in database basics:
2. binary expressions and WHERE filters
3. indexes
4. a database/sql driver

In this series we'll write a rudimentary database from scratch in Go. Project source code is available on GitHub.

In this first post we'll build enough of a parser to run some simple CREATE, INSERT, and SELECT queries. Then we'll build an in-memory backend supporting TEXT and INT types and write a basic REPL.

We'll be able to support the following interaction:

$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 |  Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil |  1 |
| Kate |  2 |
ok

The first stage will be to map a SQL source into a list of tokens (lexing). Then we'll call parse functions to find individual SQL statements (such as SELECT). These parse functions will in turn call their own helper functions to find patterns of recursively parseable chunks, keywords, symbols (like parentheses), identifiers (like a table name), and numeric or string literals.

Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.

This post assumes a basic understanding of parsing concepts. We won't skip any code, but we also won't go into great detail on why we structure things the way we do.

For a simpler introduction to parsing and parsing concepts, see this post on parsing JSON.

Lexing

The lexer is responsible for finding every distinct group of characters in source code: tokens. This will consist primarily of identifiers, numbers, strings, and symbols.

What follows is a second, more orthodox pass at lexing. The first pass took a number of shortcuts and couldn't handle spaces in strings, for example.

Here is the relevant pull request in gosql if you are curious.

The gist of the logic will be to pass control to a helper function for each kind of token. If the helper function succeeds in finding a token, it will return true and the location for the lexer to start at next. It will continue doing this until it reaches the end of the source.

First off, we'll define a few types and constants for use in lexer.go:

package main

import (
    "fmt"
    "strings"
)

type location struct {
    line uint
    col  uint
}

type keyword string

const (
    selectKeyword keyword = "select"
    fromKeyword   keyword = "from"
    asKeyword     keyword = "as"
    tableKeyword  keyword = "table"
    createKeyword keyword = "create"
    insertKeyword keyword = "insert"
    intoKeyword   keyword = "into"
    valuesKeyword keyword = "values"
    intKeyword    keyword = "int"
    textKeyword   keyword = "text"
)

type symbol string

const (
    semicolonSymbol  symbol = ";"
    asteriskSymbol   symbol = "*"
    commaSymbol      symbol = ","
    leftparenSymbol  symbol = "("
    rightparenSymbol symbol = ")"
)

type tokenKind uint

const (
    keywordKind tokenKind = iota
    symbolKind
    identifierKind
    stringKind
    numericKind
)

type token struct {
    value string
    kind  tokenKind
    loc   location
}

type cursor struct {
    pointer uint
    loc     location
}

func (t *token) equals(other *token) bool {
    return t.value == other.value && t.kind == other.kind
}

type lexer func(string, cursor) (*token, cursor, bool)

Next we'll write out the main loop:

func lex(source string) ([]*token, error) {
    tokens := []*token{}
    cur := cursor{}

lex:
    for cur.pointer < uint(len(source)) {
        lexers := []lexer{lexKeyword, lexSymbol, lexString, lexNumeric, lexIdentifier}
        for _, l := range lexers {
            if token, newCursor, ok := l(source, cur); ok {
                cur = newCursor

                // Omit nil tokens for valid, but empty syntax like newlines
                if token != nil {
                    tokens = append(tokens, token)
                }

                continue lex
            }
        }

        hint := ""
        if len(tokens) > 0 {
            hint = " after " + tokens[len(tokens)-1].value
        }
        return nil, fmt.Errorf("Unable to lex token%s, at %d:%d", hint, cur.loc.line, cur.loc.col)
    }

    return tokens, nil
}
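
Once the helpers below are in place, a quick sanity check of the main loop might look like this (a sketch, not from the original post):

tokens, err := lex("select name from users;")
if err != nil {
    panic(err)
}

for _, t := range tokens {
    fmt.Println(t.kind, t.value)
}
// Expected tokens, in order: the keyword "select", the identifier "name",
// the keyword "from", the identifier "users", and the symbol ";".
// The whitespace in between produces no tokens because lexSymbol drops it.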

Then we'll write a helper for each kind of fundamental token.

Analyzing numbers

Numbers are the most complex tokens to lex, so we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.

func lexNumeric(source string, ic cursor) (*token, cursor, bool) {
    cur := ic

    periodFound := false
    expMarkerFound := false

    for ; cur.pointer < uint(len(source)); cur.pointer++ {
        c := source[cur.pointer]
        cur.loc.col++

        isDigit := c >= '0' && c <= '9'
        isPeriod := c == '.'
        isExpMarker := c == 'e'

        // Must start with a digit or period
        if cur.pointer == ic.pointer {
            if !isDigit && !isPeriod {
                return nil, ic, false
            }

            periodFound = isPeriod
            continue
        }

        if isPeriod {
            if periodFound {
                return nil, ic, false
            }

            periodFound = true
            continue
        }

        if isExpMarker {
            if expMarkerFound {
                return nil, ic, false
            }

            // No periods allowed after expMarker
            periodFound = true
            expMarkerFound = true

            // expMarker must be followed by digits
            if cur.pointer == uint(len(source)-1) {
                return nil, ic, false
            }

            cNext := source[cur.pointer+1]
            if cNext == '-' || cNext == '+' {
                cur.pointer++
                cur.loc.col++
            }

            continue
        }

        if !isDigit {
            break
        }
    }

    // No characters accumulated
    if cur.pointer == ic.pointer {
        return nil, ic, false
    }

    return &token{
        value: source[ic.pointer:cur.pointer],
        loc:   ic.loc,
        kind:  numericKind,
    }, cur, true
}

Analyzing strings

Strings must start and end with a single apostrophe. They can contain a single apostrophe if it is followed by another single apostrophe. We'll put this kind of character-delimited lexing logic into a helper function so we can reuse it when analyzing identifiers.

func lexCharacterDelimited(source string, ic cursor, delimiter byte) (*token, cursor, bool) {
    cur := ic

    if len(source[cur.pointer:]) == 0 {
        return nil, ic, false
    }

    if source[cur.pointer] != delimiter {
        return nil, ic, false
    }

    cur.loc.col++
    cur.pointer++

    var value []byte
    for ; cur.pointer < uint(len(source)); cur.pointer++ {
        c := source[cur.pointer]

        if c == delimiter {
            // SQL escapes are via double characters, not backslash.
            if cur.pointer+1 >= uint(len(source)) || source[cur.pointer+1] != delimiter {
                // Consume the closing delimiter before returning
                cur.pointer++
                cur.loc.col++
                return &token{
                    value: string(value),
                    loc:   ic.loc,
                    kind:  stringKind,
                }, cur, true
            }

            // This is an escaped delimiter: skip the second character here
            // and let the append below keep a single copy in the value
            cur.pointer++
            cur.loc.col++
        }

        value = append(value, c)
        cur.loc.col++
    }

    return nil, ic, false
}

func lexString(source string, ic cursor) (*token, cursor, bool) {
    return lexCharacterDelimited(source, ic, '\'')
}
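
For example (a sketch, not from the post), a doubled apostrophe inside the string collapses into a single apostrophe in the token value:

if tok, _, ok := lexString("'it''s a string'", cursor{}); ok {
    fmt.Println(tok.value) // it's a string
}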

Analyzing symbols and keywords

Symbols come from a fixed set of strings, so they're easy to compare against. Whitespace should be thrown away.

func lexSymbol(source string, ic cursor) (*token, cursor, bool) {
    c := source[ic.pointer]
    cur := ic
    // Will get overwritten later if not an ignored syntax
    cur.pointer++
    cur.loc.col++

    switch c {
    // Syntax that should be thrown away
    case '\n':
        cur.loc.line++
        cur.loc.col = 0
        fallthrough
    case '\t':
        fallthrough
    case ' ':
        return nil, cur, true
    }

    // Syntax that should be kept
    symbols := []symbol{
        commaSymbol,
        leftparenSymbol,
        rightparenSymbol,
        semicolonSymbol,
        asteriskSymbol,
    }

    var options []string
    for _, s := range symbols {
        options = append(options, string(s))
    }

    // Use `ic`, not `cur`
    match := longestMatch(source, ic, options)
    // Unknown character
    if match == "" {
        return nil, ic, false
    }

    cur.pointer = ic.pointer + uint(len(match))
    cur.loc.col = ic.loc.col + uint(len(match))

    return &token{
        value: match,
        loc:   ic.loc,
        kind:  symbolKind,
    }, cur, true
}

Keywords are even simpler, and use the same longestMatch helper.

func lexKeyword(source string, ic cursor) (*token, cursor, bool) {
    cur := ic
    keywords := []keyword{
        selectKeyword,
        insertKeyword,
        valuesKeyword,
        tableKeyword,
        createKeyword,
        fromKeyword,
        intoKeyword,
        asKeyword,
        intKeyword,
        textKeyword,
    }

    var options []string
    for _, k := range keywords {
        options = append(options, string(k))
    }

    match := longestMatch(source, ic, options)
    if match == "" {
        return nil, ic, false
    }

    cur.pointer = ic.pointer + uint(len(match))
    cur.loc.col = ic.loc.col + uint(len(match))

    return &token{
        value: match,
        kind:  keywordKind,
        loc:   ic.loc,
    }, cur, true
}

And finally we implement the longestMatch helper:

// longestMatch iterates through a source string starting at the given
// cursor to find the longest matching substring among the provided
// options
func longestMatch(source string, ic cursor, options []string) string {
    var value []byte
    var skipList []int
    var match string

    cur := ic

    for cur.pointer < uint(len(source)) {

        value = append(value, strings.ToLower(string(source[cur.pointer]))...)
        cur.pointer++

    match:
        for i, option := range options {
            for _, skip := range skipList {
                if i == skip {
                    continue match
                }
            }

            // Deal with cases like INT vs INTO
            if option == string(value) {
                skipList = append(skipList, i)
                if len(option) > len(match) {
                    match = option
                }

                continue
            }

            sharesPrefix := string(value) == option[:cur.pointer-ic.pointer]
            tooLong := len(value) > len(option)
            if tooLong || !sharesPrefix {
                skipList = append(skipList, i)
            }
        }

        if len(skipList) == len(options) {
            break
        }
    }

    return match
}
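
To make the INT vs INTO case concrete, here's a small illustration (not from the post):

// Both "int" and "into" match a prefix of the source; the longer exact match wins.
options := []string{"int", "into"}
fmt.Println(longestMatch("into users", cursor{}, options)) // into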

Analyzing identifiers

An identifier is either a double-quoted string or a group of characters starting with an alphabetical character and possibly containing numbers and underscores.

func lexIdentifier(source string, ic cursor) (*token, cursor, bool) {
    // Handle double-quoted identifiers separately
    if token, newCursor, ok := lexCharacterDelimited(source, ic, '"'); ok {
        return token, newCursor, true
    }

    cur := ic

    c := source[cur.pointer]
    // Other characters count too, but we're ignoring non-ASCII for now
    isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
    if !isAlphabetical {
        return nil, ic, false
    }
    cur.pointer++
    cur.loc.col++

    value := []byte{c}
    for ; cur.pointer < uint(len(source)); cur.pointer++ {
        c = source[cur.pointer]

        // Other characters count too, but we're ignoring non-ASCII for now
        isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        isNumeric := c >= '0' && c <= '9'
        if isAlphabetical || isNumeric || c == '$' || c == '_' {
            value = append(value, c)
            cur.loc.col++
            continue
        }

        break
    }

    if len(value) == 0 {
        return nil, ic, false
    }

    return &token{
        // Unquoted identifiers are case-insensitive
        value: strings.ToLower(string(value)),
        loc:   ic.loc,
        kind:  identifierKind,
    }, cur, true
}

And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.

AST model

At the highest level, an AST is a collection of statements:

package main

type Ast struct {
    Statements []*Statement
}

A statement, for now, is one of INSERT, CREATE, or SELECT:

type AstKind uint

const (
    SelectKind AstKind = iota
    CreateTableKind
    InsertKind
)

type Statement struct {
    SelectStatement      *SelectStatement
    CreateTableStatement *CreateTableStatement
    InsertStatement      *InsertStatement
    Kind                 AstKind
}

INSERT

An insert statement, for now, has a table name and a list of values to insert:

type InsertStatement struct {
    table  token
    values *[]*expression
}

An expression is a literal token or (in the future) a function call or inline operation:

type expressionKind uint

const (
    literalKind expressionKind = iota
)

type expression struct {
    literal *token
    kind    expressionKind
}

CREATE

A create statement, for now, has a table name and a list of column names and types:

type columnDefinition struct {
    name     token
    datatype token
}

type CreateTableStatement struct {
    name token
    cols *[]*columnDefinition
}

SELECT

A select statement, for now, has a table name and a list of column names:

type SelectStatement struct {
    item []*expression
    from token
}
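
To make the shape concrete, here's a hand-built AST for SELECT id FROM users; (an illustrative sketch, not code from the post):

ast := Ast{
    Statements: []*Statement{
        {
            Kind: SelectKind,
            SelectStatement: &SelectStatement{
                item: []*expression{
                    {
                        kind:    literalKind,
                        literal: &token{kind: identifierKind, value: "id"},
                    },
                },
                from: token{kind: identifierKind, value: "users"},
            },
        },
    },
}

The parser's job in the next sections is to build exactly this structure from the token stream.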

And that's it for the AST.

Parsing

The Parse entrypoint will take a list of tokens and attempt to parse statements, separated by a semi-colon, until it reaches the last token.

In general our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.

package main

import (
    "errors"
    "fmt"
)

func tokenFromKeyword(k keyword) token {
    return token{
        kind:  keywordKind,
        value: string(k),
    }
}

func tokenFromSymbol(s symbol) token {
    return token{
        kind:  symbolKind,
        value: string(s),
    }
}

func expectToken(tokens []*token, cursor uint, t token) bool {
    if cursor >= uint(len(tokens)) {
        return false
    }

    return t.equals(tokens[cursor])
}

func helpMessage(tokens []*token, cursor uint, msg string) {
    var c *token
    if cursor < uint(len(tokens)) {
        c = tokens[cursor]
    } else {
        c = tokens[cursor-1]
    }

    fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}

func Parse(source string) (*Ast, error) {
    tokens, err := lex(source)
    if err != nil {
        return nil, err
    }

    a := Ast{}
    cursor := uint(0)
    for cursor < uint(len(tokens)) {
        stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
        if !ok {
            helpMessage(tokens, cursor, "Expected statement")
            return nil, errors.New("Failed to parse, expected statement")
        }
        cursor = newCursor

        a.Statements = append(a.Statements, stmt)

        atLeastOneSemicolon := false
        for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
            cursor++
            atLeastOneSemicolon = true
        }

        if !atLeastOneSemicolon {
            helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
            return nil, errors.New("Missing semi-colon between statements")
        }
    }

    return &a, nil
}
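
Once the statement parsers below are written, usage looks roughly like this (a sketch, not from the post):

ast, err := Parse("SELECT id, name FROM users;")
if err != nil {
    panic(err)
}

fmt.Println(len(ast.Statements))                  // 1
fmt.Println(ast.Statements[0].Kind == SelectKind) // true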

Parsing statements

Each statement will be one of INSERT, CREATE, or SELECT. The parseStatement helper will call a helper on each of these statement types and return true if one of them succeeds in parsing.

func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
    cursor := initialCursor

    // Look for a SELECT statement
    semicolonToken := tokenFromSymbol(semicolonSymbol)
    slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            SelectKind,
            SelectStatement: slct,
        }, newCursor, true
    }

    // Look for an INSERT statement
    inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:            InsertKind,
            InsertStatement: inst,
        }, newCursor, true
    }

    // Look for a CREATE statement
    crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
    if ok {
        return &Statement{
            Kind:                 CreateTableKind,
            CreateTableStatement: crtTbl,
        }, newCursor, true
    }

    return nil, initialCursor, false
}

February 18, 2020

The state of Orchestrator, 2020 (spoiler: healthy)

This post serves as a pointer to my previous announcement about The state of Orchestrator, 2020. Thank you to Tom Krouper who applied his operational engineer expertise to content publishing problems.

The state of Orchestrator, 2020 (spoiler: healthy)

Yesterday was my last day at GitHub, and this post explains what this means for orchestrator. First, a quick historical review: 2014: I began work on orchestrator at Outbrain, as https://github.com/outbrain/orchestrator. I authored several open source projects while working for Outbrain, and created orchestrator to solve discovery, visualization and simple refactoring needs. Outbrain was happy […]

February 04, 2020

Announcing the Release of Vitess 5.0

On behalf of the Vitess maintainers team, I am pleased to announce the immediate release of Vitess 5.0. Vitess 5.0 is largely a clean-up release, where we have both deprecated and removed some minor features, and improved both the installation experience and the health of the Vitess code-base. When upgrading from a previous release of Vitess, we recommend reading the Incompatible Changes highlighted in the release notes for 5.0 as well as for Beta 0 and Beta 1.

February 01, 2020

A minimal REST API in Java

There's a style of Java that is a joy to write. This post will cover how to set up a basic PostgreSQL-integrated REST API using Jersey and JOOQ in a style not dissimilar to Flask and SQLAlchemy in Python.

In particular, we'll try to avoid as much runtime reflection/class-loading as possible. This will make the application less flexible but easier to debug and understand.

I'd appreciate pointers in email if you see anything weird or can fix any of my bugs.

Dependencies

Install Maven, a recent JDK, and PostgreSQL.

Copy the following into pom.xml to tell Maven about Java dependencies:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>api</groupId>
  <artifactId>api</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>13</maven.compiler.source>
    <maven.compiler.target>13</maven.compiler.target>
  </properties>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <compilerArgs>
            <arg>-Xlint:all,-options,-path</arg>
          </compilerArgs>
        </configuration>
      </plugin>

      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.6.0</version>
        <configuration>
          <mainClass>api.Main</mainClass>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.glassfish.jersey.containers</groupId>
      <artifactId>jersey-container-jetty-http</artifactId>
      <version>2.30</version>
    </dependency>

    <dependency>
      <groupId>org.jooq</groupId>
      <artifactId>jooq</artifactId>
      <version>3.12.3</version>
    </dependency>

    <dependency>
      <groupId>org.jooq</groupId>
      <artifactId>jooq-meta</artifactId>
      <version>3.12.3</version>
    </dependency>

    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>42.2.9</version>
    </dependency>

    <dependency>
      <groupId>org.glassfish.jersey.inject</groupId>
      <artifactId>jersey-hk2</artifactId>
      <version>2.30</version>
    </dependency>

    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-core</artifactId>
      <version>1.2.3</version>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.30</version>
    </dependency>

    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>1.2.3</version>
    </dependency>

    <dependency>
      <groupId>org.glassfish.jersey.media</groupId>
      <artifactId>jersey-media-json-jackson</artifactId>
      <version>2.30</version>
    </dependency>

    <dependency>
      <groupId>javax.persistence</groupId>
      <artifactId>javax.persistence-api</artifactId>
      <version>2.2</version>
    </dependency>

    <dependency>
      <groupId>org.projectlombok</groupId>
      <artifactId>lombok</artifactId>
      <version>1.18.10</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>com.fasterxml.jackson</groupId>
      <artifactId>jackson-bom</artifactId>
      <version>2.10.2</version>
      <type>pom</type>
    </dependency>
  </dependencies>
</project>

Now run mvn install to download and configure all dependencies.

Project setup

The Main class will be our entrypoint within src/main/java/api/Main.java.

It will handle loading configuration, setting up the application server, and starting it.

package api;

import api.app.Application;
import api.app.Config;

public class Main {
  public static void main(String[] args) {
    try {
      var cfg = new Config();
      var server = new Application(cfg);
      server.start();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The Config class in src/main/java/api/app/Config.java will contain a few hard-coded settings for now. In the future it could be read from a file.

package api.app;

public final class Config {
  public final String server_address = "http://localhost";
  public final int server_port = 7780;

  public final String db_connection = "jdbc:postgresql://localhost/todo";
  public final String db_username = "todo";
  public final String db_password = "todo";
}

And finally the Application class in src/main/java/api/app/Application.java will handle creating the PostgreSQL connection, registering the class path where Jersey should look for routes/controllers, registering the PostgreSQL connection with the dependency injection container, and starting the Jersey server.

package api.app;

import javax.ws.rs.core.UriBuilder;

import org.glassfish.jersey.internal.inject.AbstractBinder;
import org.glassfish.jersey.jetty.JettyHttpContainerFactory;
import org.glassfish.jersey.server.ResourceConfig;
import org.slf4j.LoggerFactory;

import api.dao.Dao;
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

public class Application {
  private static final Logger logger = (Logger) LoggerFactory.getLogger(Application.class);
  private static final Logger rootLogger = (Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
  static {
    rootLogger.setLevel(Level.INFO);
  }

  Config cfg;

  public Application(final Config _cfg) {
    cfg = _cfg;
  }

  private void addShutdownHook(final Runnable hook) {
    Runtime.getRuntime().addShutdownHook(new Thread(hook));
  }

  public void start() {
    var dao = new Dao(cfg.db_connection, cfg.db_username, cfg.db_password);
    try {
      dao.initialize();
    } catch (Exception e) {
      e.printStackTrace();
      return;
    }
    addShutdownHook(() -> {
      try {
        dao.close();
      } catch (java.sql.SQLException e) {
        e.printStackTrace();
      }
    });

    var resourceConfig = new ResourceConfig();
    resourceConfig.packages("api.controller");
    resourceConfig.register(new AbstractBinder() {
      @Override
      protected void configure() {
        bind(dao).to(Dao.class);
      }
    });

    var baseUri = UriBuilder.fromUri(cfg.server_address).port(cfg.server_port).build();
    var server = JettyHttpContainerFactory.createServer(baseUri, resourceConfig);
    logger.info("Started listening on {}:{}", cfg.server_address, cfg.server_port);
  }
}

I couldn't figure out a reasonable way to avoid the class path registration for routes.

It's also important to note that the AbstractBinder appears to search the class path implicitly for an available dependency injection provider. I'd rather specify it explicitly, but I'm not sure how. It succeeds because we installed HK2 as a dependency (see pom.xml).

With the Application code finished, we'll need to build out the referenced Dao and controller classes.

Dao

The Dao class in src/main/java/api/dao/Dao.java will enclose the connection to PostgreSQL via JOOQ.

package api.dao;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class Dao {
  private Connection conn;
  private String url;
  private String username;
  private String password;

  public Dao(final String _url, final String _username, final String _password) {
    url = _url;
    username = _username;
    password = _password;
  }

  public void initialize() throws SQLException {
    conn = DriverManager.getConnection(url, username, password);
  }

  public void close() throws SQLException {
    conn.close();
  }

  public DSLContext getDSLContext() {
    return DSL.using(conn, SQLDialect.POSTGRES);
  }
}

And this will be enough to use in our controller. But let's take a moment to talk about the data model.

Data

This API will return results from a TODO list. The database should store each TODO item and a timestamp of completion, if completed.

We'll start by creating a database and user for the application:

$ sudo su postgres
postgres $ psql
postgres=# CREATE DATABASE todo;
postgres=# CREATE USER todo WITH PASSWORD 'todo';
postgres=# GRANT ALL ON DATABASE todo TO todo;

Then we'll write an initial migration:

$ cat migrations/1_init.sql
CREATE TABLE todo_item (
  id BIGSERIAL NOT NULL PRIMARY KEY,
  item TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

And a helper script for running migrations:

$ cat scripts/migrate.sh
#!/usr/bin/env bash

set -e

export PGPASSWORD=todo

for file in $(ls migrations); do
    echo "Running migration: $file"
    psql -U todo -f "migrations/$file"
done

Run it:

$ chmod +x ./scripts/migrate.sh
$ ./scripts/migrate.sh
Running migration: 1_init.sql
CREATE TABLE

And let's add some data:

$ sudo su postgres
postgres $ psql -U todo
todo=# INSERT INTO todo_item (item) VALUES ('My note');

Now we're ready to model the data in Java.

Models

While it's possible to have JOOQ generate Java data classes (or POJOs) by reading the database schema, the generated class cannot be directly serialized to a JSON string.

So for each table (there's only one) we'll write a class with fields for each column. We'll use the Java Persistence API (JPA) to annotate the class and fields so JOOQ will know how to deserialize query results into an instance of the model.

We'll use Lombok's @Data annotation on the class so that getter and setter methods are generated automatically for each private field. And we'll use a Jackson annotation to set the JSON field name for each column.

This is the TodoItem class in src/main/java/api/model/TodoItem.java:

package api.model;

import java.time.OffsetDateTime;

import javax.persistence.Column;
import javax.persistence.Id;
import javax.persistence.Table;

import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonProperty;

import lombok.Data;

@Data
@Table(name = "todo_item")
public class TodoItem {
  @Column(name = "id")
  @JsonProperty("id")
  @Id
  private long id;

  @Column(name = "name")
  @JsonProperty("name")
  private String name;

  @Column(name = "created_at")
  @JsonProperty("createdAt")
  @JsonFormat(pattern = "yyyy-MM-dd'T'HH:mm:ssZ")
  private OffsetDateTime createdAt;

  @Column(name = "completed_at")
  @JsonProperty("completedAt")
  @JsonFormat(pattern = "yyyy-MM-dd'T'HH:mm:ssZ")
  private OffsetDateTime completedAt;
}

The JSON format specifications for the timestamp fields aren't actually working. The formatted JSON returns a giant object and I haven't figured out how to get it to serialize to the RFC3339 string yet.

We're set! The last step is a simple controller to return a list of TODO items.

The /items controller

In the ItemsController class in src/main/java/api/controller/ItemsController.java we'll inject the Dao object and use it to return a page of TODO items as JSON.

package api.controller;

import java.util.List;

import javax.inject.Inject;
import javax.persistence.Table;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import org.jooq.DSLContext;

import api.dao.Dao;
import api.model.TodoItem;

@Path("items")
public class ItemsController {
  @Inject
  Dao dao;

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public List<TodoItem> getServers() {
    DSLContext dslCtx = dao.getDSLContext();
    Table table = TodoItem.class.getAnnotation(Table.class);
    return dslCtx.select().from(table.name()).fetch().into(TodoItem.class);
  }
}

There's some more implicit magic here when we return a list of TodoItems. Since we marked the endpoint as producing JSON, and since Jackson is in our class path, Jersey will automatically use Jackson to serialize the list to JSON.

The API is quite nice but I could do without the automatic class-loading magic.

Now we're ready to build, run and test.

Building and running

$ mvn clean compile
[INFO] Scanning for projects...
[INFO] 
[INFO] ------------------------------< api:api >-------------------------------
[INFO] Building api 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ api ---
[INFO] Deleting /Users/philipeaton/tmp/test/target
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ api ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/philipeaton/tmp/test/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ api ---
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 6 source files to /Users/philipeaton/tmp/test/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.198 s
[INFO] Finished at: 2020-02-01T17:07:14-05:00
[INFO] ------------------------------------------------------------------------
$ mvn exec:java
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------< api:api >-------------------------------
[INFO] Building api 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ api ---
17:06:53.793 [api.Main.main()] INFO org.eclipse.jetty.util.log - Logging initialized @2017ms to org.eclipse.jetty.util.log.Slf4jLog
17:06:54.378 [api.Main.main()] INFO org.eclipse.jetty.server.Server - jetty-9.4.17.v20190418; built: 2019-04-18T19:45:35.259Z; git: aa1c656c315c011c01e7b21aabb04066635b9f67; jvm 13+33
17:06:54.425 [api.Main.main()] INFO org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@3943a159{HTTP/1.1,[http/1.1]}{0.0.0.0:7780}
17:06:54.425 [api.Main.main()] INFO org.eclipse.jetty.server.Server - Started @2651ms
17:06:54.425 [api.Main.main()] INFO api.app.Application - Started listening on http://localhost:7780

In a new terminal curl the endpoint:

$ curl localhost:7780/items | jq
[
  {
    "id": 1,
    "name": null,
    "createdAt": {
      "offset": {
        "totalSeconds": -18000,
        "id": "-05:00",
        "rules": {
          "transitions": [],
          "transitionRules": [],
          "fixedOffset": true
        }
      },
      "dayOfWeek": "SATURDAY",
      "dayOfYear": 32,
      "nano": 594440000,
      "year": 2020,
      "monthValue": 2,
      "dayOfMonth": 1,
      "hour": 17,
      "minute": 8,
      "second": 0,
      "month": "FEBRUARY"
    },
    "completedAt": null
  }
]

And we're done!

January 30, 2020

All hash table sizes you will ever need


When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset of bits is ok.
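
As a small illustration (not from the original post, and in Go rather than the post's C++), the power-of-2 case reduces the modulo to a mask:

// bucketFor maps a hash to a bucket index when size is a power of two:
// the modulo reduces to a bit-wise AND.
func bucketFor(hash uint64, size uint64) uint64 {
    return hash & (size - 1) // same result as hash % size, but much cheaper
}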

Prime numbers are more forgiving concerning the hash function, and we have more choices concerning the size, which leads to less overhead. But using a prime number requires a modulo computation, which is expensive. And we have to find a suitable prime number at runtime, which is not that simple either.

Fortunately we can solve both problems simultaneously, which is what this blog post is about. We can tackle the problem of finding prime numbers by pre-computing suitable numbers with a given maximum distance. For example, when only considering prime numbers that are at least 5% away from each other, we can cover the whole space from 0 to 2^64 with just 841 prime numbers. We can solve the performance problem by pre-computing the magic numbers from Hacker's Delight for each prime number in our list, which allows us to use multiplications instead of expensive modulo computations. And we can skip prime numbers with unpleasant magic numbers (i.e., the ones that require an additional add fixup), preferring the next cheap prime number instead.

The resulting code can be found here. It contains every prime number you will ever need for hash tables, covering the whole 64bit address space. Usage is very simple, we just ask for a prime number and then perform modulo operations as needed:


class HashTable {
   primes::Prime prime;
   vector<Entry*> table;
public:
   HashTable(uint64_t size) {
      prime = primes::Prime::pick(size);
      table.resize(prime.get());
   }
   ...
   Entry* getEntry(uint64_t hash) { return table[prime.mod(hash)]; }
   ...
};

The performance is quite good. On an AMD 1950X, computing the modulo for 10M values (and computing the sum of the results) takes about 4.7ns per value when using a plain (x%p), but only 0.63ns per value when using p.mod(x).

Getting this into unordered_map would be useful; it would probably improve the performance quite significantly when we have few cache misses.