a curated list of database news from authoritative sources

November 22, 2022

Introducing VDiff V2

Vitess is a solution that allows you to scale MySQL horizontally while providing clients and apps with a single logical view of the fleet of MySQL instances comprising any number of Keyspaces and Shards. Vitess also provides the cluster and data management tools that make it possible to manage a massive cluster and perform complex workflows using VReplication, such as:

  • Moving tables into Vitess or between keyspaces
  • Resharding to adjust to changes in data size and load
  • Materialized views and rollups for data analytics and data locality
  • Online schema changes that are trackable, cancellable, revertible, and retryable

November 20, 2022

MySQL IOPS for Reads and Surprises

When you think about IOPS, you probably think about writes because MySQL write I/O has a long tradition of optimization, benchmarking, new algorithms, new storage engines, and so forth. There’s no shortage of material on MySQL write I/O; just two examples from Percona are Scaling IO-Bound Workloads for MySQL in the Cloud and Tuning MySQL/InnoDB Flushing for a Write-Intensive Workload. But in this short blog post I highlight two other, less common aspects of MySQL I/O: reads and surprises.

November 18, 2022

PlanetScale and HIPAA

PlanetScale can now enter into Business Associate Agreements (BAA) with customers on an Enterprise plan.

November 17, 2022

November 13, 2022

Writing a SQL database, take two: Zig and RocksDB

For my second project while learning Zig, I decided to port an old, minimal SQL database project from Go to Zig.

In this post, in ~1700 lines of code (yes, I'm sorry it's bigger than my usual), we'll create a basic embedded SQL database in Zig on top of RocksDB. Other than the RocksDB layer, it will not use third-party libraries.

The code for this project is available on GitHub.

Here are a few example interactions we'll support:

$ ./main --database data --script <(echo "CREATE TABLE y (year int, age int, name text)")
echo "CREATE TABLE y (year int, age int, name text)"
ok
$ ./main --database data --script <(echo "INSERT INTO y VALUES (2010, 38, 'Gary')")
echo "INSERT INTO y VALUES (2010, 38, 'Gary')"
ok
$ ./main --database data --script <(echo "INSERT INTO y VALUES (2021, 92, 'Teej')")
echo "INSERT INTO y VALUES (2021, 92, 'Teej')"
ok
$ ./main --database data --script <(echo "INSERT INTO y VALUES (1994, 18, 'Mel')")
echo "INSERT INTO y VALUES (1994, 18, 'Mel')"
ok

# Basic query
$ ./main --database data --script <(echo "SELECT name, age, year FROM y")
echo "SELECT name, age, year FROM y"
| name          |age            |year           |
+ ====          +===            +====           +
| Mel           |18             |1994           |
| Gary          |38             |2010           |
| Teej          |92             |2021           |

# With WHERE
$ ./main --database data --script <(echo "SELECT name, year, age FROM y WHERE age < 40")
echo "SELECT name, year, age FROM y WHERE age < 40"
| name          |year           |age            |
+ ====          +====           +===            +
| Mel           |1994           |18             |
| Gary          |2010           |38             |

# With operations
$ ./main --database data --script <(echo "SELECT 'Name: ' || name, year + 30, age FROM y WHERE age < 40")
echo "SELECT 'Name: ' || name, year + 30, age FROM y WHERE age < 40"
| unknown               |unknown                |age            |
+ =======               +=======                +===            +
| Name: Mel             |2024           |18             |
| Name: Gary            |2040           |38             |

This post is standalone (except for the RocksDB layer which I wrote about here) but it builds on a number of ideas I've explored that you may be interested in:

This project is mostly a port of my SQL database from scratch in Go project, but unlike that series this project will have persistent storage via RocksDB.

And unlike that post, this project is written in Zig!

Let's get started. :)

Components

We're going to split up the project into the following major components:

  • Lexing
  • Parsing
  • Storage
    • RocksDB
  • Execution
  • Entrypoint (main)

Lexing takes a query and breaks it into an array of tokens.

Parsing takes the lexed array of tokens and pattern-matches it into a syntax tree (AST).

Storage maps high-level SQL entities like tables and rows into bytes that can be easily stored on disk. And it handles recovering high-level tables and rows from bytes on disk.

Invisible to users of the Storage component is RocksDB, which is how the bytes are actually stored on disk. RocksDB is a persistent store that maps arbitrary byte keys to arbitrary byte values. We'll use it for storing and recovering both table metadata and actual row data.
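
As a purely hypothetical illustration (this is not the actual encoding, which comes later in the Storage section), a table and its rows might map onto RocksDB keys and values roughly like:

tbl_y          -> year,int;age,int;name,text    (table metadata)
row_y_<row id> -> 2010;38;Gary                  (one serialized row)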

Execution takes a query AST and executes it against Storage, potentially returning result rows.

These terms are a vast simplification of real-world database design. But they provide helpful structure even in a project this small.

Memory Management

Zig doesn't have a garbage collector. Mitchell Hashimoto wrote bindings to Boehm GC. But Zig also has a builtin Arena allocator, which is perfect for this simple project.

The main function will create the arena and pass it to each component, where they can do allocations as they please. At the end of main, the entire arena will be freed at once.
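
A minimal sketch of that setup (assuming a ~0.10-era Zig standard library, which is roughly what the rest of the post's code is written against):

const std = @import("std");

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // This one call frees everything ever allocated from the arena.
    defer arena.deinit();

    const allocator = arena.allocator();

    // Components receive `allocator` and never free individual allocations.
    var tokens = std.ArrayList(u8).init(allocator);
    try tokens.append('x');
}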

The only other place where we must do manual memory management is in the RocksDB wrapper. But I've already covered that in a separate post.

Zig Specifics

I'm not going to cover the basics of Zig syntax. If you are new to Zig, read this first! (It's short.)

Now that we've got the basic idea, we can start coding!

Types (types.zig, 10 LoC)

Let's create a few helper types that we'll use in the rest of the code.

pub const String = []const u8;

pub const Error = String;

pub fn Result(comptime T: type) type {
    return union(enum) {
        val: T,
        err: Error,
    };
}

That's it. :) Makes things a little more readable.
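
To see how this gets used, here's a tiny sketch of a fallible function returning Result (divide is just a made-up example, not part of the project):

fn divide(a: i64, b: i64) Result(i64) {
    if (b == 0) return .{ .err = "Division by zero" };
    return .{ .val = @divTrunc(a, b) };
}

// Callers switch on the union:
// switch (divide(10, 2)) {
//     .val => |n| std.debug.print("Got {}\n", .{n}),
//     .err => |e| std.debug.print("Error: {s}\n", .{e}),
// }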

Lexing (lex.zig, 308 LoC)

Lexing turns a query string into an array of tokens.

There are a few kinds of tokens we'll define:

  • Keywords (like CREATE, true, false, null)
  • Syntax (commas, parentheses, operators, and all other builtin symbols)
  • Strings
  • Integers
  • Identifiers

And not listed there but important to skip past is whitespace.
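
So a query like SELECT age FROM y WHERE age < 40 should lex into a token stream along these lines:

select_keyword  identifier(age)  from_keyword  identifier(y)
where_keyword   identifier(age)  lt_operator   integer(40)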

Let's turn this into a Zig struct!

const std = @import("std");

const Error = @import("types.zig").Error;
const String = @import("types.zig").String;

pub const Token = struct {
    start: u64,
    end: u64,
    kind: Kind,
    source: String,

    pub const Kind = enum {
        // Keywords
        select_keyword,
        create_table_keyword,
        insert_keyword,
        values_keyword,
        from_keyword,
        where_keyword,

        // Operators
        plus_operator,
        equal_operator,
        lt_operator,
        concat_operator,

        // Other syntax
        left_paren_syntax,
        right_paren_syntax,
        comma_syntax,

        // Literals
        identifier,
        integer,
        string,
    };

    pub fn string(self: Token) String {
        return self.source[self.start..self.end];
    }

Using an enum helps us with type safety. And since we're storing location in the token, we can build a nice debug function for when lexing or parsing fails.

    fn debug(self: Token, msg: String) void {
        var line: usize = 0;
        var column: usize = 0;
        var lineStartIndex: usize = 0;
        var lineEndIndex: usize = 0;
        var i: usize = 0;
        var source = self.source;
        while (i < source.len) {
            if (source[i] == '\n') {
                line = line + 1;
                column = 0;
                lineStartIndex = i;
            } else {
                column = column + 1;
            }

            if (i == self.start) {
                // Find the end of the line
                lineEndIndex = i;
                while (source[lineEndIndex] != '\n') {
                    lineEndIndex = lineEndIndex + 1;
                }
                break;
            }

            i = i + 1;
        }

        std.debug.print(
            "{s}\nNear line {}, column {}.\n{s}\n",
            .{ msg, line + 1, column, source[lineStartIndex..lineEndIndex] },
        );
        while (column - 1 > 0) {
            std.debug.print(" ", .{});
            column = column - 1;
        }
        std.debug.print("^ Near here\n\n", .{});
    }
};

And similarly, let's add a debug helper for when we're dealing with an array of tokens.

pub fn debug(tokens: []Token, preferredIndex: usize, msg: String) void {
    var i = preferredIndex;
    while (i >= tokens.len) {
        i = i - 1;
    }

    tokens[i].debug(msg);
}
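
To make that concrete: lexing a script containing a stray character, say SELECT nam$ FROM y, ends up calling debug on the last good token (the identifier nam) and prints something along these lines, in addition to returning the "Bad token" error to the caller:

Last good token.

Near line 1, column 8.
SELECT nam$ FROM y
       ^ Near here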

Token <> String Mapping

Before we get too far from Token definition, let's define a mapping from the Token.kind enum to strings we can see in a query.

const Builtin = struct {
    name: String,
    kind: Token.Kind,
};

// These must be sorted by length of the name text, descending, for lexKeyword.
var BUILTINS = [_]Builtin{
    .{ .name = "CREATE TABLE", .kind = Token.Kind.create_table_keyword },
    .{ .name = "INSERT INTO", .kind = Token.Kind.insert_keyword },
    .{ .name = "SELECT", .kind = Token.Kind.select_keyword },
    .{ .name = "VALUES", .kind = Token.Kind.values_keyword },
    .{ .name = "WHERE", .kind = Token.Kind.where_keyword },
    .{ .name = "FROM", .kind = Token.Kind.from_keyword },
    .{ .name = "||", .kind = Token.Kind.concat_operator },
    .{ .name = "=", .kind = Token.Kind.equal_operator },
    .{ .name = "+", .kind = Token.Kind.plus_operator },
    .{ .name = "<", .kind = Token.Kind.lt_operator },
    .{ .name = "(", .kind = Token.Kind.left_paren_syntax },
    .{ .name = ")", .kind = Token.Kind.right_paren_syntax },
    .{ .name = ",", .kind = Token.Kind.comma_syntax },
};

We'll use this in a few lexing functions below.

Whitespace

Outside of tokens, we need to be able to skip past whitespace.

fn eatWhitespace(source: String, index: usize) usize {
    var res = index;
    while (source[res] == ' ' or
        source[res] == '\n' or
        source[res] == '\t' or
        source[res] == '\r')
    {
        res = res + 1;
        if (res == source.len) {
            break;
        }
    }

    return res;
}

All lexing functions will look like this. They'll take the source as one argument and a cursor to the current index in the source as another.

Keywords

Let's handle lexing keyword tokens next. Keywords are case insensitive. I don't think there's a builtin case insensitive string comparison function in Zig. So let's write that first.

fn asciiCaseInsensitiveEqual(left: String, right: String) bool {
    var min = left;
    if (right.len < left.len) {
        min = right;
    }

    for (min) |_, i| {
        var l = left[i];
        if (l >= 97 and l <= 122) {
            l = l - 32;
        }

        var r = right[i];
        if (r >= 97 and r <= 122) {
            r = r - 32;
        }

        if (l != r) {
            return false;
        }
    }

    return true;
}

Unfortunately it only supports ASCII for now.
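
A couple of quick checks of how it behaves:

// asciiCaseInsensitiveEqual("select", "SELECT") -> true   (97..122 is 'a'..'z'; subtracting 32 uppercases)
// asciiCaseInsensitiveEqual("Select", "SELECT") -> true
// asciiCaseInsensitiveEqual("INSERT", "SELECT") -> false

Note that it only compares up to the length of the shorter string; that's fine for our purposes because lexKeyword below always passes slices of equal length.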

Now we can write a simple longest-matching-substring function. It is simple because the keyword mapping we set up above is already ordered by length descending.

fn lexKeyword(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var longestLen: usize = 0;
    var kind = Token.Kind.select_keyword;
    for (BUILTINS) |builtin| {
        if (index + builtin.name.len >= source.len) {
            continue;
        }

        if (asciiCaseInsensitiveEqual(source[index .. index + builtin.name.len], builtin.name)) {
            longestLen = builtin.name.len;
            kind = builtin.kind;
            // First match is the longest match
            break;
        }
    }

    if (longestLen == 0) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = index + longestLen,
        .token = Token{
            .source = source,
            .start = index,
            .end = index + longestLen,
            .kind = kind,
        },
    };
}

That's it!
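
For example (a sketch of the behavior, not code from the post):

// lexKeyword("SELECT * FROM y", 0)
//   -> token kind select_keyword covering "SELECT", nextPosition 6
// lexKeyword("age < 40", 0)
//   -> no keyword starts here, so .{ .nextPosition = 0, .token = null }

One subtlety: because of the index + builtin.name.len >= source.len check, a keyword that runs flush against the very end of the source won't match. In practice the scripts fed in via echo end with a newline, so this doesn't come up.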

Integers

For integers we read through the source until we stop seeing decimal digits. Obviously this is a subset of what people consider integers, but it will do for now!

fn lexInteger(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var start = index;
    var end = index;
    var i = index;
    while (source[i] >= '0' and source[i] <= '9') {
        end = end + 1;
        i = i + 1;
    }

    if (start == end) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = end,
        .token = Token{
            .source = source,
            .start = start,
            .end = end,
            .kind = Token.Kind.integer,
        },
    };
}
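
For example:

// lexInteger("2010, 38", 0) -> token "2010" (kind integer), nextPosition 4
// lexInteger("name", 0)     -> not a digit, so .{ .nextPosition = 0, .token = null }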

Strings

Strings are enclosed in single quotes.

fn lexString(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var i = index;
    if (source[i] != '\'') {
        return .{ .nextPosition = 0, .token = null };
    }
    i = i + 1;

    var start = i;
    var end = i;
    while (source[i] != '\'') {
        end = end + 1;
        i = i + 1;
    }

    if (source[i] == '\'') {
        i = i + 1;
    }

    if (start == end) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = i,
        .token = Token{
            .source = source,
            .start = start,
            .end = end,
            .kind = Token.Kind.string,
        },
    };
}
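
For example:

// lexString("'Gary', 'Teej'", 0) -> token "Gary" (kind string, quotes excluded), nextPosition 6
// lexString("Gary", 0)           -> no opening quote, so .{ .nextPosition = 0, .token = null }

(Note that an empty literal '' also hits the start == end case and lexes as no token.)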

Identifiers

Identifiers for this project are unquoted runs of ASCII letters (plus *, so that SELECT * can lex). We could support more by optionally checking for double-quote-enclosed strings. But I'll leave that as an exercise for the reader.

fn lexIdentifier(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var start = index;
    var end = index;
    var i = index;
    while ((source[i] >= 'a' and source[i] <= 'z') or
        (source[i] >= 'A' and source[i] <= 'Z') or
        (source[i] == '*'))
    {
        end = end + 1;
        i = i + 1;
    }

    if (start == end) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = end,
        .token = Token{
            .source = source,
            .start = start,
            .end = end,
            .kind = Token.Kind.identifier,
        },
    };
}
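
For example:

// lexIdentifier("name, age", 0) -> token "name" (kind identifier), nextPosition 4
// lexIdentifier("* FROM y", 0)  -> token "*" (kind identifier), nextPosition 1
// lexIdentifier("42", 0)        -> no letters, so .{ .nextPosition = 0, .token = null }

Treating * as an identifier is what lets SELECT * lex at all.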

lex

Now we can pull together all these helper functions in a public entrypoint for lexing.

It will loop through a query string, eating whitespace and checking for tokens. It will continue until it hits the end of the query string. If it ever can't make progress, it fails.

pub fn lex(source: String, tokens: *std.ArrayList(Token)) ?Error {
    var i: usize = 0;
    while (true) {
        i = eatWhitespace(source, i);
        if (i >= source.len) {
            break;
        }

        const keywordRes = lexKeyword(source, i);
        if (keywordRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for keyword token";
            i = keywordRes.nextPosition;
            continue;
        }

        const integerRes = lexInteger(source, i);
        if (integerRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for integer token";
            i = integerRes.nextPosition;
            continue;
        }

        const stringRes = lexString(source, i);
        if (stringRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for string token";
            i = stringRes.nextPosition;
            continue;
        }

        const identifierRes = lexIdentifier(source, i);
        if (identifierRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for identifier token";
            i = identifierRes.nextPosition;
            continue;
        }

        if (tokens.items.len > 0) {
            debug(tokens.items, tokens.items.len - 1, "Last good token.\n");
        }
        return "Bad token";
    }

    return null;
}
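
Here's roughly how main will end up driving it, as a sketch (assuming lex.zig is imported as lex and allocator is the arena allocator from earlier):

var tokens = std.ArrayList(lex.Token).init(allocator);
const err = lex.lex("SELECT name FROM y\n", &tokens);
if (err) |e| {
    std.debug.print("Failed to lex: {s}\n", .{e});
} else {
    for (tokens.items) |token| {
        std.debug.print("{s}: '{s}'\n", .{ @tagName(token.kind), token.string() });
    }
}
// Prints:
// select_keyword: 'SELECT'
// identifier: 'name'
// from_keyword: 'FROM'
// identifier: 'y'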

That's it for lexing! Now we can do parsing.

Parsing (parse.zig, 407 LoC)

Parsing takes an array of tokens from the lexing stage and discovers the tree structure in them that maps to a predefined syntax tree (AST).

If it can't discover a valid tree from the array of tokens, it fails.

Let's set up the basics of the Parser struct:

const std = @import("std");

const lex = @import("lex.zig");
const Result = @import("types.zig").Result;

const Token = lex.Token;

pub const Parser = struct {
    allocator: std.mem.Allocator,

    pub fn init(allocator: std.mem.Allocator) Parser {
        return Parser{ .allocator = allocator };
    }

    fn expectTokenKind(tokens: []Token, index: usize, kind: Token.Kind) bool {
        if (index >= tokens.len) {
            return false;
        }

        return tokens[index].kind == kind;
    }

Expressions

Expressions are at the bottom of the syntax tree.

They can be:

  • Literals (like strings, integers, booleans, etc.)
  • Binary operations

Let's define these in Zig:

    pub const BinaryOperationAST = struct {
        operator: Token,
        left: *ExpressionAST,
        right: *ExpressionAST,

        fn print(self: BinaryOperationAST) void {
            self.left.print();
            std.debug.print(" {s} ", .{self.operator.string()});
            self.right.print();
        }
    };

    pub const ExpressionAST = union(enum) {
        literal: Token,
        binary_operation: BinaryOperationAST,

        fn print(self: ExpressionAST) void {
            switch (self) {
                .literal => |literal| switch (literal.kind) {
                    .string => std.debug.print("'{s}'", .{literal.string()}),
                    else => std.debug.print("{s}", .{literal.string()}),
                },
                .binary_operation => self.binary_operation.print(),
            }
        }
    };
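
For example, the expression age < 40 from the queries at the top ends up shaped roughly like this (shown as pseudo-Zig; the left and right sides are heap-allocated in practice, since the struct stores pointers):

// ExpressionAST{
//     .binary_operation = BinaryOperationAST{
//         .operator = (the `<` token),
//         .left = &ExpressionAST{ .literal = (the `age` identifier token) },
//         .right = &ExpressionAST{ .literal = (the `40` integer token) },
//     },
// }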

Now we can attempt to parse either of these from an array of tokens.

    fn parseExpression(self: Parser, tokens: []Token, index: usize) Result(struct {
        ast: ExpressionAST,
        nextPosition: usize,
    }) {
        var i = index;

        var e: ExpressionAST = undefined;

        if (expectTokenKind(tokens, i, Token.Kind.integer) or
            expectTokenKind(tokens, i, Token.Kind.identifier) or
            expectTokenKind(tokens, i, Token.Kind.string))
        {
            e = ExpressionAST{ .literal = tokens[i] };
            i = i + 1;
        } else {
            return .{ .err = "No expression" };
        }

        if (expectTokenKind(tokens, i, Token.Kind.equal_operator) or
            expectTokenKind(tokens, i, Token.Kind.lt_opera... (truncated)

November 11, 2022

November 08, 2022