a curated list of database news from authoritative sources

May 07, 2020

A new dashboard for Tinybird Analytics

A key part of running an effective Analytics Platform within an organization is being able to keep a tight control over usage and performance, ingestion jobs...

May 01, 2020

Writing a SQL database from scratch in Go: 3. indexes

Previously in database basics:
1. SELECT, INSERT, CREATE and a REPL
2. binary expressions and WHERE filters

Next in database basics:
4. a database/sql driver

In this post, we extend gosql to support indexes. We focus on the addition of PRIMARY KEY constraints on table creation and some easy optimizations during SELECT statements.

$ go run cmd/main.go
Welcome to gosql.
# CREATE TABLE users (id INT PRIMARY KEY, name TEXT, age INT);
ok
# \d users
Table "users"
Column |  Type   | Nullable
-------+---------+----------
id     | integer | not null
name   | text    |
age    | integer |
Indexes:
        "users_pkey" PRIMARY KEY, rbtree ("id")

This post will broadly be a discussion of this commit.

What is an index?

An index is a mapping of a value to a row in a table. The value is often a column, but it can be many kinds of expressions. Databases typically store indexes in tree structures that provide O(log(n)) lookup time. When SELECTing and filtering on a column that is indexed, a database can greatly improve lookup time by filtering first on this index. Without an index, a database must do a linear scan for matching rows. Though sometimes if a condition is broad enough, even with an index, a database may still end up doing a linear scan.

While it may make sense initially to map a value to a row using a hash table for constant lookup times, hash tables don't provide ordering. So this would prevent an index from being applicable on anything but equality checks. For example, SELECT x FROM y WHERE x > 2 couldn't use a hash index on x.

Indexes in many SQL databases default to a B-Tree, which offers efficient ordering of elements. These indexes are thus not constant-time lookups even if filtering on a unique column for a single item. Some databases, like PostgreSQL, allow you to use a hash-based index instead of a tree. Here the previously listed restrictions apply (i.e. only equality checks will use the index).
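To make the tradeoff concrete, here is a tiny standalone sketch (illustrative only, not gosql code): a hash map answers an equality lookup like x = 5 directly, but only an ordered structure (a sorted slice stands in for a tree here) can answer a range predicate like x > 2 without scanning everything.

package main

import (
    "fmt"
    "sort"
)

func main() {
    // Hash "index": value -> row offsets. Great for equality, no help for ranges.
    rowsByValue := map[int][]int{2: {0}, 5: {1}, 9: {2}}
    fmt.Println(rowsByValue[5]) // rows where x = 5: [1]

    // Ordered "index": a range predicate becomes a binary search plus a walk.
    sortedValues := []int{2, 5, 9}
    i := sort.SearchInts(sortedValues, 3) // first position with a value >= 3
    fmt.Println(sortedValues[i:])         // values where x > 2: [5 9]
}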

Upgrading gosql

We proceed as follows:

  • Upgrade table creation to support specifying a primary key
    • Pick a tree data structure for the index, adding it to the table
  • Upgrade INSERTs to let any indexes on the table process the new row
  • Upgrade SELECTs to make use of any indexes, if possible

Upgrading table creation

To allow the specification of a single column as the primary key when creating a table, we have to first modify the lexer and parser.

Lexing/parsing

Since we've covered this process a few times already, suffice it to say that we make the additions needed to lex and parse the new PRIMARY KEY syntax.

In-memory backend

Next we move on to handling a primary key during table creation.

Since there are many existing papers and blogs on implementing tree data structures, we will import an open-source implementation. And while most databases use a B-Tree, the most important properties of the tree for our purposes are 1) efficient ordering and 2) optionally allowing duplicate keys. We go with a Red-Black Tree implementation, GoLLRB.

The full definition of an index now includes:

  • A name
  • An expression (at first we only support this being an identifier referring to a column)
  • A unique flag
  • A type name (it will just be rbtree for now)
  • A primary key flag (so we know to apply null checks among other things)
  • And the actual tree itself

type index struct {
    name       string
    exp        expression
    unique     bool
    primaryKey bool
    tree       *llrb.LLRB
    typ        string
}
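
The tree will hold treeItem values (used below in addRow and newTableFromSubset), which must satisfy GoLLRB's llrb.Item interface, a single Less method. That type isn't shown in this excerpt; a minimal sketch could look like the following, assuming memoryCell is the []byte-backed cell type from the earlier posts:

type treeItem struct {
    value memoryCell
    index uint
}

// Less orders items by their evaluated value, so equal and adjacent keys end up
// next to each other in the tree.
func (ti treeItem) Less(than llrb.Item) bool {
    return bytes.Compare(ti.value, than.(treeItem).value) < 0
}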

When we create a table, we add an index if one of the columns is a primary key. We call out to a new public method, CreateIndex, that will handle actually setting things up.

func (mb *MemoryBackend) CreateTable(crt *CreateTableStatement) error {
    if _, ok := mb.tables[crt.name.value]; ok {
        return ErrTableAlreadyExists
    }

    t := createTable()
    t.name = crt.name.value
    mb.tables[t.name] = t
    if crt.cols == nil {
        return nil
    }

    var primaryKey *expression = nil
    for _, col := range *crt.cols {
        t.columns = append(t.columns, col.name.value)

        var dt ColumnType
        switch col.datatype.value {
        case "int":
            dt = IntType
        case "text":
            dt = TextType
        case "boolean":
            dt = BoolType
        default:
            delete(mb.tables, t.name)
            return ErrInvalidDatatype
        }

        if col.primaryKey {
            if primaryKey != nil {
                delete(mb.tables, t.name)
                return ErrPrimaryKeyAlreadyExists
            }

            primaryKey = &expression{
                literal: &col.name,
                kind:    literalKind,
            }
        }

        t.columnTypes = append(t.columnTypes, dt)
    }

    if primaryKey != nil {
        err := mb.CreateIndex(&CreateIndexStatement{
            table:      crt.name,
            name:       token{value: t.name + "_pkey"},
            unique:     true,
            primaryKey: true,
            exp:        *primaryKey,
        })
        if err != nil {
            delete(mb.tables, t.name)
            return err
        }
    }

    return nil
}

Implementing CreateIndex is just a matter of adding a new index to the table.

func (mb *MemoryBackend) CreateIndex(ci *CreateIndexStatement) error {
    table, ok := mb.tables[ci.table.value]
    if !ok {
        return ErrTableDoesNotExist
    }

    for _, index := range table.indexes {
        if index.name == ci.name.value {
            return ErrIndexAlreadyExists
        }
    }

    index := &index{
        exp:        ci.exp,
        unique:     ci.unique,
        primaryKey: ci.primaryKey,
        name:       ci.name.value,
        tree:       llrb.New(),
        typ:        "rbtree",
    }
    table.indexes = append(table.indexes, index)
    return nil
}

And that's it for creation of tables and indexes! Table creation is also the last time we need to make changes to the gosql frontend. The rest of the changes simply wrap existing insertion and selection.

Upgrading INSERT

When a row is inserted into a table, each index on that table needs to process the row so it can add value-to-row mappings to the index.

In the project code, you'll notice logic in CreateIndex to also go back over all existing rows to add them to the new index. This post omits further discussing the case where an index is created after a table is created. After reading this post, that case should be easy to follow.

Adding a row to an index is a matter of evaluating the index expression against that row and storing the resulting value in the tree. Along with the value, we store the integer index of the row in the table.

If the index is required to be unique, we first check that the value does not yet exist.

func (i *index) addRow(t *table, rowIndex uint) error {
    indexValue, _, _, err := t.evaluateCell(rowIndex, i.exp)
    if err != nil {
        return err
    }

    if indexValue == nil {
        return ErrViolatesNotNullConstraint
    }

    if i.unique && i.tree.Has(treeItem{value: indexValue}) {
        return ErrViolatesUniqueConstraint
    }

    i.tree.InsertNoReplace(treeItem{
        value: indexValue,
        index: rowIndex,
    })
    return nil
}
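
The rest of the insertion path isn't shown here; conceptually, once the new row has been appended to the table, Insert hands the row's position to every index on the table. A hypothetical sketch (the surrounding names are assumptions, not the exact gosql wiring):

rowIndex := uint(len(t.rows) - 1)
for _, idx := range t.indexes {
    if err := idx.addRow(t, rowIndex); err != nil {
        // Undo the append if any index rejects the row.
        t.rows = t.rows[:len(t.rows)-1]
        return err
    }
}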

And that's it for insertion!

Upgrading SELECT

Until now, the logic for selecting rows from a table has been to pick the table and iterate over all rows, skipping any row that does not match the WHERE filter.

If the table has an index and we are using the index in a recognized pattern in the WHERE AST (more on that later), we can pre-filter the table based on the index before iterating over each row. We can do this for each index and for each time a recognized pattern shows up.

This process is called query planning. We build a simplified version of what you may see in SQL databases, specifically focusing on index usage since we don't yet support JOINs. For further reading, SQLite has an excellent document on their query planner for index usage.

func (mb *MemoryBackend) Select(slct *SelectStatement) (*Results, error) {
    t := createTable()

    if slct.from != nil {
        var ok bool
        t, ok = mb.tables[slct.from.value]
        if !ok {
            return nil, ErrTableDoesNotExist
        }
    }

    if slct.item == nil || len(*slct.item) == 0 {
        return &Results{}, nil
    }

    results := [][]Cell{}
    columns := []ResultColumn{}

    if slct.from == nil {
        t = createTable()
        t.rows = [][]memoryCell{{}}
    }

    for _, iAndE := range t.getApplicableIndexes(slct.where) {
        index := iAndE.i
        exp := iAndE.e
        t = index.newTableFromSubset(t, exp)
    }

    for i := range t.rows {
        result := []Cell{}
        isFirstRow := len(results) == 0

        if slct.where != nil {
            val, _, _, err := t.evaluateCell(uint(i), *slct.where)
            if err != nil {
                return nil, err
            }

            if !*val.AsBool() {
                continue
            }
        }

        for _, col := range finalItems {
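            // finalItems (its construction is not shown in this excerpt) holds the
            // expressions selected by the query; each is evaluated once per row.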
            value, columnName, columnType, err := t.evaluateCell(uint(i), *col.exp)
            if err != nil {
                return nil, err
            }

            if isFirstRow {
                columns = append(columns, ResultColumn{
                    Type: columnType,
                    Name: columnName,
                })
            }

            result = append(result, value)
        }

        results = append(results, result)
    }

    return &Results{
        Columns: columns,
        Rows:    results,
    }, nil
}

It's a small change and easy to miss, so here it is, called out:

    for _, iAndE := range t.getApplicableIndexes(slct.where) {
        index := iAndE.i
        exp := iAndE.e
        t = index.newTableFromSubset(t, exp)
    }

getApplicableIndexes

There are probably a few very simple patterns we could look for, but for now we look for boolean expressions joined by AND that contain an index expression. For example, in WHERE id = 2 AND age > 10 both conjuncts are candidates, while expressions joined by OR are not considered.

func (t *table) getApplicableIndexes(where *expression) []indexAndExpression {
    var linearizeExpressions func(where *expression, exps []expression) []expression
    linearizeExpressions = func(where *expression, exps []expression) []expression {
        if where == nil || where.kind != binaryKind {
            return exps
        }

        if where.binary.op.value == string(orKeyword) {
            return exps
        }

        if where.binary.op.value == string(andKeyword) {
            exps := linearizeExpressions(&where.binary.a, exps)
            return linearizeExpressions(&where.binary.b, exps)
        }

        return append(exps, *where)
    }

    exps := linearizeExpressions(where, []expression{})

    iAndE := []indexAndExpression{}
    for _, exp := range exps {
        for _, index := range t.indexes {
            if index.applicableValue(exp) != nil {
                iAndE = append(iAndE, indexAndExpression{
                    i: index,
                    e: exp,
                })
            }
        }
    }

    return iAndE
}

More specifically though, within binary operations we only support matching on an index if the following three conditions are met:

  • the operator is one of =, <>, >, <, >=, or <=
  • one of the operands is an identifier literal that matches the index's exp value
  • the other operand is a literal value

This is a simpler, stricter matching of an index than in PostgreSQL, where you can index expressions more generally, not just identifier literals. For example, WHERE id = 2 can use an index on id here, but WHERE id + 1 = 3 cannot.

func (i *index) applicableValue(exp expression) *expression {
    if exp.kind != binaryKind {
        return nil
    }

    be := exp.binary
    // Find the column and the value in the boolean expression
    columnExp := be.a
    valueExp := be.b
    if columnExp.generateCode() != i.exp.generateCode() {
        columnExp = be.b
        valueExp = be.a
    }

    // Neither side is applicable, return nil
    if columnExp.generateCode() != i.exp.generateCode() {
        return nil
    }

    supportedChecks := []symbol{eqSymbol, neqSymbol, gtSymbol, gteSymbol, ltSymbol, lteSymbol}
    supported := false
    for _, sym := range supportedChecks {
        if string(sym) == be.op.value {
            supported = true
            break
        }
    }
    if !supported {
        return nil
    }

    if valueExp.kind != literalKind {
        fmt.Println("Only index checks on literals supported")
        return nil
    }

    return &valueExp
}

And that's it for finding applicable indexes.

newTableFromSubset

The last remaining piece is to go from a boolean expression in a WHERE clause (where an index is applicable) to a subset of rows in a table.

Since we are only working with patterns of the type indexed-column OP literal-value, we grab the literal using the previous applicableValue helper. Then we look up that literal value in the index and return a new table with every row in the index that meets the condition of the operator for the literal value.

func (i *index) newTableFromSubset(t *table, exp expression) *table {
    valueExp := i.applicableValue(exp)
    if valueExp == nil {
        return t
    }

    value, _, _, err := createTable().evaluateCell(0, *valueExp)
    if err != nil {
        fmt.Println(err)
        return t
    }

    tiValue := treeItem{value: value}

    indexes := []uint{}
    switch symbol(exp.binary.op.value) {
    case eqSymbol:
        i.tree.AscendGreaterOrEqual(tiValue, func(i llrb.Item) bool {
            ti := i.(treeItem)

            if !bytes.Equal(ti.value, value) {
                return false
            }

            indexes = append(indexes, ti.index)
            return true
        })
    case neqSymbol:
        i.tree.AscendGreaterOrEqual(llrb.Inf(-1), func(i llrb.Item) bool {
            ti := i.(treeItem)
            if !bytes.Equal(ti.value, value) {
                indexes = append(indexes, ti.index)
            }

            return true
        })
    case ltSymbol:
        i.tree.DescendLessOrEqual(tiValue, func(i llrb.Item) bool {
            ti := i.(treeItem)
            if bytes.Compare(ti.value, value) < 0 {
                indexes = append(indexes, ti.index)
            }

            return true
        })
    case lteSymbol:
        i.tree.DescendLessOrEqual(tiValue, func(i llrb.Item) bool {
            ti := i.(treeItem)
            if bytes.Compare(ti.value, value) <= 0 {
                indexes = append(indexes, ti.index)
            }

            return true
        })
    case gtSymbol:
        i.tree.AscendGreaterOrEqual(tiValue, func(i llrb.Item) bool {
            ti := i.(treeItem)
            if bytes.Compare(ti.value, value) > 0 {
                indexes = append(indexes, ti.index)
            }

            return true
        })
    case gteSymbol:
        i.tree.AscendGreaterOrEqual(tiValue, func(i llrb.Item) bool {
            ti := i.(treeItem)
            if bytes.Compare(ti.value, value) >= 0 {
                indexes = append(indexes, ti.index)
            }

            return true
        })
    }

    newT := createTable()
    newT.columns = t.columns
    newT.columnTypes = t.columnTypes
    newT.indexes = t.indexes
    newT.rows = [][]memoryCell{}

    for _, index := range indexes {
        newT.rows = append(newT.rows, t.rows[index])
    }

    return newT
}

As you can see, in some conditions an index may not improve much on a linear scan. Imagine a table of 1 million rows indexed on an autoincrementing column, and a filter of col > 10. The index can only eliminate 10 rows and still returns a pre-filtered table of roughly 1 million rows that must each be passed through the WHERE filter.

Additionally, since we process each boolean expression one at a time, we can't take advantage of knowledge that is obvious to a human when two boolean expressions together bound a range. For example, in x > 10 AND x < 20 we can see that only integers from 11 to 19 are applicable. But the current logic goes through each expression separately, finding all rows that match each one, and only the final linear pass over the pre-filtered rows eliminates the bulk.

Thankfully real databases have decades of optimizations. But even then it can be difficult to know what index usages are being optimized without reading documentation, benchmarking, using EXPLAIN ANALYSE, or reading the source.

But that's it for changes needed to support basic indexes end-to-end!

Trialing an index

Since the addition of indexes is so seamless, it is difficult to tell without a trial whether an index is actually effective. So we write a simple program that inserts N rows, with and without an index, and then queries for the first and last items inserted. We show the time and memory used during both insertion and selection.

package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
    "time"

    "github.com/eatonphil/gosql"
)

var inserts = 0
var lastId = 0
var firstId = 0

func doInsert(mb gosql.Backend) {
    parser := gosql.Parser{}
    for i := 0; i < inserts; i++ {
        lastId = i
        if i == 0 {
            firstId = lastId
        }
        ast, err := parser.Parse(fmt.Sprintf("INSERT INTO users VALUES (%d)", lastId))
        if err != nil {
            panic(err)
        }

        err = mb.Insert(ast.Statements[0].InsertStatement)
        if err != nil {
            panic(err)
        }
    }
}

func doSelect(mb gosql.Backend) {
    parser := gosql.Parser{}
    ast, err := parser.Parse(fmt.Sprintf("SELECT id FROM users WHERE id = %d", lastId))
    if err != nil {
        panic(err)
    }

    r, err := mb.Select(ast.Statements[0].SelectStatement)
    if err != nil {
        panic(err)
    }

    if len(r.Rows) != 1 {
        panic("Expected 1 row")
    }

    if int(*r.Rows[0][0].AsInt()) != inserts-1 {
        panic(fmt.Sprintf("Bad row, got: %d", *r.Rows[0][0].AsInt()))
    }

    ast, err = parser.Parse(fmt.Sprintf("SELECT id FROM users WHERE id = %d", firstId))
    if err != nil {
        panic(err)
    }

    r, err = mb.Select(ast.Statements[0].SelectStatement)
    if err != nil {
        panic(err)
    }

    if len(r.Rows) != 1 {
        panic("Expected 1 row")
    }

    if int(*r.Rows[0][0].AsInt()) != 0 {
        panic(fmt.Sprintf("Bad row, got: %d", *r.Rows[0][0].AsInt()))
    }
}

func perf(name string, b gosql.Backend, cb func(b gosql.Backend)) {
    start := time.Now()
    fmt.Println("Starting", name)
    cb(b)
    fmt.Printf("Finished %s: %f seconds\n", name, time.Since(start).Seconds())

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("Alloc = %d MiB\n\n", m.Alloc/1024/1024)
}

func main() {
    mb := gosql.NewMemoryBackend()

    index := false
    for i, arg := range os.Args {
        if arg == "--with-index" {
            index = true
        }

        if arg == "--inserts" {
            inserts, _ = strconv.Atoi(os.Args[i+1])
        }
    }

    primaryKey := ""
    if index {
        primaryKey = " PRIMARY KEY"
    }

    parser := gosql.Parser{}
    ast, err ... (truncated)
                                    

The Fremen – What our team is reading

If you want to know how to work with new or limited resources, find a population that’s used to not having many alternatives.

April 29, 2020

Announcing Vitess 6

I am excited to announce the general availability of Vitess 6, the second release to follow our new accelerated release schedule. While only 12 weeks have elapsed since the previous release, it feels like a few key investments have started to pay dividends all at once. To provide some personal highlights: Improved SQL Support # Vitess now understands much more of MySQL’s syntax. We have taken the approach of studying the queries issued by common applications and frameworks, and baking them right into the end-to-end test suite.

ACID Transactions are not just for banks — the Vitess approach

Build systems that do not lose data. Vitess prevents asynchronous failure in two ways: (1) ensuring that the changes are saved locally on storage with the redo log and binary logs safely written to disk and (2) making use of semi-synchronous replication.

April 28, 2020

Linear Time Liveness Analysis

Standard compilers are usually used with hand-written programs. These programs tend to have reasonably small functions and can be processed in (nearly) linear time. Generated programs, however, can be quite large, and compilers sometimes struggle to compile them at all. This can be seen with the following (silly) demonstration script:

import subprocess
from timeit import default_timer as timer

def doTest(size):
    with open("foo.cpp", "w") as out:
        print("int foo(int x) {", file=out)
        for s in range(size):
            p = "x" if s == 0 else f'l{s-1}'
            print(f'int l{s}; if (__builtin_sadd_overflow({p},1,&l{s})) goto error;', file=out)
        print(f'return l{size-1};error: throw;}}', file=out)
    start = timer()
    subprocess.run(["gcc", "-c", "foo.cpp"])
    stop = timer()
    print(size, ": ", (stop-start))

for size in [10, 100, 1000, 10000, 100000]:
    doTest(size)

It generates one function with n statements of the form "int lX; if (__builtin_sadd_overflow(lY,1,&lX)) goto error;", which are basically just n additions with overflow checks, and then measures the compile time. The generated code is conceptually very simple, but it contains a lot of basic blocks due to the large number of ifs. When compiling with gcc we get the following compile times:

n                    10      100    1,000    10,000    100,000
compilation [s]    0.02     0.04     0.19     34.99       > 1h

The compile time is dramatically super-linear; gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. This is disastrous when processing generated code, where we cannot easily limit the size of individual functions. In our own system we use neither gcc nor clang for query compilation, but we have the same problem, namely compiling large generated code. And super-linear runtime quickly becomes an issue when the input is large.

One particularly important problem in this context is liveness analysis, i.e., figuring out which value is alive at which part of the program. The textbook solution for that problem involves propagating liveness information for each variable across the blocks, but that is clearly super-linear and does not scale to large program sizes. We therefore developed a different approach, which we recently refined even more and which I want to present here:

Instead of propagating liveness sets or bitmasks, we study the control flow graph of the program. For a simple query with a for loop and an if within the loop it might look like this:



Now if we define a variable x in block 0 and use it in block 2, the variable has to be alive on every path between definition and usage. Obviously that includes the blocks 0-2, but that is not enough. There is a loop involving all blocks between 1 and 4, and we can go around that loop before coming back to 2. Thus, the lifetime has to be extended to include the full loop, and is therefore the range 0-4. If, however, we define a variable in 1 and use it in 2, the range is indeed 1-2, as we do not have to wrap around.

Algorithmically, we identify all loops in the program, which we can do in almost linear time, and remember how the loops are nested within each other. Then, we examine each occurrence of a variable (in SSA form). In principle the lifetime is the span from the first to the last occurrence. If, however, two occurrences are in different loops, we walk "up" from the deeper loop level until both occurrences are in the same loop (or at the top level), extending the lifetime to cover each full loop we leave. In this example x in block 0 is at the top level, while x in block 2 is in loop level 1. Thus, we leave the loop, expand 2 into 1-4, and find the lifetime to be 0-4.
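A minimal sketch of this interval computation in Go, under simplifying assumptions (the names are illustrative, not from any particular compiler): blocks are already numbered in reverse postorder, each loop records its parent and the block range it covers, and every occurrence of a variable records its block and its innermost loop.

package main

import "fmt"

type loop struct {
    parent     *loop // nil for the top level
    firstBlock int   // lowest block number inside the loop
    lastBlock  int   // highest block number inside the loop
}

type occurrence struct {
    block int
    loop  *loop // innermost enclosing loop, nil at top level
}

func depthOf(l *loop) int {
    d := 0
    for ; l != nil; l = l.parent {
        d++
    }
    return d
}

// lifetime returns the [first, last] block interval of a variable: the span of
// all its occurrences, widened to whole loops until every occurrence sits in
// the same (possibly top-level) loop.
func lifetime(occs []occurrence) (first, last int) {
    first, last = occs[0].block, occs[0].block
    cur := occs[0].loop
    for _, o := range occs[1:] {
        lo, hi, l := o.block, o.block, o.loop
        for cur != l {
            // Walk the deeper side up one loop level, widening its interval to
            // cover the loop it leaves.
            if depthOf(l) >= depthOf(cur) {
                lo, hi, l = minInt(lo, l.firstBlock), maxInt(hi, l.lastBlock), l.parent
            } else {
                first, last, cur = minInt(first, cur.firstBlock), maxInt(last, cur.lastBlock), cur.parent
            }
        }
        first, last = minInt(first, lo), maxInt(last, hi)
    }
    return first, last
}

func minInt(a, b int) int {
    if a < b {
        return a
    }
    return b
}

func maxInt(a, b int) int {
    if a > b {
        return a
    }
    return b
}

func main() {
    // The example from the text: a loop covering blocks 1-4, x defined in
    // block 0 (top level) and used in block 2 (inside the loop).
    inner := &loop{parent: nil, firstBlock: 1, lastBlock: 4}
    first, last := lifetime([]occurrence{{block: 0, loop: nil}, {block: 2, loop: inner}})
    fmt.Println(first, last) // 0 4
}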

This assumes, of course, that the block numbers are meaningful. We can guarantee that by first labeling all blocks in reverse postorder. This guarantees that all dominating blocks will have a lower number than their successors. We can further improve the labeling by re-arranging the blocks such that all blocks within a certain loop are next to each other, keeping the original reverse post order within the loop. This leads to nice, tight liveness intervals. Using just an interval instead of a bitmap is of course less precise, but the block reordering makes sure that the intervals primarily contain the blocks that are indeed involved in the execution.

Asymptotically such an interval-based approach is far superior to classical liveness propagation. All algorithms involved are linear or almost linear, and we only have to store two numbers per variable. When handling large, generated code such an approach is mandatory. And even for classical, smaller programs it is quite attractive. I looked around a bit at lectures about compiler construction, and I am somewhat surprised that nobody seems to teach similar techniques for handling large programs. When you cannot control the input size, super-linear runtime is not an option.

April 27, 2020

Life of a Vitess Cluster

This post goes into the details of what goes on behind the scenes when a cluster is brought up, for example using helm or the local installation guide. This can be used both as a learning tool and troubleshooting guide. We assume that you have downloaded and installed all the necessary binaries before proceeding further. Vitess is a very flexible system that is capable of running in a variety of environments.

April 25, 2020

MySQL Threads Running

Queries per second (QPS) measures database throughput, but it does not reflect how hard MySQL is working. The latter is measured by Threads_running, expressed as a gauge (whereas QPS is a rate). Before discussing Threads_running, let’s consider an analogy:

April 23, 2020

Pulling this blog out of Planet MySQL aggregator, over community concerns

I’ve decided to pull this blog (http://code.openark.org/blog/) out of the planet.mysql.com aggregator. planet.mysql.com (formerly planetmysql.com) serves as a blog aggregator, and collects news and blog posts on various MySQL and its ecosystem topics. It collects some vendor and team blogs as well as “indie” blogs such as this one. It has traditionally been the go-to […]