May 01, 2020
Writing a SQL database from scratch in Go: 3. indexes
Previously in database basics:
1. SELECT, INSERT, CREATE and a REPL
2. binary expressions and WHERE filters
Next in database basics:
4. a database/sql driver
In this post, we extend gosql to support indexes. We focus on the addition of PRIMARY KEY constraints on table creation and some easy optimizations during SELECT statements.
$ go run cmd/main.go
Welcome to gosql.
# CREATE TABLE users (id INT PRIMARY KEY, name TEXT, age INT);
ok
# \d users
Table "users"
Column | Type | Nullable
---------+---------+-----------
id | integer | not null
name | text |
age | integer |
Indexes:
"users_pkey" PRIMARY KEY, rbtree ("id")
This post will broadly be a discussion of this commit.
What is an index?
An index is a mapping of a value to a row in a table. The value is often a column, but it can be many kinds of expressions. Databases typically store indexes in tree structures that provide O(log(n)) lookup time. When SELECTing and filtering on a column that is indexed, a database can greatly improve lookup time by filtering first on this index. Without an index, a database must do a linear scan for matching rows. Though sometimes, if a condition is broad enough, even with an index a database may still end up doing a linear scan.
While it may make sense initially to map a value to a row using a hash table for constant lookup times, hash tables don't provide ordering. So this would prevent an index from being applicable on anything but equality checks. For example, SELECT x FROM y WHERE x > 2 couldn't use a hash index on x.
Indexes in many SQL databases default to a B-Tree, which offers efficient ordering of elements. These indexes are thus not constant-time lookups even if filtering on a unique column for a single item. Some databases, like PostgreSQL, allow you to use a hash-based index instead of a tree. Here the previously listed restrictions apply (i.e. only equality checks will use the index).
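To make that trade-off concrete, here is a small sketch (not part of gosql) comparing the two shapes of index, using the same GoLLRB tree library the rest of this post uses. The intItem type is made up for illustration; since the keys are integers, "x > 2" becomes an ascent starting at 3.

package main

import (
	"fmt"

	"github.com/petar/GoLLRB/llrb"
)

// intItem is a hypothetical key type; llrb only requires a Less method.
type intItem int

func (a intItem) Less(b llrb.Item) bool { return a < b.(intItem) }

func main() {
	values := []int{9, 2, 5, 7, 1}

	// A hash index only helps with equality: answering "x > 2" still
	// means looking at every key.
	hashIndex := map[int]bool{}
	for _, v := range values {
		hashIndex[v] = true
	}
	for v := range hashIndex {
		if v > 2 {
			fmt.Println("hash scan match:", v)
		}
	}

	// An ordered tree lets us seek to the boundary and walk forward,
	// visiting keys in sorted order from that point on.
	treeIndex := llrb.New()
	for _, v := range values {
		treeIndex.InsertNoReplace(intItem(v))
	}
	treeIndex.AscendGreaterOrEqual(intItem(3), func(i llrb.Item) bool {
		fmt.Println("tree scan match:", int(i.(intItem)))
		return true
	})
}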
Upgrading gosql
We proceed as follows:
- Upgrade table creation to support specifying a primary key
- Pick a tree data structure for the index, adding it to the table
- Upgrade INSERTs to let any indexes on the table process the new row
- Upgrade SELECTs to make use of any indexes, if possible
Upgrading table creation
To allow the specification of a single column as the primary key when creating a table, we have to first modify the lexer and parser.
Lexing/parsing
Since we've covered this process a few times already, suffice it to say we make the following key additions:
- Add PRIMARY KEY as a new keyword token to the lexer
- Add a check for this token to the parsing of column definitions
- Modify the AST to store a boolean value indicating whether a column is a primary key (a rough sketch follows this list)
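None of the lexer or parser changes are shown in this post, but their shape is roughly the following toy sketch. The types and the parseColumn function here are made up for illustration; the real gosql parser works on a token stream rather than strings.

package main

import (
	"fmt"
	"strings"
)

// Hypothetical, simplified stand-ins for gosql's lexer/AST types.
type token struct{ value string }

type columnDefinition struct {
	name       token
	datatype   token
	primaryKey bool // new: set when the column is declared PRIMARY KEY
}

// parseColumn is a toy parser for "<name> <type> [PRIMARY KEY]".
func parseColumn(def string) columnDefinition {
	parts := strings.Fields(strings.ToLower(def))
	col := columnDefinition{name: token{parts[0]}, datatype: token{parts[1]}}
	if strings.Join(parts[2:], " ") == "primary key" {
		col.primaryKey = true
	}
	return col
}

func main() {
	fmt.Printf("%+v\n", parseColumn("id INT PRIMARY KEY"))
	fmt.Printf("%+v\n", parseColumn("name TEXT"))
}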
In-memory backend
Next we move on to handling a primary key during table creation.
Since there are many existing papers and blogs on implementing tree data structures, we will import an open-source implementation. And while most databases use a B-Tree, the most important properties of the tree for our purposes are 1) efficient ordering and 2) optional support for duplicate keys. We go with a Red-Black Tree implementation, GoLLRB.
The full definition of an index now includes:
- A name
- An expression (at first we only support this being an identifier referring to a column)
- A unique flag
- A type name (it will just be rbtree for now)
- A primary key flag (so we know to apply null checks, among other things)
- And the actual tree itself
type index struct {
name string
exp expression
unique bool
primaryKey bool
tree *llrb.LLRB
typ string
}
When we create a table, we add an index if one of the columns is a primary key. We call out to a new public method, CreateIndex, that will handle actually setting things up.
func (mb *MemoryBackend) CreateTable(crt *CreateTableStatement) error {
if _, ok := mb.tables[crt.name.value]; ok {
return ErrTableAlreadyExists
}
t := createTable()
t.name = crt.name.value
mb.tables[t.name] = t
if crt.cols == nil {
return nil
}
var primaryKey *expression = nil
for _, col := range *crt.cols {
t.columns = append(t.columns, col.name.value)
var dt ColumnType
switch col.datatype.value {
case "int":
dt = IntType
case "text":
dt = TextType
case "boolean":
dt = BoolType
default:
delete(mb.tables, t.name)
return ErrInvalidDatatype
}
if col.primaryKey {
if primaryKey != nil {
delete(mb.tables, t.name)
return ErrPrimaryKeyAlreadyExists
}
primaryKey = &expression{
literal: &col.name,
kind: literalKind,
}
}
t.columnTypes = append(t.columnTypes, dt)
}
if primaryKey != nil {
err := mb.CreateIndex(&CreateIndexStatement{
table: crt.name,
name: token{value: t.name + "_pkey"},
unique: true,
primaryKey: true,
exp: *primaryKey,
})
if err != nil {
delete(mb.tables, t.name)
return err
}
}
return nil
}
Implementing CreateIndex
is just a matter of adding a new
index to the table.
func (mb *MemoryBackend) CreateIndex(ci *CreateIndexStatement) error {
table, ok := mb.tables[ci.table.value]
if !ok {
return ErrTableDoesNotExist
}
for _, index := range table.indexes {
if index.name == ci.name.value {
return ErrIndexAlreadyExists
}
}
index := &index{
exp: ci.exp,
unique: ci.unique,
primaryKey: ci.primaryKey,
name: ci.name.value,
tree: llrb.New(),
typ: "rbtree",
}
table.indexes = append(table.indexes, index)
return nil
}
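As a quick sanity check of the behavior above, here is a small trial program. It assumes the public API used by the benchmark program later in this post (Parser.Parse and a CreateTableStatement field on parsed statements, by analogy with InsertStatement and SelectStatement); the create helper is hypothetical.

package main

import (
	"fmt"

	"github.com/eatonphil/gosql"
)

// create parses and runs a single CREATE TABLE statement (assumed API).
func create(mb gosql.Backend, parser *gosql.Parser, stmt string) error {
	ast, err := parser.Parse(stmt)
	if err != nil {
		return err
	}
	return mb.CreateTable(ast.Statements[0].CreateTableStatement)
}

func main() {
	mb := gosql.NewMemoryBackend()
	parser := gosql.Parser{}

	// The first table succeeds and gets a users_pkey index.
	fmt.Println(create(mb, &parser, "CREATE TABLE users (id INT PRIMARY KEY, name TEXT)"))

	// Declaring two primary keys should return ErrPrimaryKeyAlreadyExists.
	fmt.Println(create(mb, &parser, "CREATE TABLE t (a INT PRIMARY KEY, b INT PRIMARY KEY)"))
}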
And that's it for creation of tables and indexes! Table creation is also the last time we need to make changes to the gosql frontend. The rest of the changes simply wrap existing insertion and selection.
Upgrading INSERT
When a row is inserted into a table, each index on that table needs to process the row so it can add value-to-row mappings to the index.
In the project code, you'll notice logic in CreateIndex
to also go back over all existing rows to add them to the new index.
This post omits further discussing the case where an index is
created after a table is created. After reading this post, that case
should be easy to follow.
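That backfill amounts to something like the following sketch, using the addRow method defined just below; the helper function is hypothetical, since the real CreateIndex inlines the equivalent loop.

// Sketch only: add every existing row of a table to a newly created index.
func backfillIndex(t *table, idx *index) error {
	for rowIndex := range t.rows {
		if err := idx.addRow(t, uint(rowIndex)); err != nil {
			return err
		}
	}
	return nil
}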
Adding a row to an index is a matter of evaluating the index expression against that row and storing the resulting value in the tree. Along with the value, we store the integer index of the row in the table.
If the index is required to be unique, we first check that the value does not yet exist.
func (i *index) addRow(t *table, rowIndex uint) error {
indexValue, _, _, err := t.evaluateCell(rowIndex, i.exp)
if err != nil {
return err
}
if indexValue == nil {
return ErrViolatesNotNullConstraint
}
if i.unique && i.tree.Has(treeItem{value: indexValue}) {
return ErrViolatesUniqueConstraint
}
i.tree.InsertNoReplace(treeItem{
value: indexValue,
index: rowIndex,
})
return nil
}
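The tree stores treeItem values, a type the excerpts in this post never define. Based on how it is used (bytes.Equal and bytes.Compare on .value, and .index as the row's position), a plausible definition looks roughly like this sketch; the real definition in gosql may differ in detail.

// Assumed shape of treeItem, inferred from its usage in this post.
type treeItem struct {
	value memoryCell // the evaluated index expression, serialized to bytes
	index uint       // position of the row in table.rows
}

// Less orders entries by their serialized value so the tree keeps the index
// sorted; this comparison is an assumption based on the bytes.Compare calls
// in newTableFromSubset below.
func (ti treeItem) Less(than llrb.Item) bool {
	return bytes.Compare(ti.value, than.(treeItem).value) < 0
}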
And that's it for insertion!
Upgrading SELECT
Until now, the logic for selecting rows from a table is to pick the table and iterate over all rows. If a row does not match the WHERE filter, we skip it.
If the table has an index and we are using the index in a recognized
pattern in the WHERE
AST (more on that later), we can
pre-filter the table based on the index before iterating over each
row. We can do this for each index and for each time a recognized
pattern shows up.
This process is called query planning. We build a simplified version of what you may see in SQL databases, specifically focusing on index usage since we don't yet support JOINs. For further reading, SQLite has an excellent document on their query planner for index usage.
func (mb *MemoryBackend) Select(slct *SelectStatement) (*Results, error) {
t := createTable()
if slct.from != nil {
var ok bool
t, ok = mb.tables[slct.from.value]
if !ok {
return nil, ErrTableDoesNotExist
}
}
if slct.item == nil || len(*slct.item) == 0 {
return &Results{}, nil
}
results := [][]Cell{}
columns := []ResultColumn{}
if slct.from == nil {
t = createTable()
t.rows = [][]memoryCell{{}}
}
for _, iAndE := range t.getApplicableIndexes(slct.where) {
index := iAndE.i
exp := iAndE.e
t = index.newTableFromSubset(t, exp)
}
for i := range t.rows {
result := []Cell{}
isFirstRow := len(results) == 0
if slct.where != nil {
val, _, _, err := t.evaluateCell(uint(i), *slct.where)
if err != nil {
return nil, err
}
if !*val.AsBool() {
continue
}
}
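// finalItems is the final list of selected expressions evaluated for each
// row; its construction is not shown in this excerpt.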
for _, col := range finalItems {
value, columnName, columnType, err := t.evaluateCell(uint(i), *col.exp)
if err != nil {
return nil, err
}
if isFirstRow {
columns = append(columns, ResultColumn{
Type: columnType,
Name: columnName,
})
}
result = append(result, value)
}
results = append(results, result)
}
return &Results{
Columns: columns,
Rows: results,
}, nil
}
It's very simple and easy to miss, so here is the change called out:
for _, iAndE := range t.getApplicableIndexes(slct.where) {
index := iAndE.i
exp := iAndE.e
t = index.newTableFromSubset(t, exp)
}
getApplicableIndexes
There are probably a few very simple patterns we could look for, but
for now we look for boolean expressions joined by AND
that contain an index expression.
func (t *table) getApplicableIndexes(where *expression) []indexAndExpression {
var linearizeExpressions func(where *expression, exps []expression) []expression
linearizeExpressions = func(where *expression, exps []expression) []expression {
if where == nil || where.kind != binaryKind {
return exps
}
if where.binary.op.value == string(orKeyword) {
return exps
}
if where.binary.op.value == string(andKeyword) {
exps := linearizeExpressions(&where.binary.a, exps)
return linearizeExpressions(&where.binary.b, exps)
}
return append(exps, *where)
}
exps := linearizeExpressions(where, []expression{})
iAndE := []indexAndExpression{}
for _, exp := range exps {
for _, index := range t.indexes {
if index.applicableValue(exp) != nil {
iAndE = append(iAndE, indexAndExpression{
i: index,
e: exp,
})
}
}
}
return iAndE
}
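To make the linearization concrete, here is a small self-contained sketch of the same idea on a simplified expression type (not gosql's): AND nodes are flattened, any OR subtree is dropped, and everything else becomes a candidate for matching an index.

package main

import "fmt"

type expr struct {
	op   string // "AND", "OR", or a comparison like ">" or "="
	a, b *expr
	text string // leaf description, for printing
}

func linearize(e *expr, out []*expr) []*expr {
	if e == nil {
		return out
	}
	switch e.op {
	case "OR":
		return out // an OR means this subtree can't be used to narrow rows
	case "AND":
		return linearize(e.b, linearize(e.a, out))
	default:
		return append(out, e)
	}
}

func main() {
	// x > 2 AND (y = 3 OR z = 4) AND name = 'kate'
	where := &expr{op: "AND",
		a: &expr{op: "AND",
			a: &expr{op: ">", text: "x > 2"},
			b: &expr{op: "OR", text: "y = 3 OR z = 4"},
		},
		b: &expr{op: "=", text: "name = 'kate'"},
	}
	for _, e := range linearize(where, nil) {
		fmt.Println(e.text) // prints "x > 2" and "name = 'kate'"
	}
}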
More specifically though, within binary operations we only support matching on an index if the following three conditions are met:
- the operator is a comparison: equals, not equals, greater than, less than, greater than or equal, or less than or equal (the supportedChecks list in the code below)
- one of the operands is an identifier literal that matches the index's exp value
- the other operand is a literal value
This is a simpler, stricter matching of an index than PostgreSQL, where you can index expressions more generally, not just identifier literals.
func (i *index) applicableValue(exp expression) *expression {
if exp.kind != binaryKind {
return nil
}
be := exp.binary
// Find the column and the value in the boolean expression
columnExp := be.a
valueExp := be.b
if columnExp.generateCode() != i.exp.generateCode() {
columnExp = be.b
valueExp = be.a
}
// Neither side is applicable, return nil
if columnExp.generateCode() != i.exp.generateCode() {
return nil
}
supportedChecks := []symbol{eqSymbol, neqSymbol, gtSymbol, gteSymbol, ltSymbol, lteSymbol}
supported := false
for _, sym := range supportedChecks {
if string(sym) == be.op.value {
supported = true
break
}
}
if !supported {
return nil
}
if valueExp.kind != literalKind {
fmt.Println("Only index checks on literals supported")
return nil
}
return &valueExp
}
And that's it for finding applicable indexes.
newTableFromSubset
The last remaining piece is to go from a boolean expression in
a WHERE
clause (where an index is applicable) to a subset
of rows in a table.
Since we are only working with patterns of the type indexed-column OP literal-value, we grab the literal using the previous applicableValue helper. Then we look up that literal value in the index and return a new table with every row in the index that meets the condition of the operator for the literal value.
func (i *index) newTableFromSubset(t *table, exp expression) *table {
valueExp := i.applicableValue(exp)
if valueExp == nil {
return t
}
value, _, _, err := createTable().evaluateCell(0, *valueExp)
if err != nil {
fmt.Println(err)
return t
}
tiValue := treeItem{value: value}
indexes := []uint{}
switch symbol(exp.binary.op.value) {
case eqSymbol:
i.tree.AscendGreaterOrEqual(tiValue, func(i llrb.Item) bool {
ti := i.(treeItem)
if !bytes.Equal(ti.value, value) {
return false
}
indexes = append(indexes, ti.index)
return true
})
case neqSymbol:
i.tree.AscendGreaterOrEqual(llrb.Inf(-1), func(i llrb.Item) bool {
ti := i.(treeItem)
if !bytes.Equal(ti.value, value) {
indexes = append(indexes, ti.index)
}
return true
})
case ltSymbol:
i.tree.DescendLessOrEqual(tiValue, func(i llrb.Item) bool {
ti := i.(treeItem)
if bytes.Compare(ti.value, value) < 0 {
indexes = append(indexes, ti.index)
}
return true
})
case lteSymbol:
i.tree.DescendLessOrEqual(tiValue, func(i llrb.Item) bool {
ti := i.(treeItem)
if bytes.Compare(ti.value, value) <= 0 {
indexes = append(indexes, ti.index)
}
return true
})
case gtSymbol:
i.tree.AscendGreaterOrEqual(tiValue, func(i llrb.Item) bool {
ti := i.(treeItem)
if bytes.Compare(ti.value, value) > 0 {
indexes = append(indexes, ti.index)
}
return true
})
case gteSymbol:
i.tree.AscendGreaterOrEqual(tiValue, func(i llrb.Item) bool {
ti := i.(treeItem)
if bytes.Compare(ti.value, value) >= 0 {
indexes = append(indexes, ti.index)
}
return true
})
}
newT := createTable()
newT.columns = t.columns
newT.columnTypes = t.columnTypes
newT.indexes = t.indexes
newT.rows = [][]memoryCell{}
for _, index := range indexes {
newT.rows = append(newT.rows, t.rows[index])
}
return newT
}
As you can see, an index may not necessarily improve on a linear
search in some conditions. Imagine a table of 1 million rows indexed
on an autoincrementing column. Imagine filtering on col > 10. The index may be able to eliminate 10 items but still return a pre-filtered table of around 1 million rows that must be passed through the WHERE filter.
Additionally, since we process each boolean expression one at a time, we can't take advantage of knowledge that might seem obvious to a human when two boolean expressions together bound a range. For example, in x > 10 AND x < 20 we can see that only integers from 11 to 19 are applicable. But the current logic goes through each expression separately and finds all rows matching either one before the final linear search through all pre-filtered rows eliminates the bulk.
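If we did want to exploit a pair of bounds, GoLLRB does expose a bounded iterator. The following is only a sketch of the idea, not something gosql does, and it assumes GoLLRB's AscendRange method, which visits keys greater than or equal to the first pivot and less than the second; rowsInRange is a hypothetical helper.

// Sketch only: answering "x > lo AND x < hi" with one bounded tree scan
// instead of two separate pre-filters.
func (i *index) rowsInRange(lo, hi memoryCell) []uint {
	matches := []uint{}
	i.tree.AscendRange(treeItem{value: lo}, treeItem{value: hi}, func(it llrb.Item) bool {
		ti := it.(treeItem)
		// The lower bound itself is excluded because the predicate is strict (>).
		if !bytes.Equal(ti.value, lo) {
			matches = append(matches, ti.index)
		}
		return true
	})
	return matches
}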
Thankfully, real databases have decades of optimizations. But even then it can be difficult to know what index usages are being optimized without reading documentation, benchmarking, using EXPLAIN ANALYSE, or reading the source.
But that's it for changes needed to support basic indexes end-to-end!
Trialing an index
Since the addition of indexes is so seamless, it is difficult to tell without a trial whether an index is effective. So we write a simple program that inserts N rows, with and without an index, and then queries for the first and last items inserted. We show the time and memory used during both insertion and selection.
package main
import (
"fmt"
"os"
"runtime"
"strconv"
"time"
"github.com/eatonphil/gosql"
)
var inserts = 0
var lastId = 0
var firstId = 0
func doInsert(mb gosql.Backend) {
parser := gosql.Parser{}
for i := 0; i < inserts; i++ {
lastId = i
if i == 0 {
firstId = lastId
}
ast, err := parser.Parse(fmt.Sprintf("INSERT INTO users VALUES (%d)", lastId))
if err != nil {
panic(err)
}
err = mb.Insert(ast.Statements[0].InsertStatement)
if err != nil {
panic(err)
}
}
}
func doSelect(mb gosql.Backend) {
parser := gosql.Parser{}
ast, err := parser.Parse(fmt.Sprintf("SELECT id FROM users WHERE id = %d", lastId))
if err != nil {
panic(err)
}
r, err := mb.Select(ast.Statements[0].SelectStatement)
if err != nil {
panic(err)
}
if len(r.Rows) != 1 {
panic("Expected 1 row")
}
if int(*r.Rows[0][0].AsInt()) != inserts-1 {
panic(fmt.Sprintf("Bad row, got: %d", *r.Rows[0][0].AsInt()))
}
ast, err = parser.Parse(fmt.Sprintf("SELECT id FROM users WHERE id = %d", firstId))
if err != nil {
panic(err)
}
r, err = mb.Select(ast.Statements[0].SelectStatement)
if err != nil {
panic(err)
}
if len(r.Rows) != 1 {
panic("Expected 1 row")
}
if int(*r.Rows[0][0].AsInt()) != 0 {
panic(fmt.Sprintf("Bad row, got: %d", *r.Rows[0][0].AsInt()))
}
}
func perf(name string, b gosql.Backend, cb func(b gosql.Backend)) {
start := time.Now()
fmt.Println("Starting", name)
cb(b)
fmt.Printf("Finished %s: %f seconds\n", name, time.Since(start).Seconds())
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Alloc = %d MiB\n\n", m.Alloc/1024/1024)
}
func main() {
mb := gosql.NewMemoryBackend()
index := false
for i, arg := range os.Args {
if arg == "--with-index" {
index = true
}
if arg == "--inserts" {
inserts, _ = strconv.Atoi(os.Args[i+1])
}
}
primaryKey := ""
if index {
primaryKey = " PRIMARY KEY"
}
parser := gosql.Parser{}
ast, err ... (truncated)
The Fremen – What our team is reading
April 29, 2020
Announcing Vitess 6
ACID Transactions are not just for banks — the Vitess approach
April 28, 2020
Linear Time Liveness Analysis
import subprocess
from timeit import default_timer as timer
def doTest(size):
    with open("foo.cpp", "w") as out:
        print("int foo(int x) {", file=out)
        for s in range(size):
            p = "x" if s == 0 else f'l{s-1}'
            print(f'int l{s}; if (__builtin_sadd_overflow({p},1,&l{s})) goto error;', file=out)
        print(f'return l{size-1};error: throw;}}', file=out)
    start = timer()
    subprocess.run(["gcc", "-c", "foo.cpp"])
    stop = timer()
    print(size, ": ", (stop-start))

for size in [10, 100, 1000, 10000, 100000]:
    doTest(size)
It generates one function with n statements of the form "int lX; if (__builtin_sadd_overflow(lY,1,&lX)) goto error;", which are basically just n additions with overflow checks, and then measures the compile time. The generated code is conceptually very simple, but it contains a lot of basic blocks due to the large number of ifs. When compiling with gcc we get the following compile times:
n | 10 | 100 | 1,000 | 10,000 | 100,000 |
compilation [s] | 0.02 | 0.04 | 0.19 | 34.99 | > 1h |
The compile time is dramatically super-linear; gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. This is disastrous when processing generated code, where we cannot easily limit the size of individual functions. In our own system we use neither gcc nor clang for query compilation, but we have the same problem, namely compiling large generated code. And super-linear runtime quickly becomes an issue when the input is large.
One particularly important problem in this context is liveness analysis, i.e., figuring out which value is alive at which part of the program. The textbook solution for that problem involves propagating liveness information for each variable across the blocks, but that is clearly super-linear and does not scale to large program sizes. We therefore developed a different approach, which we recently refined even more and which I want to present here:
Instead of propagating liveness sets or bitmasks, we study the control flow graph of the program. For a simple query with a for loop and an if within the loop it might look like this:
Now if we define a variable x in block 0 and use it in block 2, the variable has to be alive on every path between definition and usage. Obviously that includes blocks 0-2, but that is not enough. We see that there is a loop involving all blocks between 1 and 4, and we can take that loop before coming to 2. Thus, the lifetime has to be extended to include the full loop, and is therefore the range 0-4. If, however, we define a variable in 1 and use it in 2, the range is indeed 1-2, as we do not have to wrap around.
Algorithmically, we identify all loops in the program, which we can do in almost linear time, and remember how the loops are nested within each other. Then we examine each occurrence of a variable (in SSA form). In principle the lifetime is the span from the first to the last occurrence. If, however, two occurrences are in different loops, we walk "up" from the deeper loop level until both occurrences are in the same loop (or at the top level), extending the lifetime to cover the full loop while doing so. In this example x in block 0 is at the top level, while x in block 2 is at loop level 1. Thus, we leave the loop, expand 2 into 1-4, and find the lifetime to be 0-4.
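The post doesn't show code for this step, so here is a minimal, self-contained sketch of the interval-merging idea in Go (Go to stay consistent with the rest of this page). It assumes blocks are numbered in reverse postorder and that every loop covers a contiguous block range; all type and function names are made up for illustration.

package main

import "fmt"

type loop struct {
	parent     *loop
	start, end int // contiguous block range covered by the loop
}

type occurrence struct {
	block int
	loop  *loop // innermost loop containing the block; nil means top level
}

func depth(l *loop) int {
	d := 0
	for ; l != nil; l = l.parent {
		d++
	}
	return d
}

// merge widens the lifetime interval so it covers both occurrences, expanding
// to a full loop whenever the two occurrences sit in different loops.
func merge(a, b occurrence) (lo, hi int) {
	lo, hi = minInt(a.block, b.block), maxInt(a.block, b.block)
	for a.loop != b.loop {
		deeper := &a
		if depth(b.loop) > depth(a.loop) {
			deeper = &b
		}
		lo, hi = minInt(lo, deeper.loop.start), maxInt(hi, deeper.loop.end)
		deeper.loop = deeper.loop.parent
	}
	return lo, hi
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	inner := &loop{start: 1, end: 4} // the loop spanning blocks 1-4

	// x defined in block 0 (top level), used in block 2 (inside the loop):
	fmt.Println(merge(occurrence{block: 0}, occurrence{block: 2, loop: inner}))

	// defined in block 1 and used in block 2, both inside the same loop:
	fmt.Println(merge(occurrence{block: 1, loop: inner}, occurrence{block: 2, loop: inner}))
}

Running it on the example above prints 0 4 for the first case and 1 2 for the second.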
This assumes, of course, that the block numbers are meaningful. We can guarantee that by first labeling all blocks in reverse postorder. This guarantees that all dominating blocks will have a lower number than their successors. We can further improve the labeling by re-arranging the blocks such that all blocks within a certain loop are next to each other, keeping the original reverse post order within the loop. This leads to nice, tight liveness intervals. Using just an interval instead of a bitmap is of course less precise, but the block reordering makes sure that the intervals primarily contain the blocks that are indeed involved in the execution.
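For completeness, a tiny sketch of the reverse-postorder labeling itself (again in Go, with a made-up control flow graph): a DFS records blocks in postorder and the labels are handed out in reverse. A real implementation would additionally reorder blocks so that each loop's blocks stay adjacent, as described above.

package main

import "fmt"

// reversePostorder labels every reachable block so that, in the usual case,
// blocks are numbered before the blocks they dominate.
func reversePostorder(successors map[int][]int, entry int) map[int]int {
	visited := map[int]bool{}
	postorder := []int{}
	var dfs func(int)
	dfs = func(b int) {
		if visited[b] {
			return
		}
		visited[b] = true
		for _, s := range successors[b] {
			dfs(s)
		}
		postorder = append(postorder, b)
	}
	dfs(entry)

	label := map[int]int{}
	for i, b := range postorder {
		label[b] = len(postorder) - 1 - i
	}
	return label
}

func main() {
	// Blocks: 0 = entry, 1 = loop header, 2/3 = loop body, 4 = exit.
	cfg := map[int][]int{0: {1}, 1: {2, 4}, 2: {3}, 3: {1}, 4: {}}
	fmt.Println(reversePostorder(cfg, 0))
}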
Asymptotically such an interval-based approach is far superior to classical liveness propagation. All algorithms involved are linear or almost linear, and we only have to store two numbers per variable. When handling large, generated code such an approach is mandatory. And even for classical, smaller programs it is quite attractive. I looked around a bit at lectures about compiler construction, and I am somewhat surprised that nobody seems to teach similar techniques for handling large programs. When you cannot control the input size, super-linear runtime is not an option.
April 27, 2020
Life of a Vitess Cluster
April 25, 2020
MySQL Threads Running
Queries per second (QPS) measures database throughput, but it does not reflect how hard MySQL is working. The latter is measured by Threads_running, expressed as a gauge (whereas QPS is a rate). Before discussing Threads_running, let’s consider an analogy: