a curated list of database news from authoritative sources

September 30, 2019

Administering Kubernetes is hard

Kubernetes is easy to use after some exposure; it's pretty convenient too. But it is super hard to set up.

eksctl is a good tool for folks who don't want to spend hours/days/weeks debugging VPC configuration in 1000s of lines of CloudFormation. None of the other tools seem to be that much easier to use (kops, kubeadm, etc.).

But even with EKS and eksctl you are constrained to Amazon Linux worker nodes, and the right AMIs are practically impossible to discover.

I haven't spent much time with GKE.

And while eksctl operates on the right level for developers needing to administrate small/medium-sized systems, it... doesn't exist outside EKS.

It is unfortunate that the only major container orchestration system is this complex to administer. The user-facing APIs are pretty solid and guide you toward sustainable system design, but it is really hard to see the value for most companies with medium-sized deployments that are tasked with administering Kubernetes themselves. Among serious proprietary alternatives, sure, there's ECS and Google App Engine, but they offer little advantage to someone with existing Kubernetes knowledge. The OSS alternatives don't have the adoption to seem like a good investment.

OpenStack's Magnum or OpenShift seem like possible high-level providers for a generic environment. But neither is particularly known for stability.

In all, the ecosystem has gotten friendlier. There will probably be a time in the future (3-5 years from now?) when Kubernetes is fairly easy to administer.

I'd love to hear your thoughts and experiences administering Kubernetes.

September 26, 2019

Tinybird at South Summit Madrid 2019

We will be demoing Tinybird Analytics thanks to Google for Startups at South Summit Madrid, one of the best places to connect with other technology startups.

September 03, 2019

Why does Vitess recommend 250GB per MySQL server?

Vitess has an opinionated approach to database scalability. Some of those opinions have minimal controversy, such as how durability should be provided via replication, but the one I find interesting is the 250GB per MySQL server recommendation.

Is this a physical MySQL limit?

In short: no. By "physical limit" I mean: is there a file format restriction that says databases cannot be greater than 250GB? The physical limit for InnoDB is 64TB per tablespace, and in the default configuration each table is its own tablespace.

August 31, 2019

Unit testing C code with gtest

This post covers building and testing a minimal, but still useful, C project. We'll use Google's gtest and CMake for testing C code. This will serve as a foundation for some upcoming posts/projects on programming Linux, userland networking and interpreters.

The first version of this post only included one module to test. The test/CMakeLists.txt would also only expose a single pass-fail status for all modules. The second version of this post extends the test/CMakeLists.txt to expose each test/*.cpp file as its own CMake test so that results are displayed by ctest per file. The second version also splits the original src/testy.c and include/testy/testy.h module into a widget and customer module to demonstrate the changes to the CMake configuration.

The "testy" sample project

In this project, we'll put source code in src/ and publicly exported symbols (functions, structs, etc.) in header files in include/testy/. There will be a main.c in the src/ directory. Tests are written in C++ (since gtest is a C++ testing framework) and are in the test/ directory.
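
Put together, the project tree ends up looking roughly like this:

testy/
├── CMakeLists.txt
├── CMakeLists.txt.in
├── include/
│   └── testy/
│       ├── customer.h
│       └── widget.h
├── src/
│   ├── customer.c
│   ├── main.c
│   └── widget.c
└── test/
    ├── CMakeLists.txt
    ├── customer.cpp
    ├── main.cpp
    └── widget.cpp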

Here's an overview of the source and test code.

src/widget.c

This file has some library code that we should be able to test.

#include "testy/widget.h"

static int private_ok_value = 2;

int widget_ok(int a, int b) {
  return a + b == private_ok_value;
}

include/testy/widget.h

This file declares the exported symbols for the widget code.

#ifndef TESTY_WIDGET_H
#define TESTY_WIDGET_H

int widget_ok(int, int);

#endif

src/customer.c

This file has some more library code that we should be able to test.

#include "testy/customer.h"

int customer_check(int a) {
  return a == 5;
}

include/testy/customer.h

This file declares the exported symbols for the customer code.

#ifndef TESTY_CUSTOMER_H
#define TESTY_CUSTOMER_H

int customer_check(int);

#endif

src/main.c

This is the entrypoint to a program built around libtesty.

#include "testy/customer.h"
#include "testy/widget.h"

int main() {
  if (widget_ok(1, 1)) {
    return customer_check(5);
  }

  return 0;
}

test/widget.cpp

This is one of our test files. It registers test cases and uses gtest to make assertions. We need to wrap the testy/widget.h include in an extern "C" block to prevent C++ from name-mangling the C symbols.

#include "gtest/gtest.h"

extern "C" {
#include "testy/widget.h"
}

TEST(widget, ok) {
  ASSERT_EQ(widget_ok(1, 1), 1);
}

TEST(widget, not_ok) {
  ASSERT_EQ(widget_ok(1, 2), 0);
}

You can see a good high-level overview of gtest testing utilities like ASSERT_EQ and TEST here.

test/customer.cpp

This is another one of our test files.

#include "gtest/gtest.h"

extern "C" {
#include "testy/customer.h"
}

TEST(customer, ok) {
  ASSERT_EQ(customer_check(5), 1);
}

TEST(customer, not_ok) {
  ASSERT_EQ(customer_check(0), 0);
}

test/main.cpp

This is a standard entrypoint for the test runner.

#include "gtest/gtest.h"

int main(int argc, char **argv) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}

Building with CMake

CMake is a build tool that (among other things) produces a Makefile we can run to build our code. We will also use it for dependency management. But fundamentally we use it because gtest requires it.

CMake options/rules are defined in a CMakeLists.txt file. We'll have one in the root directory, one in the test directory, and a template for one that will handle the gtest dependency.

A first draft of the top-level CMakeLists.txt might look like this:

cmake_minimum_required(VERSION 3.1)

project(testy)

##
### Source definitions ###
##

include_directories("${PROJECT_SOURCE_DIR}/include")

file(GLOB sources "${PROJECT_SOURCE_DIR}/src/*.c")

add_executable(testy ${sources})

Using include_directories makes sure we compile with the -I flag set up correctly for our include directory.

Using add_executable defines the binary to produce from the given sources. And we're using the file helper to get a glob of C files rather than listing them all out verbatim in the add_executable call.

Building and running

CMake pollutes the directory it runs in, but it is perfectly happy to run from a different directory, so we'll make a build/ directory to avoid polluting the project root. Then we'll generate a Makefile with CMake, run Make, and run our program.

$ mkdir build
$ cd build
$ cmake ..
-- The C compiler identification is AppleClang 10.0.1.10010046
-- The CXX compiler identification is AppleClang 10.0.1.10010046
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/philipeaton/tmp/testy/build
$ make
[ 25%] Building C object CMakeFiles/testy.dir/src/customer.c.o
[ 50%] Building C object CMakeFiles/testy.dir/src/widget.c.o
[ 75%] Building C object CMakeFiles/testy.dir/src/main.c.o
[100%] Linking C executable testy
[100%] Built target testy
$ ./testy
$ echo $?
1

CMakeLists.txt.in

This template file handles downloading the gtest dependency from GitHub, pinned to a release. It will be copied into a subdirectory during the cmake .. step.

cmake_minimum_required(VERSION 3.1)

project(googletest-download NONE)

include(ExternalProject)
ExternalProject_Add(googletest
  GIT_REPOSITORY    https://github.com/google/googletest.git
  GIT_TAG           release-1.8.1
  SOURCE_DIR        "${CMAKE_BINARY_DIR}/googletest-src"
  BINARY_DIR        "${CMAKE_BINARY_DIR}/googletest-build"
  CONFIGURE_COMMAND ""
  BUILD_COMMAND     ""
  INSTALL_COMMAND   ""
  TEST_COMMAND      ""
)

Now we can tell CMake about this template, and how to build gtest, in the top-level CMakeLists.txt file.

cmake_minimum_required(VERSION 3.1)

project(testy)

##
### Test definitions ###
##

configure_file(CMakeLists.txt.in
        googletest-download/CMakeLists.txt)
execute_process(COMMAND ${CMAKE_COMMAND} -G "${CMAKE_GENERATOR}" .
        WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/googletest-download )
execute_process(COMMAND ${CMAKE_COMMAND} --build .
        WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/googletest-download )

add_subdirectory(${CMAKE_BINARY_DIR}/googletest-src
        ${CMAKE_BINARY_DIR}/googletest-build)

enable_testing()
add_subdirectory(test)

##
### Source definitions ###
##

include_directories("${PROJECT_SOURCE_DIR}/include")

file(GLOB sources
  "${PROJECT_SOURCE_DIR}/include/testy/*.h"
  "${PROJECT_SOURCE_DIR}/src/*.c")

add_executable(testy ${sources})

The add_subdirectory calls register directories that contain their own CMakeLists.txt. The build would now fail without a CMakeLists.txt file in the test/ directory.

test/CMakeLists.txt

This final file registers one test executable per test/*.cpp file, each compiled against the library sources and the project header files.

include_directories("${PROJECT_SOURCE_DIR}/include")

file(GLOB sources "${PROJECT_SOURCE_DIR}/src/*.c")
list(REMOVE_ITEM sources "${PROJECT_SOURCE_DIR}/src/main.c")

file(GLOB tests "${PROJECT_SOURCE_DIR}/test/*.cpp")
list(REMOVE_ITEM tests "${PROJECT_SOURCE_DIR}/test/main.cpp")

foreach(file ${tests})
  set(name)
  get_filename_component(name ${file} NAME_WE)
  add_executable("${name}_tests"
    ${sources}
    ${file}
    "${PROJECT_SOURCE_DIR}/test/main.cpp")
  target_link_libraries("${name}_tests" gtest_main)
  add_test(NAME ${name} COMMAND "${name}_tests")
endforeach()

We have to register a test for each file, otherwise ctest won't display each file's results by default (i.e. without a --verbose flag).

Building and running tests

Similar to building and running the source, we run CMake from the build/ directory, then run make test or ctest after building all sources and tests with make.

$ cd build
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/philipeaton/tmp/testy/build/googletest-download
Scanning dependencies of target googletest
[ 11%] Creating directories for 'googletest'
[ 22%] Performing download step (git clone) for 'googletest'
Cloning into 'googletest-src'...
Note: checking out 'release-1.8.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 2fe3bd99 Merge pull request #1433 from dsacre/fix-clang-warnings
[ 33%] No patch step for 'googletest'
[ 44%] Performing update step for 'googletest'
[ 55%] No configure step for 'googletest'
[ 66%] No build step for 'googletest'
[ 77%] No install step for 'googletest'
[ 88%] No test step for 'googletest'
[100%] Completed 'googletest'
[100%] Built target googletest
-- Found PythonInterp: /usr/local/bin/python (found version "2.7.16")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/philipeaton/tmp/testy/build
$ make
[  4%] Building C object CMakeFiles/testy.dir/src/customer.c.o
[  9%] Building C object CMakeFiles/testy.dir/src/widget.c.o
[ 13%] Building C object CMakeFiles/testy.dir/src/main.c.o
[ 18%] Linking C executable testy
[ 18%] Built target testy
[ 22%] Building CXX object googletest-build/googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[ 27%] Linking CXX static library libgtest.a
[ 27%] Built target gtest
[ 31%] Building CXX object googletest-build/googlemock/CMakeFiles/gmock.dir/src/gmock-all.cc.o
[ 36%] Linking CXX static library libgmock.a
[ 36%] Built target gmock
[ 40%] Building CXX object googletest-build/googlemock/CMakeFiles/gmock_main.dir/src/gmock_main.cc.o
[ 45%] Linking CXX static library libgmock_main.a
[ 45%] Built target gmock_main
[ 50%] Building CXX object googletest-build/googlemock/gtest/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
[ 54%] Linking CXX static library libgtest_main.a
[ 54%] Built target gtest_main
[ 59%] Building C object test/CMakeFiles/customer_tests.dir/__/src/customer.c.o
[ 63%] Building C object test/CMakeFiles/customer_tests.dir/__/src/widget.c.o
[ 68%] Building CXX object test/CMakeFiles/customer_tests.dir/customer.cpp.o
[ 72%] Building CXX object test/CMakeFiles/customer_tests.dir/main.cpp.o
[ 77%] Linking CXX executable customer_tests
[ 77%] Built target customer_tests
Scanning dependencies of target widget_tests
[ 81%] Building C object test/CMakeFiles/widget_tests.dir/__/src/customer.c.o
[ 86%] Building C object test/CMakeFiles/widget_tests.dir/__/src/widget.c.o
[ 90%] Building CXX object test/CMakeFiles/widget_tests.dir/widget.cpp.o
[ 95%] Building CXX object test/CMakeFiles/widget_tests.dir/main.cpp.o
[100%] Linking CXX executable widget_tests
[100%] Built target widget_tests

After running cmake and make, we're finally ready to run ctest.

$ ctest
Test project /Users/philipeaton/tmp/testy/build
    Start 1: customer
1/2 Test #1: customer ..........................   Passed    0.01 sec
    Start 2: widget
2/2 Test #2: widget ............................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   0.01 sec

Now we're in a good place: most of the challenges of unit testing C code (ignoring mocks, that is) are behind us.

August 08, 2019

Our focus on Speed

There is no good way to uncover new insights buried in terabytes of data unless you make the process of working with that data tremendously rewarding and fast.

July 24, 2019

Cuckoo Filters with arbitrarily sized tables

Cuckoo filters are an interesting alternative to Bloom filters. Instead of maintaining a filter bitmap, they maintain a small (cuckoo-)hash table of key signatures, which has several nice properties. For example, the table stores just the signature of a key instead of the key itself, yet the filter is still able to move an element to a different position in the case of conflicts.

This conflict resolution mechanism is quite interesting: just like in regular cuckoo hash tables, each element has two potential positions where it can be placed, a primary position i1 and a secondary position i2. These can be computed as follows:

i1 = hash(x)
i2 = i1 xor hash(signature(x))

Remember that the cuckoo filter stores only the (small) signature(x), not x itself. Thus, when we encounter a value, we cannot know whether it is at its position i1 or its position i2. However, we can nevertheless alternate between the positions, because the following holds:

i1 = i2 xor hash(signature(x))

and we have the signature stored in the table. Thus, we can just apply the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of which of the two we are currently at, which is a neat little trick. This ability to switch between positions is exactly what the cuckoo filter's conflict resolution logic uses.
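
As a small C sketch of this trick (an illustration, not code from the original paper; hash32 is a placeholder hash function, and the table has 2^n buckets, so mask = 2^n - 1):

#include <stdint.h>

/* Placeholder hash; any decent 32-bit hash will do. */
static uint32_t hash32(uint32_t v) { return v * 2654435761u; }

/* Move between the two candidate buckets of a stored signature in a
   table of 2^n buckets (mask == 2^n - 1). Since xor is self-inverse,
   applying this twice returns the original index. */
static uint32_t alt_pos(uint32_t i, uint32_t sig, uint32_t mask) {
    return (i ^ hash32(sig)) & mask;
}

The final & mask does not break the trick, because and distributes over xor, so alt_pos(alt_pos(i, sig, mask), sig, mask) == i still holds.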

However, all of this holds only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of two, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. Thus cuckoo filter tables always had to be powers of two, even if that wasted a lot of memory.

In more recent work, Lang et al. proposed using cuckoo filters with size C, where C does not have to be a power of two, offering much better space utilization. They achieved this by using a different self-inverse function:

i1 = hash(x) mod C
i2 = -(i1 + hash(signature(x))) mod C

Note that the modulo computation can be made reasonably efficient by using magic numbers, which can be precomputed when allocating the filter.

A slightly different way to formulate this is to introduce a switch function f, which switches between positions:

f(i,sig,C) = -(i + hash(sig)) mod C
i1 = hash(x) mod C
i2 = f(i1, signature(x), C)
i1 = f(i2, signature(x), C)

All this works because f is self-inverse, i.e.,

i = f(f(i, signature(x), C), signature(x), C)

for all C>0, i between 0 and C-1, and signature(x)>0.

The only problem is: is this true? In a purely mathematical sense it is, as you can convince yourself by expanding the formula, but cuckoo filters are not executed on abstract machines, they run on real CPUs. And there something unpleasant happens: we can get numerical overflows of our integer registers, which implicitly introduces a modulo 2^32 into our computation. That breaks the self-inverseness of f in some cases, unless C is a power of two itself.
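
To make the overflow concrete, here is a tiny sketch (with arbitrary example values, not taken from the paper) comparing the naive unsigned evaluation of -(i + hash(signature(x))) mod C with proper modular arithmetic:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t i = 3, h = 7; /* h stands for hash(signature(x)) */

    /* Non-power-of-two table size: the unsigned negation wraps modulo
       2^32 before the mod C is applied, giving the wrong bucket. */
    uint32_t C = 1000;
    printf("C=1000: naive=%u proper=%u\n",
           (0u - (i + h)) % C,          /* prints 286 */
           (C - ((i + h) % C)) % C);    /* prints 990 */

    /* Power-of-two table size: 2^32 mod C == 0, so the wrap-around
       is harmless and both computations agree. */
    uint32_t P = 1024;
    printf("C=1024: naive=%u proper=%u\n",
           (0u - (i + h)) % P,
           (P - ((i + h) % P)) % P);    /* both print 1014 */
    return 0;
}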

Andreas Kipf noticed this problem when using the cuckoo filters with real-world data, which teaches us not to trust formulas without additional extensive empirical validation... Fortunately, we can repair the function f by using proper modular arithmetic. In pseudo-code it looks like this:

f(i,sig,C)
    x=(C-1)-(hash(sig) mod C)
    if (x>=i)
       return (x-i);
    // The natural formula would be C-(i-x), but we prefer this one...
    return C+(x-i);

This computes the correct wrap-around modulo C, at the cost of one additional if. We can avoid the if by using predication, as shown below,

f(i,sig,C)
    x=(C-1)-(hash(sig) mod C)
    m = (x<i)*(~0)
    return (m&C)+(x-i);

which can be attractive for SSE implementations, where the comparison produces a bit mask anyway.

We have validated that this new f function is indeed self-inverse for all possible values of i, sig, and C. And we did this not just by looking at the formula, but by trying out all values programmatically, which is a good way to gain confidence in your approach: there is only a finite number of combinations, and we can test them all.
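
A sketch of such an exhaustive check, restricted here to small table sizes so it runs in a blink (hs stands for hash(sig) mod C, so looping hs over [0, C) covers every possible reduced hash value):

#include <assert.h>
#include <stdint.h>

/* The repaired switch function from above, with the hash already
   reduced: hs == hash(sig) mod C. */
static uint32_t f_branch(uint32_t i, uint32_t hs, uint32_t C) {
    uint32_t x = (C - 1) - hs;
    if (x >= i)
        return x - i;
    return C + (x - i);
}

/* The predicated variant from above. */
static uint32_t f_pred(uint32_t i, uint32_t hs, uint32_t C) {
    uint32_t x = (C - 1) - hs;
    uint32_t m = (x < i) ? ~0u : 0u;
    return (m & C) + (x - i);
}

int main(void) {
    for (uint32_t C = 1; C <= 256; C++)
        for (uint32_t hs = 0; hs < C; hs++)
            for (uint32_t i = 0; i < C; i++) {
                assert(f_branch(f_branch(i, hs, C), hs, C) == i);
                assert(f_pred(i, hs, C) == f_branch(i, hs, C));
            }
    return 0;
}
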
With this small fix, we can now enjoy Cuckoo Filters with arbitrarily sized tables.

Edit: The original post did not mirror the hash space correctly (using C-... instead of (C-1)-...); thanks to Andreas Kipf for pointing this out.