June 11, 2024

CedarDB Blog

Why Your SSD (Probably) Sucks and What Your Database Can Do About It

Database system developers have a complicated relationship with storage devices: They can store terabytes of data cheaply, and everything is still there after a system crash. On the other hand, storage can be a spoilsport by being slow when it matters most.

This blog post shows how SSDs are used in database systems, where SSDs have limitations, and how to get around them.

When are SSDs fast?

When we colloquially talk about speed, we usually think in terms of throughput, i.e., how much data we can store or retrieve per second. Let’s use the fantastic bench-fio tool to measure how a consumer-grade Crucial T700 SSD used in one of our build servers performs for random reads:

June 10, 2024

Tinybird Engineering Blog

Tinybird is now available in AWS us-west-2

We’re excited to announce Tinybird availability in AWS us-west-2, which brings lower latency to customers deploying their applications on the US West Coast.

June 06, 2024

Tinybird Engineering Blog

Building real-time leaderboards with Tinybird

Leaderboards aren’t just for games. In this post, you’ll learn how and why leaderboards can help you drive engagement in your app and how to build your first one quickly with Tinybird.

Database Architects

B-trees Require Fewer Comparisons Than Balanced Binary Search Trees

Due to better access locality, B-trees are faster than binary search trees in practice -- but are they also better in theory? To answer this question, let's look at the number of comparisons required for a search operation. Assuming we store n elements in a binary search tree, the lower bound for the number of comparisons is log₂ n in the worst case. However, this is only achievable for a perfectly balanced tree. Maintaining such a tree's perfect balance during insert/delete operations requires O(n) time in the worst case.

Balanced binary search trees, therefore, leave some slack in terms of how balanced they are and have slightly worse bounds. For example, it is well known that an AVL tree guarantees at most 1.44 log₂ n comparisons, and a Red-Black tree guarantees 2 log₂ n comparisons. In other words, AVL trees require at most 1.44 times the minimum number of comparisons, and Red-Black trees require up to twice the minimum.

How many comparisons does a B-tree need? In B-trees with degree k, each node (except the root) has between k and 2k children. For k=2, a B-tree is essentially the same data structure as a Red-Black tree and therefore provides the same guarantee of 2 log₂ n comparisons. So how about larger, more realistic values of k?

To analyze the general case, we start with a B-tree that has the highest possible height for n elements. The height is maximal when each node has only k children (for simplicity, this analysis ignores the special case of underfull root nodes). This implies that the worst-case height of a B-tree is log_k n. During a lookup, one has to perform a binary search that takes log₂ k comparisons in each of the log_k n nodes. So in total, we have log₂ k * log_k n = log₂ n comparisons.

This count actually matches the log₂ n lower bound, so to construct the worst case, we have to modify the tree somewhat. On one (and only one) arbitrary path from the root to a single leaf node, we increase the number of children from k to 2k. In this situation, the tree height is still less than or equal to log_k n, but we now have one worst-case path where each node needs log₂ 2k (instead of log₂ k) comparisons. On this worst-case path, we have log₂ 2k * log_k n = (log₂ 2k) / (log₂ k) * log₂ n comparisons.
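The two products above both follow from the change-of-base identity. The derivation below is only a restatement of the argument in the text, under the same simplifying assumptions (integer logarithms, no underfull root); the closed form 1 + 1/log₂ k in the last step is an added rearrangement, not something stated in the original post.

```latex
\begin{align*}
\log_k n &= \frac{\log_2 n}{\log_2 k}
  && \text{(change of base)} \\
\log_2 k \cdot \log_k n &= \log_2 k \cdot \frac{\log_2 n}{\log_2 k} = \log_2 n
  && \text{(every node has $k$ children)} \\
\log_2 (2k) \cdot \log_k n &= \frac{\log_2 (2k)}{\log_2 k} \, \log_2 n
  = \left(1 + \frac{1}{\log_2 k}\right) \log_2 n
  && \text{(one path with $2k$ children per node)}
\end{align*}
```

Since log₂ (2k) = 1 + log₂ k, the overhead over the lower bound is exactly 1/log₂ k, which shrinks as k grows.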

Using this formula, we get the following bounds:
k=2: 2 log₂ n
k=4: 1.5 log₂ n
k=8: 1.33 log₂ n
k=16: 1.25 log₂ n
...
k=512: 1.11 log₂ n
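
The listed bounds can be reproduced with a few lines of Python. This is a minimal sketch of the formula (log₂ 2k) / (log₂ k) from the text; the function name `worst_case_factor` is made up for illustration.

```python
import math


def worst_case_factor(k: int) -> float:
    """Return the multiple of log2(n) comparisons needed on the worst-case path.

    Implements (log2 2k) / (log2 k) from the analysis above.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    return math.log2(2 * k) / math.log2(k)


if __name__ == "__main__":
    # Same values of k as in the list above.
    for k in (2, 4, 8, 16, 512):
        print(f"k={k}: {worst_case_factor(k):.2f} log2 n")
```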

We see that as k grows, B-trees get closer to the lower bound. For k>=8, B-trees are guaranteed to perform fewer comparisons than AVL trees in the worst case. As k increases, B-trees become more balanced. One intuition for this result is that for larger k values, B-trees become increasingly similar to sorted arrays which achieve the log₂ n lower bound. Practical B-trees often use fairly large values of k (e.g., 100) and therefore offer tight bounds -- in addition to being more cache-friendly than binary search trees.
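
The comparison with AVL trees can be checked the same way. The sketch below assumes the AVL constant 1.44 quoted earlier and, in line with the caveat that log₂ 2k is an integer, only considers powers of two for k. The helper names are hypothetical, and applying the formula to k=100 ignores that integrality assumption, so that value is only indicative.

```python
import math

AVL_FACTOR = 1.44  # worst-case constant for AVL trees, as quoted in the text


def worst_case_factor(k: int) -> float:
    """(log2 2k) / (log2 k): worst-case comparisons as a multiple of log2(n)."""
    return math.log2(2 * k) / math.log2(k)


def smallest_k_below(bound: float, candidates) -> "int | None":
    """Return the first candidate k whose worst-case factor is below `bound`."""
    for k in candidates:
        if worst_case_factor(k) < bound:
            return k
    return None


if __name__ == "__main__":
    powers_of_two = [2 ** i for i in range(1, 11)]  # 2, 4, ..., 1024
    print("smallest power of two beating AVL:",
          smallest_k_below(AVL_FACTOR, powers_of_two))
    # Formula applied as-is to a typical practical fan-out (indicative only).
    print(f"k=100: about {worst_case_factor(100):.2f} log2 n")
```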

(Caveat: For simplicity, the analysis assumes that log₂ n and log₂ 2k are integers, and that the root has either k or 2k entries. Nevertheless, the observation that larger k values lead to tighter bounds should hold in general.)

by Viktor Leis (noreply@blogger.com)

Alex Miller

SIGMOD Programming Contest Archive: Sorting (2019)

June 05, 2024

Tinybird Engineering Blog

Tinybird has joined the AWS ISV Accelerate Program

We’re excited to be able to tap into AWS’s infrastructure and tools to deliver great real-time analytics solutions for our joint customers.

CedarDB Blog

Simple, Efficient, and Robust Hash Tables for Join Processing

Hash tables are probably the most versatile data structures for data processing. For that reason, CedarDB depends on hash tables to perform some of the most crucial parts of its query execution engine. Most prominently, CedarDB implements relational joins as hash joins. This blog post assumes you know what a hash join is. If not, the Wikipedia article has a short introduction to the topic for you. During the development of Umbra and now CedarDB, we rewrote our join hash table implementation several times. To share our latest design, TUM and CedarDB published a peer-reviewed scientific paper, which Altan will present at DaMoN'24 in Santiago de Chile next week.
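
For readers who want a quick reminder of what a hash join does, here is a minimal textbook sketch in Python. It is not CedarDB's or Umbra's design: it uses Python's built-in dictionary as the hash table, and the function name, row format, and key parameters are all assumptions made for illustration.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict rows with a build phase and a probe phase."""
    # Build phase: index the (ideally smaller) input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each row of the other input and emit combined rows.
    result = []
    for row in probe_rows:
        # .get() avoids creating empty buckets for keys without a match.
        for match in table.get(row[probe_key], ()):
            result.append({**match, **row})
    return result


if __name__ == "__main__":
    customers = [{"customer_id": 1, "name": "Ada"},
                 {"customer_id": 2, "name": "Bob"}]
    orders = [{"order_id": 10, "customer_id": 1},
              {"order_id": 11, "customer_id": 1},
              {"order_id": 12, "customer_id": 3}]
    for joined in hash_join(customers, orders, "customer_id", "customer_id"):
        print(joined)
```

Both phases touch each input row once, so the join runs in roughly linear time in the input sizes plus the number of result rows, assuming hash lookups take constant time on average.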

Alex Miller

SIGMOD Programming Contest Archive: Join Processing (2018)

Alex Miller

Building BerkeleyDB: Introduction

Welcome to the B-Tree tutorial.

June 04, 2024

Alex Miller

SIGMOD Programming Contest Archive: Streaming N-Gram Filter (2017)

June 03, 2024

Tinybird Engineering Blog

June 02, 2024

Alex Miller
