a curated list of database news from authoritative sources

December 27, 2024

Advent of Code 2024 in pure SQL

On a whim I decided to do this year's Advent of Code in pure SQL. That was an interesting experience that I can recommend to everybody, because it forces you to think differently about the problems. And I can report that it was possible to solve every problem in pure SQL.

In many cases SQL was actually surprisingly pleasant to use. The full solution for day 10 (including the puzzle input) is shown below:


 1  with recursive aoc10_input(i) as (select '
 2  89010123
 3  78121874
 4  87430965
 5  96549874
 6  45678903
 7  32019012
 8  01329801
 9  10456732
10  '),
11  lines(y,line) as (
12     select 0, substr(i,1,position(E'\n' in i)-1), substr(i,position(E'\n' in i)+1)
13     from aoc10_input
14     union all
15     select y+1,substr(r,1,position(E'\n' in r)-1), substr(r,position(E'\n' in r)+1)
16     from lines l(y,l,r) where position(E'\n' in r)>0
17  ),
18  field(x,y,v) as (
19     select x,y,ascii(substr(line,x::integer,1))-48
20     from (select * from lines l where line<>'') s, lateral generate_series(1,length(line)) g(x)
21  ),
22  paths(x,y,v,sx,sy) as (
23     select x,y,9,x,y from field where v = 9
24     union all
25     select f.x,f.y,f.v,p.sx,p.sy
26     from field f, paths p
27     where f.v=p.v-1 and ((f.x=p.x and abs(f.y-p.y)=1) or (f.y=p.y and abs(f.x-p.x)=1)) and p.v>0),
28  results as (select * from paths where v=0),
29  part1 as (select distinct * from results)
30  select (select count(*) from part1)  as part1, (select count(*) from results) as part2

Parsing the input is a bit painful in SQL, but it is not too bad. Lines 1-10 are simply the puzzle input, lines 11-17 split the input into individual lines, and lines 18-21 construct a 2D array from the input. The algorithm itself is pretty short: lines 22-27 perform a recursive traversal of the field, and lines 28-30 extract the puzzle answer from the traversal results. For this kind of small-scale traversal SQL works just fine.

Other days were more painful. Day 16, for example, conceptually does a very similar traversal of a field, computing the minimal traversal distance for each visited cell. Expressing that in SQL is easy, but evaluation is wasteful. When replacing the reference input with a real puzzle input the field is quite large, and the recursive query generates and preserves a lot of state, even though we only care about the last iteration of the recursive query. As a consequence you need a machine with over 200GB of memory to execute that query, even though most of the computed tuples are irrelevant. We could fix that excessive memory consumption by using iteration semantics during recursion, but that is not widely supported by DBMSes. Umbra can do it, but Postgres and DuckDB cannot, thus I have not used it in my solutions.

And sometimes the programming model of recursive SQL clashes with what we want to do. On day 23 we had to find the maximum clique in a sparse graph. This can be computed reasonably well with the Bron-Kerbosch algorithm, but expressing that in recursive SQL is quite convoluted because the algorithm wants to maintain multiple sets, while recursive SQL only passes a single set along. It can be done, but the result does not look pretty.
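
For reference, here is a minimal (non-SQL) sketch of Bron-Kerbosch in Python. The three sets R, P, and X that the recursion has to carry along are exactly the state that is awkward to thread through a recursive CTE.

def bron_kerbosch(R, P, X, adj, cliques):
    """Collect all maximal cliques; adj maps a vertex to its neighbor set."""
    if not P and not X:
        cliques.append(R)          # R is a maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P = P - {v}                # v has been fully explored
        X = X | {v}

# Usage: adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
# cliques = []; bron_kerbosch(set(), set(adj), set(), adj, cliques)
# max(cliques, key=len) -> {0, 1, 2}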

This experiment has shown me two things: 1) it is possible to code quite complex algorithms in SQL, and often the SQL code is surprisingly pleasant, and 2) recursive SQL would be much more efficient and more pleasant to use if we had mechanisms to update state. There is ongoing work on supporting more complex control flow in recursion via a trampoline mechanism, which is very useful, too, but we should definitely look into more complex state manipulation mechanisms. With just a bit of extra functionality SQL would be quite a solid choice for running complex algorithms directly inside a database.

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for making RocksDB better. I am wary about evaluating such claims because that takes a lot of time, and when the claim is bogus I am reluctant to blog about it because I don't want to punch down on a startup.

Here I share results from the RocksDB benchmark scripts to compare Speedb and RocksDB and I am happy to claim that Speedb does some things better than RocksDB.

tl;dr

  • RocksDB and Speedb have similar average throughput for ...
    • a cached database
    • an IO-bound database when using O_DIRECT
  • RocksDB 8.6+ is slower than Speedb for write-heavy workloads with an IO-bound database when O_DIRECT isn't used. 
    • This problem arrived in RocksDB 8.6 (see issue 12038) which introduces the use of the readahead system call to prefetch data that compaction will soon read. I am not sure what was used prior. The regression that arrived in 8.6 was partially fixed in release 9.9. 
  • In general, Speedb has better QoS (less throughput variance) for write-heavy workloads
Updates
  • Update 1 - with a minor hack I can remove about 1/3 of the regression between RocksDB 8.5 and 9.9 in the overwrite benchmark for the iobuf workload

On issue 12038

RocksDB 8.6 switched to using the readahead system call to prefetch SSTs that will soon be read by compaction. The goal is to reduce the time that compaction threads must wait for data. But what I see with iostat is that a readahead call is ignored when the value of the count argument is larger than max_sectors_kb for the storage device. And this happens on one or both of ext4 and xfs. I am not a kernel guru and I have yet to read this nice writeup of readahead internals. I do read this great note from Jens Axboe every few years.

I opened issue 12038 for this issue and it was fixed in RocksDB 9.9 by adding code that reduces the value of compaction_readahead_size to be <= the value of max_sectors_kb for the database's storage device. However the fix in 9.9 doesn't restore the performance that existed prior to the change (see 8.5 results). I assume the real fix is to have code in RocksDB to do the prefetches rather than rely on the readahead system call.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

The values of max_sectors_kb and max_hw_sectors_kb are 128 (KB) for both the SW RAID device (md2) and the underlying storage devices (nvme0n1, nvme1n1).

Builds

I compiled db_bench from source. I used RocksDB versions 7.3.2, 7.10.2, 8.5.4, 8.6.7, 9.7.4 and 9.9.3. 

For Speedb I used the latest diff in their github repo which appears to be based on RocksDB 8.6.7, although I am confused that it doesn't suffer from issue 12038 which is in RocksDB 8.6.7.

commit 8d850b666cce6f39fbd4064e80b85f9690eaf385 (HEAD -> main, origin/main, origin/HEAD)
Author: udi-speedb <106253580+udi-speedb@users.noreply.github.com>
Date:   Mon Mar 11 14:00:03 2024 +0200

    Support Speedb's Paired Bloom Filter in db_bloom_filter_test (#810)

Benchmark

All tests used 2MB for compaction_readahead_size and the hyper clock block cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: 
    bash x3.sh 40 no 1800 c48r128 100000000 2000000000 iobuf iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible
For configuration options that Speedb and RocksDB have in common I set those options to the same values. I didn't experiment with Speedb-only options except for use_dynamic_delay.

Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

Spreadsheets with charts are here and here. Performance summaries are here for byrx, iobuf and iodir.

Results: average throughput

These charts plot relative QPS by test where relative QPS is (QPS for a given version / QPS for speedb.udd0). When this value is less than 1.0 the given version is slower than speedb.udd0; when it is greater than 1.0 the given version is faster than speedb.udd0. The base case is speedb.udd0, which is Speedb with use_dynamic_delay=0. The versions listed in the charts are:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.7.3 - RocksDB 7.3.2
  • rocksdb.7.10 - RocksDB 7.10.2
  • rocksdb.8.5 - RocksDB 8.5.4
  • rocksdb.8.6 - RocksDB 8.6.7
  • rocksdb.9.7 - RocksDB 9.7.4
  • rocksdb.9.9 - RocksDB 9.9.3
For a cached workload (byrx):
  • RocksDB is faster at fillseq (load in key order), otherwise modern RocksDB and Speedb have similar average throughput
  • RocksDB 7.3 was much slower on the read while writing tests (revrangeww, fwdrangeww and readww) but that was fixed by 7.10 and might have been related to issue 9423 or it might be from improvements to the hyper clock cache.
For an IO-bound workload that doesn't use O_DIRECT (iobuf)
  • Modern RocksDB is faster than Speedb at fillseq, has similar perf on the read while writing tests and is slower on overwrite (write-only with keys in random order). The difference in overwrite perf is probably from issue 12038 which arrives in RocksDB 8.6 and then has a fix in 9.9.
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed
For an IO-bound workload that uses O_DIRECT (iodir)
  • Modern RocksDB is much faster than Speedb at fillseq and then has similar perf on the other tests
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed.

Results: throughput over time

The previous section shows average throughput. And while more average QPS is nice, if it comes with more variance then it is less than great. The charts in this section show QPS at 1-second intervals for fillseq and overwrite for two releases:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.9.9.3 - RocksDB 9.9.3

fillseq (load in key order) for byrx, iobuf and iodir
  • For the cached workload the better perf from RocksDB is obvious
  • For the IO-bound workloads the results are closer
  • Variance is less with Speedb than with RocksDB based on the thickness of the lines
With overwrite for a cached workload (byrx) there are two charts - one from the entire run and one from the last 300 seconds.
  • Using the eyeball method (much hand waving) the variance is similar for RocksDB and Speedb. Both suffer from write stalls.
  • Response time percentiles (see p50, p99, p99.9, p99.99 and pmax here) show that pmax is <= 57 milliseconds for everything but RocksDB 7.3. Note the numbers in the gist are in usecs.
With overwrite for an IO-bound workload (iobuf) without O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Speedb has slightly better average throughput
  • Speedb has much better QoS (less variance). That is obvious based on the thickness of the red line vs the blue line. RocksDB also has more write stalls based on the number of times the blue lines drop to near zero.
  • Using the response time percentiles here the differences between Speedb and RocksDB are less obvious.
  • The percentage of time that writes are stalled is <= 9% for Speedb and >= 20% for modern RocksDB (see stall% here). This is a good way to measure QoS.
With overwrite for an IO-bound workload (iodir) with O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Average throughput is slightly better for RocksDB than for Speedb
  • QoS is slightly better for Speedb than for RocksDB (see the blue lines dropping to zero)
  • RocksDB does much better here with O_DIRECT than it does above without O_DIRECT
  • Speedb still does better than RocksDB at avoiding write stalls. See stall% here.
Sources of write stalls

The output from db_bench includes a summary of the sources of write stalls. However, that summary just shows the number of times each was invoked without telling you how much each contributes to the stall% (the total percentage of time that write stalls are in effect).

For the IO-bound workloads (iobuf and iodir) the number of write stalls is much lower with Speedb. It appears to be more clever about managing write throughput.

The summary for iobuf with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
1429      285   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1429      285   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
   0        0   memtable-limit-delays
   0        0   memtable-limit-stops
1131    12585   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
2560    12870   total-delays
   0        0   total-stops

The summary for iodir with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
   5      287   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
   5      287   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
  22        0   memtable-limit-delays
   0        0   memtable-limit-stops
   0      687   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
  27      974   total-delays
   0        0   total-stops

Compaction efficiency

The final sample of compaction IO statistics at the end of the overwrite test is here for iobuf and iodir for speedb.udd1 and RocksDB 9.9.3.

For iobuf the wall clock time for which compaction threads are busy (the Comp(sec) column) is about 1.10X larger for RocksDB than Speedb. This is likely because Speedb is doing larger reads from storage (see the next section) so RocksDB has more IO waits. But the CPU time for compaction (the CompMergeCPU(sec) column) is about 1.12X larger for Speedb (I am not sure why).

For iodir the values in the Comp(sec) and CompMergeCPU(sec) columns are not as different across Speedb and RocksDB as they were above for iobuf.

Understanding IO via iostat

The numbers below are the average values from iostat collected during overwrite.

For iobuf the average read size (rareqsz) drops to 4.7KB in RocksDB 8.6 courtesy of issue 12038 while it was >= 50KB prior to RocksDB 8.6. The value improves to 34.2KB in RocksDB 9.9.3 but it is still much less than what it used to be in RocksDB 8.5.

For iodir the average read size (rareqsz) is ~112KB in all cases.

The RocksDB compaction threads read from the compaction input a RocksDB block at a time. For these tests I use an 8kb RocksDB block, but in the IO-bound tests the blocks are compressed for the larger levels of the LSM tree, and a compressed block is ~4kb. Thus some kind of prefetching that does large read requests is needed to improve read IO efficiency.

Legend:
  • rps - reads/s
  • rMBps - read MB/s
  • rawait - read wait latency in milliseconds
  • rareqsz - read request size in KB
  • wps - writes/s
  • wMBps - write MB/s
  • wawait - write wait latency in milliseconds
  • wareqsz - write request size in KB
iobuf (buffered IO)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
6336    364.4   0.43    52.9    14344   1342.0  0.40    95.5    speedb.udd0
6209    357.6   0.41    51.6    14177   1322.8  0.41    95.3    speedb.udd1
2133    164.0   0.19    26.1    8954    854.4   0.31    99.3    rocksdb.7.3
4542    333.5   0.49    71.8    14734   1361.6  0.39    94.5    rocksdb.7.10
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    rocksdb.8.5
40471   184.7   0.10    4.7     8552    784.1   0.31    93.6    rocksdb.8.6
39201   178.8   0.10    4.7     8783    801.5   0.32    93.7    rocksdb.9.7
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    rocksdb.9.9

iodir (O_DIRECT)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
12757   1415.9  0.82    112.8   16642   1532.6  0.40    93.5    speedb.udd0
12539   1392.9  0.83    112.7   16327   1507.9  0.41    93.6    speedb.udd1
8090    903.5   0.74    114.3   10036   976.6   0.30    100.5   rocksdb.7.3
12155   1346.2  0.90    112.8   15602   1462.4  0.40    95.7    rocksdb.7.10
13315   1484.5  0.85    113.6   17436   1607.1  0.40    94.0    rocksdb.8.5
12978   1444.3  0.84    112.9   16981   1563.0  0.46    93.4    rocksdb.8.6
12641   1411.9  0.84    113.9   16217   1535.6  0.43    96.7    rocksdb.9.7
12990   1450.7  0.83    113.8   16704   1576.0  0.41    96.3    rocksdb.9.9

Update 1

For the overwrite test and the iobuf (buffered IO, IO-bound) workload the QPS is 277795 for RocksDB 8.5.4 vs 227068 for 9.9.3. So 9.9.3 gets ~82% of the QPS relative to 8.5.4. Note that 9.8 only gets about 57% and the improvement from 9.8 to 9.9 is from fixing issue 12038, but that fix isn't sufficient given the gap between 8.5 and 9.9.

With a hack I can improve the QPS for 9.9 from 227068/s to 241260/s. The problem, explained to me by the RocksDB team, is that while this code adjusts compaction readahead to be no larger than max_sectors_kb, the code that requests prefetch can request for (X + Y) bytes where X is the adjusted compaction readahead amount and Y is the block size. The sum of these is likely to be larger than max_sectors_kb. So my hack was to reduce compaction readahead to be (max_sectors_kb - 8kb) given that I am using block_size=8kb.
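
As a hedged illustration of the sizing problem described above (the numbers come from this post, not from RocksDB source):

# Illustration only: the arithmetic behind the hack, using values from this post.
max_sectors_kb = 128             # /sys/block/.../queue/max_sectors_kb on this server
block_size_kb = 8                # RocksDB block_size used in these tests
configured_readahead_kb = 2048   # 2MB compaction_readahead_size

# The 9.9 fix clamps readahead to max_sectors_kb ...
clamped = min(configured_readahead_kb, max_sectors_kb)   # 128
# ... but a prefetch request can be readahead + one block, exceeding the limit:
print(clamped + block_size_kb)                           # 136 > 128 -> readahead ignored

# The hack: leave room for the extra block.
hacked = max_sectors_kb - block_size_kb                  # 120
print(hacked + block_size_kb)                            # 128 <= 128 -> readahead honored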

And with the hack the QPS for overwrite improves. The performance summary from the overwrite test is here and the stall% decreased from 25.9 to 22. Alas, the stall% was 9.0 with RocksDB 8.5.4 so there is still room for improvement.

The compaction reasons for 8.5.4, 9.9.3 without the hack (9.9orig) and 9.9.3 with the hack (9.9hack) provide a better idea of what has changed.

8.5.4   9.9orig 9.9hack
1089      285     211   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0       0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1089      285     211   l0-file-count-limit-delays
   0        0       0   l0-file-count-limit-stops
   0        0       0   memtable-limit-delays
   0        0       0   memtable-limit-stops
1207    12585   10735   pending-compaction-bytes-delays
   0        0       0   pending-compaction-bytes-stops
2296    12870   10946   total-delays
   0        0       0   total-stops

And finally, averages from iostat during the test show that 9.9.3 with the hack gets the largest average read request size (rareqsz is 56.4) but it still is slower than 8.5.4.

rps     rmbps   rawait  rareqsz wps     wmbps   wawait  wareqsz
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    8.5.4
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    9.9orig
5020    285.7   0.43    56.4    13467   1220.2  0.32    93.1    9.9hack

December 26, 2024

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases

This is part 2 of our "Use of Time in Distributed Databases" series. We talk about the use of logical clocks in databases in this post. We consider three different approaches:

  • vector clocks
  • dependency graph maintenance
  • epoch service 

In the upcoming posts we will bring physical clocks into the picture for timestamping; there are (almost) no physical clocks involved in the systems covered in part 2.



1. Vector clocks

Dynamo: Amazon's highly available key-value store (SOSP'07)

Dynamo employs sloppy quorums and hinted handoff and uses version vectors (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function. Dynamo provides eventual consistency thanks to this reconciliation function and conflict detection by version vectors.
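
As a rough illustration (not Dynamo's actual code), conflict detection with version vectors boils down to checking whether one vector dominates the other; the replica names below are made up:

def dominates(a, b):
    """True if version vector a reflects every update that b does."""
    return all(a.get(r, 0) >= c for r, c in b.items())

def conflict(a, b):
    # Neither vector dominates the other: concurrent updates that must be
    # handed to the application's reconciliation function.
    return not dominates(a, b) and not dominates(b, a)

v1 = {"replicaA": 2, "replicaB": 1}
v2 = {"replicaA": 1, "replicaB": 2}
print(conflict(v1, v2))  # True -> reconcile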

Cassandra, which provided an open-source implementation of Dynamo, decided to forgo vector clocks in favor of using physical time supplied by the client and the Last-Writer-Wins rule for updating replicas.

So, yeah, somehow the use of vector clocks in datastores fizzled out over time. Maybe the size of vector clocks to be included in messages was the headache. Or maybe the use of synchronized physical clocks offered more advantages in addition to a single scalar timestamp. Nevertheless, vector clocks may still have applications in version control systems and event logging in distributed systems. And, below we talk about two more systems that use some form of vector clocks.


ORBE (SOCC'13): Causal consistency

ORBE uses vector clocks, organized as a matrix, to represent dependencies. The vector clock has an entry per partition and data center. Physical clocks are used for generating read snapshot times, and ORBE can complete read-only transactions in one round by relying on these loosely synchronized physical clocks. A drawback with ORBE is the large size of timestamps, which the follow-up work GentleRain aimed to address.


The end of a myth: Distributed transactions can scale (VLDB'17)

NAM-DB aims to address scalability challenges in distributed transactions through innovative use of RDMA (Remote Direct Memory Access), adopting a timestamp oracle design. The timestamp oracle uses a partitionable vector clock approach to manage commit timestamps without contention. The protocol implements a software-based solution where each transaction execution thread maintains its own counter in a timestamp vector, allowing for distributed timestamp management without contention. Transactions obtain read timestamps by reading the vector and commit timestamps by incrementing their specific vector entry through efficient RDMA operations.

Let's dive into how the commit protocol achieves consistency. When committing, transactions create new timestamps by incrementing their counter, verify and lock their write-sets using RDMA operations, and update the timestamp vector upon success. This design offers several advantages: transaction threads operate independently without synchronization overhead, long-running transactions don't block others, and the system maintains monotonic timestamp progression when stored on a single memory server (though this property may not hold with partitioned storage).
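
Here is a minimal sketch of the timestamp-vector idea, assuming an in-process list instead of NAM-DB's RDMA-resident vector (class and method names are illustrative):

class TimestampVectorOracle:
    """Sketch of a partitionable timestamp vector: one entry per transaction thread."""

    def __init__(self, num_threads):
        self.vector = [0] * num_threads

    def read_timestamp(self):
        # A read timestamp is a snapshot of the whole vector.
        return list(self.vector)

    def commit_timestamp(self, thread_id):
        # Each thread bumps only its own entry, so commits don't contend.
        self.vector[thread_id] += 1
        return (thread_id, self.vector[thread_id])

oracle = TimestampVectorOracle(num_threads=4)
rts = oracle.read_timestamp()        # e.g. [0, 0, 0, 0]
cts = oracle.commit_timestamp(2)     # (2, 1): thread 2's first commit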



2. Dependency graph maintenance

Scalable Causal Consistency for Wide-Area Storage with COPS (SOSP'11)

COPS introduced a dependency tracking approach for achieving causal consistency in geo-replicated datastores. The system assigns scalar version numbers to objects and maintains causality by having clients track the versions of all objects read in their causal past. When updates are propagated between data centers, they carry their dependencies, and receiving data centers only make updates visible once all dependencies are satisfied. A key feature of COPS is its support for causally consistent read-only transactions, which provide a consistent snapshot of the system. These transactions are implemented through a two-round protocol.
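
Sketched below is the COPS-style visibility rule in hypothetical Python; wait_until, version_of, and put are illustrative stand-ins, not the real API. A replicated write becomes visible only after all of its dependencies are locally satisfied.

import time

def wait_until(condition, poll_interval=0.01):
    # Hypothetical helper: block until condition() becomes true.
    while not condition():
        time.sleep(poll_interval)

def apply_replicated_write(local_store, key, version, value, deps):
    """Make a replicated write visible only after its dependencies are satisfied."""
    # deps: (dep_key, dep_version) pairs read in the writer's causal past.
    for dep_key, dep_version in deps:
        wait_until(lambda: local_store.version_of(dep_key) >= dep_version)
    local_store.put(key, value, version)   # now safe to make the update visible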

COPS chose to perform explicit dependency tracking over using vector clocks. They justified their choice against vector clocks by citing scalability concerns, particularly the O(N) size growth with the number of nodes. They argued that in a datacenter with thousands of nodes, the metadata overhead would become prohibitive. I think they overindexed on N, the number of nodes: N doesn't grow to very large numbers in deployments, and especially not for replication. As another reason, they noted that vector clocks only provide happens-before relationships and there would still be a need for additional mechanisms like serialization points or explicit dependency checking to enforce causal consistency across the datacenter. I don't get this argument, either. I think they wanted to take a stance for explicit dependency checking rather than the implicit/wholesale causality we get from logical/vector clocks.

This explicit dependency tracking approach influenced later systems, including the EPaxos family of consensus protocols. The principle is the same: each operation carries its dependencies, the dependencies are checked at each node during replication, and the value is updated there once they are satisfied. Unfortunately, the dependency graphs can grow significantly in pathological cases, and these systems can experience significant slowdowns when dependency lists grow large. Subsequent systems like Occult and Accord/Cassandra (as we will cover in upcoming posts in this series) have shown that combining the dependency tracking approach with loosely synchronized physical clocks can help manage the complexity.


Kronos: The design and implementation of an event ordering service (Eurosys'14)

Kronos introduces a centralized event ordering service for distributed systems that tracks happens-before relationships through a dedicated API. Rather than having individual nodes maintain and propagate dependency information as in COPS, here the applications explicitly register events and define causal relationships with the Kronos service. This approach allows for cross-system dependency management and fine-grained concurrency detection, with the system binding events to a time order as late as possible. While this provides more flexibility in capturing application-specific causality compared to Logical/Vector Clocks (which automatically assume causal dependence between consecutive events on the same node), it comes with the overhead of communicating with the service and searching dependency graphs.


 

3. Epoch server

Chardonnay (OSDI'23): use of a centralized epoch service

Chardonnay is an in-memory distributed database that employs a logically-centralized (3 MultiPaxos nodes under the trenchcoat) epoch service, whose sole job  is to maintain a monotonic epoch counter. The magic of the epoch counter enters the picture for read-only transactions, but let's first cover the read-write transactions.

For read-write transactions, Chardonnay uses a two-phase approach: first running transactions in "dry run" mode to discover and pin read/write sets in memory, then executing them definitively using 2PC+2PL in-memory for speed. This approach leverages modern datacenter networking being significantly faster than disk I/O, allowing Chardonnay to achieve strictly serializable transactions efficiently by keeping relevant data in memory and avoiding deadlocks through ordered lock acquisition. In that sense, this architecture builds on ideas from deterministic databases like Calvin.

For read-only transactions, Chardonnay implements snapshot isolation within epochs (10ms intervals), enabling contention-free queries. A transaction can get a consistent snapshot as of the beginning of the current epoch ec by ensuring it observes the effects of all committed transactions that have a lower epoch. That is realized by waiting for all the transactions with an epoch e < ec to release their write locks. Hence, the snapshot read algorithm would simply work by reading the epoch ec, then reading the appropriate key versions. It is a neat trick, no?

This algorithm does not guarantee strict serializability, because a transaction T would not observe the effects of transactions in epoch ec that committed before T started. If desired, ensuring linearizability is easy at the cost of some latency: after T starts, wait for the epoch to advance once and then use the new epoch for reads. Another neat trick. Trade off latency for efficiency/throughput.
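
A hedged pseudocode sketch of the two read-only options described above; epoch_service, storage, and their methods are illustrative, not Chardonnay's actual interfaces:

def snapshot_read(keys, epoch_service, storage):
    # Snapshot isolation: read as of the start of the current epoch ec.
    ec = epoch_service.current_epoch()
    # Observe all committed transactions from earlier epochs by waiting for
    # every read-write transaction with epoch < ec to release its write locks.
    storage.wait_for_write_locks_released_before(ec)
    return {k: storage.latest_version_before(k, ec) for k in keys}

def linearizable_read(keys, epoch_service, storage):
    # Strictly serializable variant: wait for the epoch to advance once,
    # then read at the new epoch (trades latency for stronger guarantees).
    start = epoch_service.current_epoch()
    epoch_service.wait_until_epoch_greater_than(start)
    return snapshot_read(keys, epoch_service, storage)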

The system has been extended to multi-datacenter deployments through Chablis (CIDR '24), which introduces global epochs for cross-datacenter consistency while maintaining local epoch efficiency.

Picking up volleyball in NYC with Goodrec and New York Urban

I was so intimidated to go at first, but it is in fact easy and fun to start playing beginner volleyball in New York. The people are so friendly and welcoming that it has been easy to keep playing consistently every week since I started for the first time this August. It's been a great workout and a great way to make friends!

The two platforms I've used to find volleyball games are Goodrec and New York Urban. While these platforms may also offer classes and leagues, I mostly use them to play "pickup" games. Pickup games are where you show up and join (or get assigned to) a team to play for an hour or two. Easy to go on your own or with friends.

I'm not an expert! My only hope with this post is that maybe it makes trying out volleyball in New York feel a little less intimidating for you!

Goodrec

With Goodrec you have to use their mobile app. Beginner tier is called "social" on Goodrec. So browse available games until you find one at the level you want to play. You enroll in (buy a place in) sessions individually.

Sessions are between 90-120 minutes long.

They ask you not to arrive more than 10 minutes early at the gym. When you arrive you tell the gym managers (usually at a desk up front somewhere) you're there for Goodrec and which tier (in case the gym has multiple level games going on at the same time). Then you wait until the Goodrec "host" arrives and they will organize everyone into teams.

Goodrec hosts are players who volunteer to organize the games. They'll explain the rules of the game (makes Goodrec very good for beginners) and otherwise help you out.

Always say thank you to your host!

New York Urban

With New York Urban, pickup sessions are called "open play".

There is no mobile app, you just use the website to purchase a spot in a session. The sessions are longer and cheaper than Goodrec. But there is no host; players self-organize.

The options are more limited too. You play at one of four high schools on either a Friday night or on Sunday. And session slots tend to sell out much more quickly than with Goodrec.

Big City Volleyball

You can also check out Big City Volleyball but I haven't used it yet.

Volo

I haven't ever done Volo but I think I've heard it described as a "beer league", and that even some of the beginner tier sessions with Goodrec and New York Urban are more competitive.

But also, Volo is built around leagues so you have to get the timing right. Goodrec's and New York Urban's pickup games make it easy to get started playing any time of year.

Making friends

It was super awkward to go at first! I went by myself. I didn't know what I was doing. I couldn't remember, and didn't know, many rules. I didn't have court shoes or knee pads.

But the Goodrec host system is particularly great for bringing beginners in and making them feel welcome. You have a great time even if you're terrible.

The first game I went to, I tried to hang out afterward to meet people. But people either came with their SO or with their friends or by themselves so they all just left immediately or hung out in their group.

So you can't just go once and expect to make friends immediately. But if you keep going at the same place and time regularly week over week, you'll see familiar faces. Maybe half the people I play with each week are regulars. If you're friendly you'll start making friends with these people and eventually start going out to bars with them after the games.

Improving

Even if you find yourself embarrassingly bad at first, just keep going! I'm 29, 6'1, 190lbs, and from observation over the past 5 months, age, height, and weight have a very indirect relation to playing ability.

Most of the people who play are self-taught, especially at the lower tiers I've played at. But some people played for the school team in high school or college. These people are fun to play with and you can learn a lot from them.

Most people who are self-taught seem to watch YouTube videos like Coach Donny, helpful for learning how to serve, set, block, etc. Or they take "clinics" (classes) with Goodrec or other platforms. (I have no idea about these, I've never done them before.)

At first I played 2 hours a week and I was completely exhausted after the session. Over time it got easier so I started playing 2-3 sessions a week (6-9-ish hours). With practice and consistency (after about 3-4 months), I started playing Intermediate tier with Goodrec and New York Urban. And I don't think I'll play Beginner/Social at all anymore.

I still primarily play for fun and for the workout and to meet people. But it's also fun to get better!

I played with one person much better than myself in an Intermediate session one time and he mentioned he will probably stop playing Intermediate and only play High Intermediate. He mentioned you get better when you keep pushing yourself to play with better and better players. Good advice!

December 25, 2024

Seconds Since the Epoch

This is not at all news, but it comes up often enough that I think there should be a concise explanation of the problem. People, myself included, like to say that POSIX time, also known as Unix time, is the number of seconds since the Unix epoch, which was 1970-01-01 at 00:00:00.

This is not true. Or rather, it isn’t true in the sense most people think. For example, it is presently 2024-12-25 at 18:51:26 UTC. The POSIX time is 1735152686. It has been 1735152713 seconds since the POSIX epoch. The POSIX time number is twenty-seven seconds lower.

This is because POSIX time is derived in IEEE 1003.1 from Coordinated Universal Time. The standard assumes that every day is exactly 86,400 seconds long. Specifically:

The time() function returns the value of time in seconds since the Epoch.

Which is defined as:

seconds since the Epoch. A value to be interpreted as the number of seconds between a specified time and the Epoch. A Coordinated Universal Time name (specified in terms of seconds (tm_sec), minutes (tm_min), hours (tm_hour), days since January 1 of the year (tm_yday), and calendar year minus 1900 (tm_year)) is related to a time represented as seconds since the Epoch according to the expression below.

If year < 1970 or the value is negative, the relationship is undefined. If year ≥ 1970 and the value is non-negative, the value is related to a Coordinated Universal Time name according to the expression:

tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400 + (tm_year-70) * 31536000 + ((tm_year - 69) / 4) * 86400

The length of the day is not 86,400 seconds, and in fact changes over time. To keep UTC days from drifting too far from solar days, astronomers periodically declare a leap second in UTC. Consequently, every few years POSIX time jumps backwards, wreaking utter havoc. Someday it might jump forward.
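
To make the arithmetic concrete, here is a small Python transcription of the expression quoted above; it reproduces the numbers from the example at the top of this post.

def posix_time(tm_sec, tm_min, tm_hour, tm_yday, tm_year):
    # Direct transcription of the POSIX expression quoted above.
    # tm_yday is days since January 1 (0-based), tm_year is calendar year - 1900.
    return (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
            + (tm_year - 70) * 31536000 + ((tm_year - 69) // 4) * 86400)

# 2024-12-25 18:51:26 UTC: day 359 (0-based) of a leap year, year 2024 -> tm_year 124
print(posix_time(26, 51, 18, 359, 124))       # 1735152686
# Actual elapsed seconds since the epoch include the 27 leap seconds as well:
print(posix_time(26, 51, 18, 359, 124) + 27)  # 1735152713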

Archaeology

Appendix B of IEEE 1003 has a fascinating discussion of leap seconds:

The concept of leap seconds is added for precision; at the time this standard was published, 14 leap seconds had been added since January 1, 1970. These 14 seconds are ignored to provide an easy and compatible method of computing time differences.

I, too, love to ignore things to make my life easy. The standard authors knew “seconds since the epoch” were not, in fact, seconds since the epoch. And they admit as much:

Most systems’ notion of “time” is that of a continuously-increasing value, so this value should increase even during leap seconds. However, not only do most systems not keep track of leap seconds, but most systems are probably not synchronized to any standard time reference. Therefore, it is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch.

It is sufficient to require that applications be allowed to treat this time as if it represented the number of seconds between the referenced time and the Epoch. It is the responsibility of the vendor of the system, and the administrator of the system, to ensure that this value represents the number of seconds between the referenced time and the Epoch as closely as necessary for the application being run on that system….

I imagine there was some debate over this point. The appendix punts, saying that vendors and administrators must make time align “as closely as necessary”, and that “this value should increase even during leap seconds”. The latter is achievable, but the former is arguably impossible: the standard requires POSIX clocks be twenty-seven seconds off.

Consistent interpretation of seconds since the Epoch can be critical to certain types of distributed applications that rely on such timestamps to synchronize events. The accrual of leap seconds in a time standard is not predictable. The number of leap seconds since the Epoch will likely increase. The standard is more concerned about the synchronization of time between applications of astronomically short duration and the Working Group expects these concerns to become more critical in the future.

In a sense, the opposite happened. Time synchronization is always off, so systems generally function (however incorrectly) when times drift a bit. But leap seconds are rare, and the linearity evoked by the phrase “seconds since the epoch” is so deeply baked in to our intuition, that software can accrue serious, unnoticed bugs. Until a few years later, one of those tiny little leap seconds takes down a big chunk of the internet.

What To Do Instead

If you just need to compute the duration between two events on one computer, use CLOCK_MONOTONIC, or better yet, CLOCK_BOOTTIME. If you don’t need to exchange timestamps with other systems that assume POSIX time, use TAI, GPS, or maybe LORAN. If you do need rough alignment with other POSIX-timestamp systems, smear leap seconds over a longer window of time. Libraries like qntm’s t-a-i can convert back and forth between POSIX and TAI.

There’s an ongoing effort to end leap seconds, hopefully by 2035. It’ll require additional work to build conversion tables into everything that relies on the “86,400 seconds per day” assumption, but it should also make it much simpler to ask questions like “how many seconds between these two times”. At least for times after 2035!

December 24, 2024

Helping Christmas Elves Count Presents (or: Vectorized Overflow Checking)

Vectorized Overflow Checking

In our earlier post on overflow handling we explained how to use the CPU flags to detect integer overflows and wrote: “this doesn’t have zero overhead as the compiler can’t vectorize the function using checked arithmetic.” Not satisfied, we took matters into our own hands: if the compiler can’t help us, we have to help ourselves!

This post explains in very low-level detail how you can very quickly sum up (signed) integers on modern x86 CPUs using specialized vector instructions such as vpternlogd. We also show some numbers comparing the hand-written vectorized sum with the compiler-assisted checked arithmetic that demonstrate that you can gain a lot of performance even when checking for overflows.

December 23, 2024

Use of Time in Distributed Databases (part 1)

Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes is needed to satisfy some correctness properties, but since coordination requires message communication there is a performance tradeoff that prevents nodes from communicating/coordinating too frequently.


Timestamping and ordering

Why do the nodes, in particular database nodes, need coordination anyway? The goal of coordination among nodes is to perform event ordering and align the independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is very important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages, and this information by definition comes from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages, all the while the system is moving along, advancing and changing state. This is like trying to compose a panoramic snapshot of a parade moving down the street only by using small 4x6 photos shot by the participants at arbitrary times.

In 1978, Leslie Lamport captured the ordering relationship between events in a distributed system by introducing a neat abstraction called logical clocks. Logical clocks timestamp events based on causality (happened-before) which comes from either precedence within the same thread of execution at the node or via message communication received from another node.

Unfortunately logical clocks are disconnected from real time (wall clocks). Logical clocks would not be able to establish an ordering between event1 at node1 and event2 at node2 occurring a full 24 hours later, if there had not been a chain of communication linking event1 to event2 with a causal precedence relationship in that period. Logical clocks ignore all back-channels, and require strict enforcement of logical-clock timestamping on all communication. Moreover, they don't allow you to query the physical time at which an event took place.

Their cousin vector clocks can get you consistent snapshots across nodes (well, with O(n) vector clock overhead), but vector clocks are also vulnerable to the same drawbacks above. To address these drawbacks, we need physical clocks and real-time affinity. To drive this point home, and learn about even more benefits synchronized clocks can give us, here is another round of motivation for the importance of time in distributed systems and databases.


What are synchronized clocks good for?

Asymmetry of information is the curse of distributed systems. The coordinated attack problem provides a great demonstration of this. The two generals in that problem are stuck in a Zeno's paradox of "you may not know that I know that you know that I know" information chaining. A classic paper to check out on this topic is "Knowledge and common knowledge in a distributed environment".

Synchronized clocks create a consistent global time reference, seeding the buildup of common knowledge. Having a globally consistent time reference (i.e., synchronized clocks/time) is a big boon for common knowledge. Without a common reference time, even with a same-rate local clock at each node, you lose considerable information about the other party.

Consider this example where node2 is trying to learn how much longer node1's lease will remain valid. Without synchronized time, unilateral communication from node1 to node2 does not suffice to convey this information, because the message from node1 to node2 could have been arbitrarily delayed in the channel. This makes the timer on node2 useless for coordinating between the two nodes, since by the time the message arrives the lease might have long expired. With unilateral communication node2 would only know for sure when node1's lease is invalidated, but cannot know whether node1 is still holding a lease. It can only reason about the absence of a lease held by node1, but not about its presence.

Without synchronization, node2 requires two-way communication to estimate lease validity. In this scheme, node2 initiates the communication and starts its timer when it sends the message. When it receives a response back from node1, which contains the remaining time on the lease for node1, it needs to subtract the elapsed round-trip time (RTT) to get a lower bound on the validity of node1's lease. It cannot subtract just RTT/2 from the lease timer, because communication delay may be asymmetric.

But, if we have synchronized clocks, this information can be conveyed with just a unilateral communication from node1 to node2. Node1's message would say "I have the lease till T_end", and when node2 receives the message, this T_end information is instantly usable by node2 without suffering the RTT rounding above, thanks to the synchronized clocks. Synchronized clocks simplify coordination and reduce uncertainty in distributed systems.
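
A minimal sketch of the two schemes described above (illustrative only, not a real lease protocol; send_lease_query stands in for the round trip to node1):

import time

def lease_left_without_sync(send_lease_query):
    # node2 must initiate, measure the round trip, and subtract the *full* RTT
    # (not RTT/2) because the delay may be asymmetric.
    t0 = time.monotonic()
    remaining = send_lease_query()      # node1 replies with its remaining lease time (seconds)
    rtt = time.monotonic() - t0
    return remaining - rtt              # conservative lower bound

def lease_left_with_sync(t_end, now):
    # With synchronized clocks node1 just announces "I hold the lease until T_end";
    # node2 compares against its own clock, no RTT accounting needed.
    return t_end - now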


Loosely synchronized clocks

Ok, let's discuss physical clocks, and clock synchronization at last! Each node has a local clock, which is just a crystal/quartz oscillator. A circuit counts the number of times a crystal oscillates and declares that a millisecond has passed because it counted, say, 1000 oscillations. This is not precise of course. Atomic clocks use rubidium, which has an oscillation/clock error of one microsecond a day. But quartz is not like that. It is dirt cheap and small, but also inaccurate, so it drifts too much. No two quartz crystals are identical; they have big variations. Another thing about quartz crystals is that they are very temperature sensitive: when the temperature gets colder they oscillate more, and when it gets hotter they oscillate less. Therefore frequent time synchronization is needed when using crystal oscillator based clocks. Otherwise each node's clock would drift apart from the others.

In 1985, the Network Time Protocol (NTP) entered the game. NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is by far the most popular time sync distribution protocol on the Internet. However, NTP clock sync errors can amount to tens of milliseconds. Where does the biggest nondeterministic error in NTP synchronization come from? It comes from the network. The switches and routers are the source of nondeterministic errors for NTP. Asymmetry in the links is a big error source. Consider a 100 Mbps link feeding into a 1 Gbps link: one way there is no delay, but coming back the other way there is queuing delay.

Unfortunately, using loosely synchronized clocks via NTP, with several tens of milliseconds of uncertainty, violates causality and monotonicity under errors. This is problematic for ordering events and for our consistent snapshots across nodes. Hybrid Logical Clocks (HLC) provide a remedy by integrating Lamport's logical time with loosely synchronized physical clocks. This makes them resilient to synchronization errors and enables progress in degraded conditions.
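
For reference, a compact sketch of the HLC update rules from the HLC paper, assuming pt is a function that returns the node's physical clock:

class HLC:
    """Hybrid Logical Clock: (l, c) where l tracks physical time and c breaks ties."""

    def __init__(self, pt):
        self.pt = pt          # function returning the local physical clock
        self.l = 0            # logical component (max physical time seen)
        self.c = 0            # counter component

    def send_or_local_event(self):
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive_event(self, l_msg, c_msg):
        l_old = self.l
        self.l = max(l_old, l_msg, self.pt())
        if self.l == l_old and self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        return (self.l, self.c)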

Of course using tightly synchronized clocks is even better, and gets you better performance with more certainty about other nodes' states/events/orderings. Luckily we have had great progress in this area in the last decade.


The advent of tightly synchronized clocks

In 2018, I wrote this short survey on the advent of tightly synchronized clocks.

One big development has been the availability of cheaper (but almost as precise) clocks than atomic clocks. Atomic clocks use rubidium and drift about 1 µs/day. This is so small that relativistic effect considerations kick in here. OCXO ovenized oscillators provide the next best way to have precise clocks in a box in a much cheaper manner. They achieve a drift of ~25 µs/day. Pair these bad boys with a GPS antenna and you get to within less than 100 nanoseconds of true time.

The rest is ensuring high quality distribution of time information and avoiding the sources of nondeterminism creeping in. Precision Time Protocol (PTP) with hardware timestamping helps a lot here and gets you to ~50 µs clock uncertainty, as AWS Timesync provides for public access.

I would be remiss if I didn't talk about the clockbound API here. Google TrueTime (TT) introduced these bounded uncertainty intervals. A call to the TT.now() API returns earliest and latest bounds that the true clock could in the worst case fall into. This is awesome because if you ensure non-overlapping intervals on events, you achieve definite event ordering. This allows you to wait out the latest bound on time to strictly order/serialize your commit event. This also provides another level of safety for the correctness of timestamp-based ordering. For ordering to go awry, both clock synchronization and the clock-bound around it need to go bad concurrently. If your clock synchronization goes bad, the bounds around it would grow, and you would know something is going wrong.


Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (VLDB'80)

Ok, this introduction already got long. Let's finish this Part 1 with a banger! This paper is from VLDB 1980, and predates even NTP. It is written with a typewriter, for God's sake. But Bernstein and Goodman's vision was ahead of its time by many decades. They assume synchronized clocks and propose Timestamp Ordering (TSO) algorithms for concurrency control in distributed database systems. The banger is that after more than four decades, Amazon DynamoDB (ATC'23) uses a TSO-based algorithm for its one-shot transactions.

Here is how the TSO-based optimistic concurrency control algorithm works. Transactions are serialized by their start time ts_ti. Transactions don't hold any locks, and resort to optimistic validation. Instead, every object x is tagged with the timestamp of the last transaction(s) that successfully read/wrote x. So, object x has two pieces of metadata associated with it: read_ts(x) and write_ts(x). To guarantee the serializability of transactions, it is crucial that these timestamps are monotonically increasing (hence the need for synchronized clocks!). The update protocol checks and ensures this for each access. If a transaction tries to access (read/write) an object from its future it is aborted. For example, if ti finds that it is reading x which has write_ts(x) > ts_ti, then this would be a future read, and ti aborts itself. It may later be retried with a new/higher ts, and hopefully it will succeed then. That is it. That is the whole protocol.
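
Here is a minimal single-node sketch of those TSO checks (illustrative Python, not the paper's distributed protocol; the class and method names are made up):

class Aborted(Exception):
    pass

class TSOStore:
    """Each object carries read_ts and write_ts; accesses from the 'past' abort."""

    def __init__(self):
        self.data = {}      # key -> (value, read_ts, write_ts)

    def read(self, key, ts):
        value, read_ts, write_ts = self.data.get(key, (None, 0, 0))
        if ts < write_ts:                    # reading an object from our future
            raise Aborted(f"read of {key}: ts {ts} < write_ts {write_ts}")
        self.data[key] = (value, max(read_ts, ts), write_ts)
        return value

    def write(self, key, value, ts):
        _, read_ts, write_ts = self.data.get(key, (None, 0, 0))
        if ts < read_ts or ts < write_ts:    # a later transaction already saw or wrote it
            raise Aborted(f"write of {key}: ts {ts} too old")
        self.data[key] = (value, read_ts, ts)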

After 1980, the next splash on the use of synchronized clocks arrived at PODC 1991, with Barbara Liskov's "Practical Uses of Synchronized Clocks in Distributed Systems" paper. Interestingly, this paper does not cite Bernstein and Goodman's groundbreaking paper. It proposed several distributed algorithms that use synchronized clocks:

  • Leases for cache consistency and leader-local reads. Technically, these are realized only via same-rate local timers and don't need synchronized clocks. 
  • SCMP protocol for at-most-once message delivery
  • Kerberos authentication tickets

A noteworthy principle the paper advocated was to use clocks for performance, and not for correctness. Notice that in the Bernstein-Goodman paper a decade earlier, the correctness of serialization of OCC transactions did not rely on time synchronization, but performance would improve with better time synchronization.

Then came two decades of silence. The next big update on the practical use of synchronized clocks came 21 years later, in 2012, with Google Spanner.


Outline of Use of Time in Distributed Databases

So, this is what I am thinking of covering in the next posts in this series. I already have summaries of these works in my blog from the last several years. It will be nice to organize this information and draw some trends/conclusions from it. One thing is clear: we see an accelerating trend of adoption of synchronized clocks in distributed databases.

Part 2: Use of logical clocks in database design: Discusses use of vector clocks (e.g., Dynamo (SOSP'07), ORBE (SOCC'13), NAM-DB (VLDB'17)), dependency graph maintenance (COPS (SOSP'11), Kronos (EuroSys'14)), and epoch server (e.g., Chardonnay (OSDI'23), Chablis (CIDR'24)).

Part 3: Use of synchronized physical clocks in database design: Granola (ATC'12), Clock-SI (SRDS'13), GentleRain (SOCC'14), Occult (NSDI'17), Nezha (VLDB'23)

Part 4: Use of clocks in production databases: Spanner (OSDI'12), MongoDB (SIGMOD'19), CockroachDB (SIGMOD'20), DynamoDB (ATC'23), Cassandra/Accord (2023), Aurora Limitless (2023), Aurora DSQL (2024) 

Part 5: Lessons learned: Takeaways, correctness of synchronized clock-based algorithms, tradeoffs, and trends

December 22, 2024

How bloom filters made SQLite 10x faster

This is the fascinating story of how researchers used Bloom filters cleverly to make SQLite 10x faster for analytical queries. These are my five-minute notes on the paper SQLite: Past, Present, and Future