a curated list of database news from authoritative sources

January 13, 2025

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the fashion-mnist-784-euclidean dataset for MariaDB and Postgres (pgvector) with concurrent queries (--batch). My previous post has results when not using concurrent queries. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.

I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.

Performance for MariaDB is excellent
  • peak QPS for a given recall target is much better than pgvector
  • time to create the index to achieve a given recall target is almost always much better than pgvector

Benchmark

Part 1 has all of the details.

I am using the ann-benchmarks via my fork of a fork of a fork at this commit. I added --batch to the command lines listed in part 1 to repeat the benchmark using concurrent queries. Note that parallel index create was disabled for Postgres and isn't (yet) supported by MariaDB.

With --batch there is one concurrent query per CPU core and my server has 8 cores.

Files related to these tests are archived here.

Results: QPS vs recall graphs

The recall vs QPS graph is created by: python3 plot.py --dataset fashion-mnist-784-euclidean --batch. This chart is very similar to the chart in part 1. One difference is that the peak QPS for MariaDB and Postgres are ~6000/s and ~4000/s there vs ~22000/s and ~15000/s here.

The results below for MariaDB are excellent. It gets more QPS than pgvector at a given recall target.


Results: create index

I am still trying to figure out how to present this data. All of the numbers are here for the time to create an index and the size of the index. The database configs for Postgres and MariaDB are shared above, and parallel index create is disabled by the config for Postgres (and not supported yet by MariaDB). The summary is:
  • Indexes have a similar size with MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
  • Time to create the index for Postgres is similar with and without halfvec
  • Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index for a similar point on the QPS/recall graph (see the last section of this blog post)
Results: best QPS for a given recall

Many benchmark results are marketed via peak performance (max throughput or min response time) but these are usually constrained optimization problems -- determine peak performance that satisfies some SLA. And the SLA might be response time or efficiency (cost).

With ann-benchmarks the constraint might be recall where you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).
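
To make that selection concrete, here is a minimal Python sketch of the constrained lookup; the result records are hypothetical stand-ins for ann-benchmarks output, with the values taken from the recall >= 0.99 table below:

# Minimal sketch: pick the best QPS among results that meet a recall floor.
def best_qps(results, recall_target):
    eligible = [r for r in results if r["recall"] >= recall_target]
    return max(eligible, key=lambda r: r["qps"]) if eligible else None

results = [
    {"algo": "PGVector(m=16, ef_cons=96, ef_search=20)",          "recall": 0.990, "qps": 10704},
    {"algo": "PGVector_halfvec(m=16, ef_cons=256, ef_search=20)", "recall": 0.991, "qps": 11377},
    {"algo": "MariaDB(m=12, ef_search=10)",                       "recall": 0.995, "qps": 17642},
]
print(best_qps(results, 0.99))  # -> the MariaDB entry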

For all cases below except the first (recall = 1.000) the time to create the index is less with MariaDB.

For all cases below the best QPS at a given recall target is much better, sometimes 3X better, with MariaDB than with pgvector.

Legend
* recall & QPS - results from the benchmark
* isecs - number of seconds to create the index for M and ef_construction (ef_cons)

Best QPS with recall = 1.000
recall  QPS     isecs   algorithm
1.000    4727   115.9   PGVector(m=16, ef_cons=256, ef_search=120)
1.000    5479    98.6   PGVector_halfvec(m=16, ef_cons=192, ef_search=120)
1.000   13828   108.5   MariaDB(m=32, ef_search=10)

Best QPS with recall >= 0.99
recall  QPS     isecs   algorithm
0.990   10704    71.2   PGVector(m=16, ef_cons=96, ef_search=20)
0.991   11377    90.1   PGVector_halfvec(m=16, ef_cons=256, ef_search=20)
0.995   17642    24.5   MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.98
recall  QPS     isecs   algorithm
0.985   10843    51.0   PGVector(m=16, ef_cons=32, ef_search=20)
0.984   11749    44.5   PGVector_halfvec(m=16, ef_cons=32, ef_search=20)
0.995   17642    24.5   MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.97
recall  QPS     isecs   algorithm
0.970   12836    35.1   PGVector(m=8, ef_cons=64, ef_search=20)
0.973   13742    33.8   PGVector_halfvec(m=8, ef_cons=96, ef_search=20)
0.995   17642    24.5   MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.96
recall  QPS     isecs   algorithm
0.970   12836    35.1   PGVector(m=8, ef_cons=64, ef_search=20)
0.973   13742    33.8   PGVector_halfvec(m=8, ef_cons=96, ef_search=20)
0.995   17642    24.5   MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.95
recall  QPS     isecs   algorithm
0.970   12836    35.1   PGVector(m=8, ef_cons=64, ef_search=20)
0.973   13742    33.8   PGVector_halfvec(m=8, ef_cons=96, ef_search=20)
0.995   17642    24.5   MariaDB(m=12, ef_search=10)

January 12, 2025

Evaluating vector indexes in MariaDB and pgvector: part 1

This post has results for vector index support in MariaDB and Postgres. I am new to vector indexes so I will start small and over time add more results.  This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.

I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.

tl;dr

  • Performance for MariaDB is excellent
    • peak QPS for a given recall target is much better than pgvector
    • time to create the index to achieve a given recall target is almost always much better than pgvector
  • MariaDB is easier to tune than pgvector
  • MariaDB needs more documentation
The good and the less good

The good for MariaDB starts with performance. The results I get are great and match the results in this blog post from upstream. Another good thing about MariaDB is that it is easier to tune. With pgvector I need to set M and ef_construction while creating an index and then ef_search while running a query. With MariaDB there is no option to set ef_construction, and evaluations are easier when there are fewer options to tune. Note that the M with pgvector isn't the same thing as the M with MariaDB, but they are similar (at least in spirit).
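
To make the pgvector knobs concrete, here is a hedged sketch using psycopg; the table, column and index names are made up, the toy vectors are 3-dimensional rather than 784, and the connection string mirrors the benchmark settings:

import psycopg

# Sketch only: m and ef_construction are fixed at index build time,
# ef_search is a per-session, query-time knob.
with psycopg.connect("dbname=ib user=root password=pw host=127.0.0.1 port=5432") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_hnsw ON items "
        "USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 96)"
    )
    conn.execute("SET hnsw.ef_search = 20")
    cur = conn.execute(
        "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 10",
        ("[1,2,3]",),
    )
    print(cur.fetchall())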

The less good for MariaDB is the documentation. We need more, but that is easy to fix. I have been using:

Things that need more documentation
  • the algorithm is described as a variation of HNSW (MHNSW) and the community will benefit from more details on what has been changed. For example, I can't set ef_construction and it always uses float16 in the index. But with pgvector I can use either float32 (default) or float16 (via halfvec).
  • What transaction isolation level is supported? I get that workloads will be read-heavy but I assume that some won't be read-only so I want to know whether repeatable read and read committed are provided.
  • What concurrent operations are supported? Obviously, reads can be concurrent with other reads. But can writes be concurrent with reads or with other writes on the same table?
Hardware

The hardware is a Beelink SER7 7840HS with a Ryzen 7 7840HS CPU, 32G of RAM and Ubuntu 22.04.

While I already installed some dependencies on this server long ago, to run this benchmark I did:

sudo apt install libmariadb3 libmariadb-dev

pip3 install mariadb

pip3 install pgvector psycopg


Database software

For Postgres I compiled version 17.2 and pgvector 0.8.0 from source. Files that I used include:
For MariaDB I compiled version 11.7.1 from source. Files that I used include:
Benchmark

I am using the ann-benchmarks via my fork of a fork of a fork at this commit. These forks have changes to run the benchmark for MariaDB and pgvector without using Docker containers. A request I have for anyone writing a benchmark client is to limit dependencies, or at least make them optional. I just want to point the benchmark client at an existing installation.

The ann-benchmarks configuration files are here for MariaDB and for pgvector. I am open to feedback that I should try different parameters. I added support to use float16 (halfvec) for the pgvector index (but only the index, not for the vector stored in the table).
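
For reference, the index-only float16 variant maps to pgvector's halfvec type via an expression index over a cast; a minimal sketch (the items table and its vector(784) embedding column are assumptions):

import psycopg

# Sketch only: the stored column stays float32 (vector), only the index entries are float16.
with psycopg.connect("dbname=ib user=root password=pw host=127.0.0.1 port=5432") as conn:
    conn.execute(
        "CREATE INDEX ON items USING hnsw "
        "((embedding::halfvec(784)) halfvec_l2_ops) WITH (m = 16, ef_construction = 96)"
    )
    # Queries must apply the same cast so the planner can use the expression index, e.g.
    #   ORDER BY embedding::halfvec(784) <-> $1::halfvec(784) LIMIT 10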

In this post I use the fashion-mnist-784-euclidean dataset in non-batch mode (non-batch mode doesn't run concurrent queries; batch mode does). Note that I don't set --count when running the benchmark, which means that all of the queries use LIMIT 10.
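
For readers new to ann-benchmarks, recall here is recall@10: the fraction of the true 10 nearest neighbors that the approximate query returns, averaged over queries. A simplified sketch of that calculation (ann-benchmarks' own metric additionally handles distance ties):

# Simplified recall@k: overlap between approximate and exact top-k ids, averaged over queries.
def recall_at_k(approx_ids, true_ids, k=10):
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(approx_ids, true_ids))
    return hits / (k * len(true_ids))

# One query where 9 of the true top-10 ids were returned -> 0.9
print(recall_at_k([[1, 2, 3, 4, 5, 6, 7, 8, 9, 99]], [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]))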

Files related to these tests are archived here.

The command lines for non-batch mode are:
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
    python3 -u run.py  --algorithm pgvector --dataset fashion-mnist-784-euclidean --local

POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
    python3 -u run.py  --algorithm pgvector_halfvec --dataset fashion-mnist-784-euclidean --local

MARIADB_CONN_ARGS=root:pw:127.0.0.1:3306 MARIADB_DB_NAME=test \
    python3 -u run.py  --algorithm mariadb --dataset fashion-mnist-784-euclidean --local

Results: QPS vs recall graphs

The recall vs QPS graph is created by: python3 plot.py --dataset fashion-mnist-784-euclidean

The results below for MariaDB are excellent. It gets more QPS than pgvector at a given recall target.


Results: create index

I am still trying to figure out how to present this data. All of the numbers are here for the time to create an index and the size of the index. The database configs for Postgres and MariaDB are shared above, and parallel index create is disabled by the config for Postgres (and not supported yet by MariaDB). The summary is:
  • Indexes have a similar size with MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
  • Time to create the index for Postgres is similar with and without halfvec
  • Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index for a similar point on the QPS/recall graph (see the last section of this blog post)
Results: best QPS for a given recall

Many benchmark results are marketed via peak performance (max throughput or min response time) but these are usually constrained optimization problems -- determine peak performance that satisfies some SLA. And the SLA might be response time or efficiency (cost).

With ann-benchmarks the constraint might be recall where you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).

For all cases below except the first (recall = 1.000) the time to create the index is about 2X or more longer with pgvector than with MariaDB.

For all cases below the best QPS at a given recall target is much better, sometimes 3X better, with MariaDB than with pgvector.

Legend
* recall & QPS - results from the benchmark
* isecs - number of seconds to create the index for M and ef_construction (ef_cons)

Best QPS with recall >= 1.000
recall  QPS  isecs  algorithm
1.000   920  97.4   PGVector(m=16, ef_cons=192, ef_search=120)
1.000   991  89.8   PGVector_halfvec(m=16, ef_cons=256, ef_search=120)
1.000  3226 111.8   MariaDB(m=32, ef_search=10)

Best QPS with recall >= 0.99
recall  QPS  isecs  algorithm
0.990  2337  70.7   PGVector(m=16, ef_cons=96, ef_search=20)
0.991  2558  78.3   PGVector_halfvec(m=16, ef_cons=192, ef_search=20)
0.995  4745  27.4   MariaDB(m=12, ef_search=10)

Best QPS with recall >= 0.98
recall  QPS  isecs  algorithm
0.985  2383  50.4   PGVector(m=16, ef_cons=32, ef_search=20)
0.988  2608  52.8   PGVector_halfvec(m=16, ef_cons=64, ef_search=20)
0.984  5351  18.1   MariaDB(m=8, ef_search=10)

Best QPS with recall >= 0.97
recall  QPS  isecs  algorithm
0.972  3033  42.3   PGVector(m=8, ef_cons=96, ef_search=20)
0.973  3185  89.8   PGVector_halfvec(m=16, ef_cons=256, ef_search=10)
0.984  5351  18.1   MariaDB(m=8, ef_search=10)

Best QPS with recall >= 0.96
recall  QPS  isecs  algorithm
0.961  3044   50.4  PGVector(m=16, ef_cons=32, ef_search=10)
0.967  3250   52.8  PGVector_halfvec(m=16, ef_cons=64, ef_search=10)
0.961  5691   14.8  MariaDB(m=6, ef_search=10)

Best QPS with recall >= 0.95
recall  QPS  isecs  algorithm
0.961  3044  50.4   PGVector(m=16, ef_cons=32, ef_search=10)
0.967  3250  52.8   PGVector_halfvec(m=16, ef_cons=64, ef_search=10)
0.961  5691  14.8   MariaDB(m=6, ef_search=10)

January 11, 2025

January 10, 2025

Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases

This is part 4 of our "Use of Time in Distributed Databases" series. In this post, we explore how synchronized physical clocks enhance production database systems.

Spanner

Google's Spanner (OSDI'12) implemented a novel approach to handling time in distributed database systems through its TrueTime API. The TrueTime API provides time as an interval that is guaranteed to contain the actual time, maintained within about 6ms of uncertainty (this is the 2012 published number, which has improved significantly since then) using GPS receivers and atomic clocks. This explicit handling of time uncertainty allows Spanner to provide strong consistency guarantees while operating at a global scale.

Spanner uses multi-version concurrency control (MVCC) and achieves external consistency (linearizability) for current transactions through techniques like "commit wait," where transactions wait out the uncertainty in their commit timestamps before making their writes visible. Spanner uses a Paxos-groups approach to shard (partition) data across up to hundreds of servers, but thanks to TrueTime clocks, the system is able to provide linearizability to transactions without having to coordinate across all shards.
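
A minimal sketch of the commit-wait idea, with TrueTime modeled as an uncertainty interval around the local clock (this is an illustration, not Google's API):

import time

EPSILON = 0.006  # ~6ms of uncertainty, per the 2012 paper

def tt_now():
    # Model TrueTime: return an interval [earliest, latest] guaranteed to contain real time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    _, latest = tt_now()
    commit_ts = latest                  # a timestamp no earlier than real time
    while tt_now()[0] <= commit_ts:     # "commit wait": wait out the uncertainty
        time.sleep(0.0005)
    apply_writes(commit_ts)             # only now make the writes visible at commit_ts
    return commit_ts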

The real power of TrueTime becomes apparent in its handling of lock-free snapshot reads (reads of past database states). Without TrueTime, capturing consistent snapshots across a distributed database would require extensively tracking and storing causality relationships between different versions of data across the system. By using precisely synchronized time as the basis for versioning, Spanner can easily identify globally consistent reads of the database at any point in the past simply by specifying a timestamp, in a lock-free manner.


CockroachDB

CockroachDB builds on concepts from Google's Spanner, using a distributed key-value store organized into ranges (approximately 64MB chunks)--with each range forming its own Raft consensus group.

The key innovation of CockroachDB lies in achieving strong consistency and geo-distribution without relying on tightly-synchronized atomic clocks. Instead, it uses NTP-synchronized clocks alongside hybrid logical clocks (HLCs) and intent locks to handle time uncertainty. Unlike Spanner, which uses TrueTime and "commit-wait" to handle clock uncertainty, CockroachDB refrains from commit-waits due to longer NTP uncertainty intervals. CockroachDB dynamically adjusts transaction timestamps upward when conflicts occur within the uncertainty window. This approach eliminates the need for tightly-synchronized clocks, but leads to more transaction aborts/restarts under high contention.
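
For readers new to hybrid logical clocks, here is a minimal sketch of the send/receive rules (simplified from the HLC paper; not CockroachDB's implementation):

import time

class HLC:
    # Hybrid logical clock: a (physical, logical) pair that never runs behind the local
    # wall clock or any timestamp observed on incoming messages.
    def __init__(self):
        self.pt = 0  # physical component (nanoseconds)
        self.l = 0   # logical counter to order events within the same physical tick

    def now(self):
        wall = time.time_ns()
        if wall > self.pt:
            self.pt, self.l = wall, 0
        else:
            self.l += 1
        return (self.pt, self.l)

    def update(self, remote_pt, remote_l):
        # Called when a message carrying a remote HLC timestamp is received.
        wall = time.time_ns()
        if wall > self.pt and wall > remote_pt:
            self.pt, self.l = wall, 0
        elif remote_pt > self.pt:
            self.pt, self.l = remote_pt, remote_l + 1
        elif remote_pt == self.pt:
            self.l = max(self.l, remote_l) + 1
        else:
            self.l += 1
        return (self.pt, self.l)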


CockroachDB implements serializable isolation through a MVCC system. Serializable isolation guarantees transactions appear to execute in isolation, but strict serializability adds a real-time ordering constraint: if transaction T1 commits before T2 begins, T1's commit timestamp must precede T2's. With NTP-synchronized clocks, CockroachDB achieves serializability but may violate strict serializability due to clock skew between nodes.

This limitation is illustrated through a scenario with three transactions across two keys: when transaction T1 reads both keys, it may see T3's update to y but not T2's earlier update to x, violating strict serializability. While this execution is serializable (as the transactions could be ordered T3, T1, T2), it breaks the real-time ordering guarantee that strict serializability requires. CockroachDB's reliance on looser clock synchronization makes such guarantees impossible without significant performance penalties. Spanner avoids these issues through its TrueTime API and tightly synchronized clocks, as it can afford to perform commit-wait (delaying transaction commits until the timestamp's upper bound has passed).

A recent paper investigated integrating high-precision clock synchronization into CockroachDB and the resulting impact on performance. This is a good read if you are interested in understanding CockroachDB's handling of time uncertainty in transaction processing.

YugabyteDB, which has a similar Raft-groups based shared nothing architecture, recently posted a great analysis of how to make AWS Time Sync Service with Precision Hardware Clock (with 50 microsecond clockbound) work with their database.


MongoDB

MongoDB's implementation of causal consistency also relies on Hybrid Logical Clocks (HLC). The system tracks ClusterTime as a tuple of Unix epoch seconds and an increment counter, allowing for fine-grained ordering within the same second.

This combines the benefits of physical and logical time, incrementing the logical clock only when operations are written to the primary node's replication log (op log). 

Causal consistency is implemented through sessions, where each operation carries the highest known operationTime, and nodes wait for their operation log to catch up to this time before processing requests. This approach enables cross-session and cross-client causal consistency while maintaining system availability during network partitions.
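
A rough sketch of that wait, with invented names for the node and session objects (this illustrates the idea, not MongoDB's code):

import time

def wait_for_operation_time(node, after_cluster_time, poll_s=0.001):
    # Block until this node's oplog has applied everything up to the client's operationTime.
    while node.last_applied_op_time() < after_cluster_time:
        time.sleep(poll_s)

def causal_read(node, session, query):
    wait_for_operation_time(node, session.operation_time)
    result, read_time = node.execute(query)
    # Carry the highest observed time forward so later reads in the session stay causal.
    session.operation_time = max(session.operation_time, read_time)
    return result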

MongoDB also uses ClusterTime for running OCC general-purpose transactions across multiple shards, and it provides a snapshot isolation guarantee for those transactions. This takes longer to explain, so I will explain MongoDB's interactive/ACID distributed transaction protocol in another blog post.


DynamoDB

DynamoDB's implementation of transactions uses a timestamp-based Bernstein-Goldman optimistic concurrency control (TSO) protocol from 1980, where transactions are assigned timestamps at the start to define their position in the serial order. Correctness does not depend on perfect clock synchronization -- the timestamps can be treated as monotonically increasing logical timestamps since the protocol rejects smaller timestamps. Synchronized clocks help improve the performance of transactions by reducing spurious conflicts.
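
As a rough sketch of the underlying timestamp-ordering check (the classic textbook form, with invented item fields; DynamoDB's read-only path described below avoids per-item read timestamps):

# Timestamp-ordering OCC sketch: a transaction with timestamp ts may write an item only
# if no transaction with a later timestamp has already read or written it.
def may_write(item, ts):
    return ts > item["write_ts"] and ts > item["read_ts"]

def commit_write_transaction(items, writes, ts):
    # Phase 1: check every item in the write set against the transaction timestamp.
    if not all(may_write(items[key], ts) for key in writes):
        return False  # reject; the coordinator aborts (or retries with a new timestamp)
    # Phase 2: apply the writes and advance the items' write timestamps.
    for key, value in writes.items():
        items[key].update(value=value, write_ts=ts)
    return True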

To maintain its promise of predictable performance at scale, DynamoDB implements one-shot transactions rather than interactive transactions, allowing it to avoid the complexities of lock management and recovery. For write transactions, the system uses a two-phase protocol where the transaction coordinator checks timestamps against item timestamps before committing. While better synchronized clocks result in more successful transactions (as real-time ordering aligns better with serialization order), the system includes safeguards against clocks that are too far ahead and isolates transaction managers with faulty clocks.

For read-only transactions, DynamoDB employs a two-phase protocol that avoids maintaining read timestamps on items, which would turn every read into a costly write operation. Instead, it uses log sequence numbers (LSNs) to detect concurrent modifications between the two phases.


Accord

Accord protocol (2023) aims to implement general-purpose transactions in Cassandra. The key innovation is the use of synchronized clocks and hybrid logical clocks (HLCs) to achieve fast, consistent transactions across distributed systems.

Accord is a leaderless consensus protocol that aims to achieve optimal latency (one WAN round-trip) under normal conditions while maintaining strict-serializable isolation. It comes from the same family of solutions as EPaxos, Caesar, and Tempo. Improving on Tempo (which we had reviewed in detail here), Accord handles timestamp conflicts more efficiently by allowing for inconsistent but valid sets of dependencies between transactions, requiring only that all dependency sets include those that may be committed with a lower timestamp.

This time-based approach, combined with the Reorder Buffer and Fast Path Electorate concepts, helps improve performance. Accord's reorder buffer uses knowledge of maximum clock skew between nodes and point-to-point latencies to ensure messages are processed in the correct timestamp order. The system buffers timestamp proposals for a period equal to the clock skew plus the longest point-to-point latency, ensuring that potentially conflicting messages have time to arrive before processing begins. Accord uses HLCs that augment timestamps with unique replica IDs to achieve both temporal ordering and global uniqueness of transactions.
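
A rough sketch of the reorder-buffer idea (names and structure are illustrative, not Accord's implementation):

import heapq, time

class ReorderBuffer:
    # Hold each timestamp proposal for (max clock skew + max point-to-point latency) so
    # that any proposal with a smaller timestamp has time to arrive, then release in order.
    def __init__(self, max_skew_s, max_latency_s):
        self.delay = max_skew_s + max_latency_s
        self.heap = []  # entries: (proposed_ts, arrival_time, message)

    def add(self, proposed_ts, message):
        heapq.heappush(self.heap, (proposed_ts, time.monotonic(), message))

    def pop_ready(self):
        ready, now = [], time.monotonic()
        # The head has the smallest timestamp; release it only after its hold period elapses.
        while self.heap and now - self.heap[0][1] >= self.delay:
            ready.append(heapq.heappop(self.heap)[2])
        return ready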

If you just want to focus on the abstract idea, Detock (SIGMOD 2023) presents a similar idea for serializing transactions using time. So yeah there is a hint of deterministic databases here. 


TiDB: snapshot isolation (Timestamp Oracle)

TiDB employs a centralized timestamp management system through its Placement Driver (PD), which hosts the timestamp oracle (TSO). The TSO generates globally unique timestamps that combine physical time (with millisecond accuracy) and logical time (using 18 bits), serving as transaction IDs. This hybrid logical clock approach ensures strict ordering of transactions while maintaining a close relationship with real time.
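
The physical-plus-logical layout can be pictured as packing millisecond time and an 18-bit counter into one integer; a sketch of that allocation (not PD's actual code):

import time

LOGICAL_BITS = 18
MAX_LOGICAL = (1 << LOGICAL_BITS) - 1  # up to 2^18 distinct timestamps per millisecond

class TSO:
    def __init__(self):
        self.physical_ms = 0
        self.logical = 0

    def get_ts(self):
        now_ms = int(time.time() * 1000)
        if now_ms > self.physical_ms:
            self.physical_ms, self.logical = now_ms, 0
        elif self.logical < MAX_LOGICAL:
            self.logical += 1
        else:
            self.physical_ms += 1   # counter exhausted; the real PD waits/batches instead
            self.logical = 0
        return (self.physical_ms << LOGICAL_BITS) | self.logical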

TiDB uses these timestamps to implement snapshot isolation (SI) through multi-version concurrency control (MVCC). Unlike CockroachDB which provides serializability, TiDB focuses on snapshot isolation and read committed isolation levels. The timestamps play a crucial role in transaction management, allowing the system to maintain consistent views of the data and protect against write-write conflicts. The system supports both optimistic and pessimistic concurrency control methods, with the optimistic approach providing better throughput in low-conflict situations.

An important limitation of TiDB's time-based approach is that its centralized TSO makes multi-region deployment challenging. While the system uses Raft for replication within a single region (organizing data into 96MB range shards they call "regions"), geo-distribution would require modifications to both the TSO and Placement Driver to handle cross-region timestamp coordination effectively.


Aurora Limitless

Aurora Limitless is not a different engine, but rather a managed sharding option built on top of Aurora (which we reviewed here). Instead of a single writer, it allows multiple writers by leveraging precise time synchronization to achieve scalable transactions. The system uses Amazon TimeSync, which provides microsecond-accurate time synchronization with bounded uncertainty (under 50 microseconds), combined with Hybrid Logical Clocks (HLC) to ensure consistent ordering of events across the distributed system.

Aurora Limitless modifies PostgreSQL's traditional snapshot isolation mechanism to work in a distributed environment by implementing snapshots "as of then" rather than "as of now." To handle potential issues with transaction visibility and commit timing, Aurora Limitless uses clever techniques like predicting commit completion times and setting commit timestamps slightly in the future. This approach helps maintain read performance by ensuring reads are never blocked by writes while still preserving consistency guarantees.

The architecture employs a two-phase commit (2PC) protocol with an interesting twist: transaction routers initially coordinate commits but hand off supervision to local shard coordinators for high availability. This is because we don't want to require high availability at the router; we want to keep it soft-state. Shards already have high availability built-in, so the router hands off supervision of 2PC completion to a shard leader, which is already replicated and highly available. The system also addresses the challenge of "zombie shards" (nodes that don't know they've been failed over) through consistency leases that expire after a few seconds to prevent stale reads.
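
A tiny sketch of the consistency-lease idea (an illustration under assumed names, not Aurora's implementation): a shard owner may serve reads only while its lease is unexpired, so a zombie that has been failed over and can no longer renew goes read-silent within seconds.

import time

class ConsistencyLease:
    def __init__(self, duration_s=3.0):
        self.duration_s = duration_s
        self.expires_at = 0.0

    def renew(self):
        # Called periodically while this node still believes it owns the shard.
        self.expires_at = time.monotonic() + self.duration_s

    def can_serve_read(self):
        return time.monotonic() < self.expires_at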


Aurora DSQL

Aurora DSQL (2024) uses time as a fundamental mechanism for managing transaction consistency and isolation. Unlike previous Aurora databases, it employs a different, truly serverless architecture with two key services: an adjudicator service for concurrency control and a journal service for update durability.

The system uses timestamps in two crucial ways: first, when a transaction starts, it chooses a start time (Ts) for performing all reads against the multiversion concurrency control (MVCC) storage system. Second, when committing, it selects a commit time (Tc). These timestamps help maintain snapshot isolation by ensuring that conflicts are properly detected between transactions that overlap in time.

DSQL's commit process is built around temporal coordination. When a transaction wants to commit, it must verify that no other transaction has written to the same keys between its start time (Ts) and commit time (Tc). This verification is handled by the adjudicator service, which makes guarantees about transaction ordering based on timestamps. The adjudicator promises to never commit a transaction with an earlier timestamp once it has allowed a transaction to commit at a given time, establishing a temporal ordering of transactions. Each adjudicator has jurisdiction over specific key-ranges. Adjudicators don't use Paxos for leader-election/fault-tolerance, but rather they again leverage time/fencing during handovers of key-range responsibility.
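
A rough sketch of that commit-time validation, with invented names (an illustration of the check, not the DSQL protocol):

# OCC validation sketch: reject a transaction if any key in its write set was committed
# by another transaction between its start time (Ts) and its commit time (Tc).
def try_commit(committed, write_keys, ts_start, ts_commit):
    # committed maps key -> list of commit timestamps already accepted for that key
    for key in write_keys:
        if any(ts_start < ts <= ts_commit for ts in committed.get(key, [])):
            return False  # conflict detected in (Ts, Tc]; the transaction must abort/retry
    for key in write_keys:
        committed.setdefault(key, []).append(ts_commit)
    return True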

The storage system uses timestamps to determine when it has a complete view of the data at a given point in time. To handle cases where the write rate might be low, the system implements a heartbeat protocol where adjudicators commit to moving their commit points forward in sync with the physical clock. This ensures that storage nodes can confidently serve reads as of a particular timestamp by knowing they have seen all transactions up to that point.

This time-based approach allows Aurora DSQL to achieve strong consistency reads and snapshot isolated transactions while maintaining high performance through optimistic concurrency control (OCC). By moving all coordination to commit time and using timestamps for ordering, the system minimizes the necessary coordination between machines, availability zones, and regions.

Outgrowing Postgres: How to identify scale problems

Discover early warning signs that you’ve outgrown PostgreSQL and learn how to keep performance high. This introductory article offers diagnostic techniques and proactive strategies to help you scale and plan the future of your analytics without losing momentum.

Sysbench performance over time for InnoDB and MyRocks: part 4

This is part 4 in my (possibly) final series on performance regressions in MySQL using cached sysbench as the workload. For previous posts, see part 1, part 2 and part 3. This post covers performance differences between InnoDB in upstream MySQL 8.0.32, InnoDB in FB MySQL 8.0.32 and MyRocks in FB MySQL 8.0.32 using a server with 32 cores and 128G of RAM.

I don't claim that the MyRocks CPU overhead isn't relevant, but this workload (CPU-bound, database is cached) is a worst-case for it.

tl;dr 

  • InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
  • Fixing bug 1506 is important for InnoDB in FB MySQL
  • MyRocks is ~30% slower than upstream InnoDB at low concurrency and ~45% slower at high, as it uses ~1.5X more CPU/query
  • For writes, MyRocks does worse at high concurrency than at low
Updates: For writes, MyRocks does worse at high concurrency than at low

I looked at vmstat metrics for the update-nonindex benchmark and the number of context switches per update is about 1.2X larger for MyRocks vs InnoDB at high concurrency. 

Then I looked at PMP stacks and MyRocks has more samples for commit processing. The top stacks are here. This should not be a big surprise because MyRocks does more work at commit time (pushes changes from a per-session buffer into the memtable). But I need to look at this more closely.

I browsed the code in Commit_stage_manager::enroll_for, which is on the call stack for the mutex contention, and it is kind of complicated. I am trying to figure out how many mutexes are locked in there and figuring that out will take some time. 

Benchmark, Hardware

Much more detail on the benchmark and hardware is here. I am trying to avoid repeating that information in the posts that follow. 

Results here are from the c32r128 server with 32 CPU cores and 128G of RAM. The benchmarks were repeated for 1 and 24 threads. On the charts below that is indicated by NT=1 and NT=24.

Builds

The previous post has more detail on the builds, my.cnf files and bug fixes.

The encoded names for these builds are:
  • my8032_rel_o2nofp
    • InnoDB from upstream MySQL 8.0.32
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
    • FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1. This supports InnoDB and MyRocks.
  • fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
    • FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1 with patches applied for bugs 1473, 1481, 1482 and 1506. This supports InnoDB and MyRocks.
The my.cnf files are:
Relative QPS

The charts and summary statistics that follow use a number that I call the relative QPS (rQPS) where:
  • rQPS is: (QPS for my version) / (QPS for base version)
  • base version is InnoDB from upstream MySQL 8.0.32 (my8032_rel_o2nofp)
  • my version is one of the other versions
Results

The microbenchmarks are split into three groups: point queries, range queries, writes. The tables below have summary statistics for InnoDB and MyRocks using the relative QPS (the same data as the charts).

Results are provided in two formats: charts and summary statistics. The summary statistics tables have the min, max, average and median relative QPS per group (group = point, range and writes).
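
For concreteness, a small sketch of how the relative QPS and the per-group summaries are computed (the values below are placeholders, not measured results):

from statistics import mean, median

def relative_qps(qps_my_version, qps_base_version):
    return qps_my_version / qps_base_version

def summarize(rqps_by_group):
    return {group: {"min": min(v), "max": max(v), "average": round(mean(v), 2),
                    "median": round(median(v), 2)}
            for group, v in rqps_by_group.items()}

rqps = {"point": [0.92, 0.94, 0.96], "range": [0.62, 0.82, 0.96], "writes": [0.84, 0.87, 0.94]}
print(summarize(rqps))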

The spreadsheets and charts are also here. I don't know how to prevent the microbenchmark names on the x-axis from getting truncated in the png files I use here but they are easier to read on the spreadsheet.

The charts use NT=1, NT=16 and NT=24 to indicate whether sysbench was run with 1, 16 or 24 threads. The charts and table use the following abbreviations for the DBMS versions:
  • fbinno-nofix
    • InnoDB from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
  • fbinno-somefix
    • InnoDB from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
  • myrocks-nofix
    • MyRocks from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
  • myrocks-somefix
    • MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

Summary statistics: InnoDB

Summary:

  • InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
  • Fixing bug 1506 is important for InnoDB in FB MySQL
1 thread

fbinno-nofix    min     max     average  median
point           0.89    0.96    0.92     0.91
range           0.63    0.93    0.82     0.82
writes          0.86    0.98    0.89     0.88

fbinno-somefix  min     max     average  median
point           0.92    1.00    0.96     0.95
range           0.89    0.96    0.91     0.91
writes          0.89    0.99    0.92     0.92

24 threads

fbinno-nofix    min     max     average  median
point           0.92    0.96    0.94     0.94
range           0.62    0.96    0.81     0.82
writes          0.84    0.94    0.88     0.87

fbinno-somefix  min     max     average  median
point           0.94    0.99    0.97     0.98
range           0.78    0.99    0.89     0.91
writes          0.86    0.95    0.90     0.88

Summary statistics: MyRocks

Summary:

  • MyRocks does better at low concurrency than at high. The fix might be as simple as enabling the hyper clock block cache
  • MyRocks is ~30% slower than upstream InnoDB at low concurrency and ~45% slower at high
  • For writes, MyRocks does worse at high concurrency than at low
1 thread

myrocks-nofix   min     max     average  median
point           0.52    0.75    0.66     0.68
range           0.37    0.72    0.60     0.60
writes          0.65    1.21    0.79     0.73

myrocks-somefix min     max     average  median
point           0.51    0.79    0.68     0.70
range           0.43    0.76    0.62     0.61
writes          0.66    1.23    0.80     0.74

24 threads

myrocks-nofix   min     max     average  median
point           0.40    0.76    0.49     0.43
range           0.40    0.71    0.58     0.60
writes          0.44    1.37    0.65     0.55

myrocks-somefix min     max     average  median
point           0.48    0.77    0.55     0.51
range           0.43    0.71    0.60     0.60
writes          0.45    1.39    0.66     0.55

Results: c32r128 with InnoDB and point queries

Summary
  • InnoDB from FB MySQL is no worse than 10% slower than upstream

Results: c32r128 with MyRocks and point queries

Summary
  • at low concurrency the worst cases for MyRocks are the tests that do point lookups on secondary indexes, because those use a range scan rather than a point lookup on the LSM tree, which means that bloom filters cannot be used
  • at high concurrency the difference between primary and secondary index queries is less significant; perhaps this is dominated by mutex contention from the LRU block cache and solved by using hyper clock

Results: c32r128 with InnoDB and range queries

Summary

  • the worst cases for InnoDB from FB MySQL are the long range scans, and fixing bug 1506 will be a big deal

Results: c32r128 with MyRocks and range queries

Summary

  • while long range scans are the worst case here, bug 1506 is not an issue as that is InnoDB-only

Results: c32r128 with InnoDB and writes

Summary

  • results are stable here, InnoDB from FB MySQL is no worse than ~10% slower than upstream but results at high concurrency are a bit worse than at low

Results: c32r128 with MyRocks and writes

Summary

  • MyRocks does much better than InnoDB for update-index because it does blind writes rather than RMW for non-unique secondary index maintenance
  • MyRocks does worse at high concurrency than at low




January 09, 2025

Sysbench performance over time for InnoDB and MyRocks: part 3

This is part 3 in my (possibly) final series on performance regressions in MySQL using cached sysbench as the workload. For previous posts, see part 1 and part 2. This post covers performance differences between InnoDB in upstream MySQL 8.0.32, InnoDB in FB MySQL 8.0.32 and MyRocks in FB MySQL 8.0.32 using a server with 24 cores and 64G of RAM.

I don't claim that the MyRocks CPU overhead isn't relevant, but this workload (CPU-bound, database is cached) is a worst-case for it.

tl;dr 

  • InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
  • MyRocks is ~35% slower than InnoDB from upstream as it uses ~1.5X more CPU/query
  • Fixing bug 1506 is important for InnoDB in FB MySQL
  • For writes, MyRocks does worse at high concurrency than at low
  • while MyRocks does much better than InnoDB for update-index at 1 thread, that benefit goes away at 16 threads
Benchmark, Hardware

Much more detail on the benchmark and hardware is here. I am trying to avoid repeating that information in the posts that follow. 

Results here are from the c24r64 server with 24 CPU cores and 64G of RAM. The benchmarks were repeated for 1 and 16 threads. On the charts below that is indicated by NT=1 and NT=16.

Builds

The previous post has more detail on the builds, my.cnf files and bug fixes.

The encoded names for these builds are:
  • my8032_rel_o2nofp
    • InnoDB from upstream MySQL 8.0.32
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
    • FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1. This supports InnoDB and MyRocks.
  • fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
    • FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1 with patches applied for bugs 1473, 1481, 1482 and 1506. This supports InnoDB and MyRocks.
The my.cnf files are:
Relative QPS

The charts and summary statistics that follow use a number that I call the relative QPS (rQPS) where:
  • rQPS is: (QPS for my version) / (QPS for base version)
  • base version is InnoDB from upstream MySQL 8.0.32 (my8032_rel_o2nofp)
  • my version is one of the other versions
Results

The microbenchmarks are split into three groups: point queries, range queries, writes. The tables below have summary statistics for InnoDB and MyRocks using the relative QPS (the same data as the charts).

Results are provided in two formats: charts and summary statistics. The summary statistics tables have the min, max, average and median relative QPS per group (group = point, range and writes).

The spreadsheets and charts are also here. I don't know how to prevent the microbenchmark names on the x-axis from getting truncated in the png files I use here but they are easier to read on the spreadsheet.

The charts use NT=1, NT=16 and NT=24 to indicate whether sysbench was run with 1, 16 or 24 threads. The charts and table use the following abbreviations for the DBMS versions:
  • fbinno-nofix
    • InnoDB from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
  • fbinno-somefix
    • InnoDB from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
  • myrocks-nofix
    • MyRocks from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
  • myrocks-somefix
    • MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

Summary statistics: InnoDB

Summary:

  • InnoDB from FB MySQL is no worse than 9% slower than InnoDB from upstream
  • Fixing bug 1506 is important for InnoDB in FB MySQL

1 thread

fbinno-nofix    min     max     average  median
point           0.88    1.01    0.94     0.95
range           0.68    0.97    0.83     0.83
writes          0.86    0.95    0.90     0.89

fbinno-somefix  min     max     average  median
point           0.94    1.05    0.97     0.96
range           0.88    1.03    0.92     0.91
writes          0.88    0.96    0.92     0.93

16 threads

fbinno-nofix    min     max     average  median
point           0.93    0.96    0.94     0.94
range           0.65    0.95    0.83     0.85
writes          0.88    0.94    0.91     0.91

fbinno-somefix  min     max     average  median
point           0.94    0.97    0.95     0.95
range           0.85    0.96    0.91     0.91
writes          0.89    0.96    0.92     0.91

Summary statistics: MyRocks

Summary

  • MyRocks does better at low concurrency than at high. The fix might be as simple as enabling the hyper clock block cache
  • MyRocks is ~35% slower than upstream InnoDB
  • For writes, MyRocks does worse at high concurrency than at low

1 thread

myrocks-nofix   min     max     average  median
point           0.46    0.78    0.67     0.70
range           0.48    0.73    0.63     0.64
writes          0.65    1.49    0.81     0.73

myrocks-somefix min     max     average  median
point           0.46    0.78    0.66     0.69
range           0.51    0.73    0.65     0.64
writes          0.66    1.54    0.82     0.74

16 threads

myrocks-nofix   min     max     average  median
point           0.52    0.77    0.63     0.63
range           0.46    0.73    0.63     0.61
writes          0.51    1.01    0.67     0.61

myrocks-somefix min     max     average  median
point           0.55    0.79    0.63     0.62
range           0.53    0.74    0.65     0.65
writes          0.50    1.01    0.67     0.62

Results: c24r64 with InnoDB and point queries

Summary

  • results are stable here, InnoDB from FB MySQL is no worse than 10% slower than upstream

Results: c24r64 with MyRocks and point queries

Summary

  • the worst cases for MyRocks are the tests that do point lookups on secondary indexes, because those use a range scan rather than a point lookup on the LSM tree, which means that bloom filters cannot be used

Results: c24r64 with InnoDB and range queries

Summary

  • the worst cases for InnoDB from FB MySQL are the long range scans, and fixing bug 1506 will be a big deal

Results: c24r64 with MyRocks and range queries

Summary

  • while long range scans are the worst case here, bug 1506 is not an issue as that is InnoDB-only

Results: c24r64 with InnoDB and writes

Summary

  • results are stable here, InnoDB from FB MySQL is no worse than ~10% slower than upstream

Results: c24r64 with MyRocks and writes

Summary

  • while MyRocks does much better than InnoDB for update-index at 1 thread, that benefit goes away at 16 threads. It does better at update-index because it does blind writes rather than RMW for non-unique secondary index maintenance. Perhaps the issue at high concurrency is memory system stalls because this server has 2 sockets.



Sysbench performance over time for InnoDB and MyRocks: part 2

This is part 2 in my (possibly) final series on performance regressions in MySQL using cached sysbench as the workload. Part 1 of this series is here. Part 1 documents performance regressions from MySQL 5.6 to 8.0. This post and the ones that follow cover performance differences between InnoDB in upstream MySQL 8.0.32, InnoDB in FB MySQL 8.0.32 and MyRocks in FB MySQL 8.0.32.

I don't claim that the MyRocks CPU overhead isn't relevant, but this workload (CPU-bound, database is cached) is a worst-case for it.

tl;dr

  • InnoDB in FB MySQL 8.0.32 is about 10% slower than InnoDB from upstream 8.0.32, once a few perf bugs get fixed
  • MyRocks in FB MySQL 8.0.32 uses more CPU than InnoDB, thus QPS for CPU-bound workloads is lower than for InnoDB. On the c8r16 server it gets between 55% and 70% of the QPS relative to InnoDB. On the c8r32 server it gets between 61% and 75% of the QPS relative to InnoDB.
  • Fixing bug 1506 is important for InnoDB in FB MySQL
  • MyRocks does much better at update-index because it does blind writes rather than RMW for non-unique secondary index maintenance
    Benchmark, Hardware 

    Much more detail on the benchmark and hardware is here. I am trying to avoid repeating that information in the posts that follow. 

    Tests were run on four different servers. Results in this post are only from c8r16 and c8r32. Posts that follow will have results for c24r64 and c32r128. The servers are:

    • c8r16
      • The c8r16 name stands for 8 CPU cores and 16G of RAM.
    • c8r32
      • The c8r32 name stands for 8 CPU cores and 32G of RAM.
    • c24r64
      • The c24r64 name stands for 24 CPU cores and 64G of RAM.
    • c32r128
      • The c32r128 name stands for 32 CPU cores and 128G of RAM.
    Builds

    In part 1 results were provided for InnoDB from upstream MySQL 5.6, 5.7 and 8.0 and then from MyRocks from FB MySQL 5.6 and 8.0. Here and in the posts that follow I use InnoDB from upstream MySQL 8.0.32, InnoDB from FB MySQL 8.0.32 and MyRocks from FB MySQL 8.0.32. Everything was compiled with gcc, CMAKE_BUILD_TYPE=Release, -O2 and -fno-omit-frame-pointer.

    The encoded names for these builds are:
    • my8032_rel_o2nofp
      • InnoDB from upstream MySQL 8.0.32
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
      • FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1. This supports InnoDB and MyRocks.
    • fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
      • FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1 with patches applied for bugs 1473, 1481, 1482 and 1506. This supports InnoDB and MyRocks.
    The my.cnf files are:
    The bugs mentioned above are specific to FB MySQL and have simple fixes:
    • 1473
      • is_thd_db_read_only_by_name accounts for ~2% of CPU time on CPU-bound and write-heavy microbenchmarks
    • 1481
      • ha_statistic_increment accounts for ~5% of CPU time on CPU-bound table scans
    • 1482
      • calls to clock_gettime in sql/optimizer.cc account for ~4% of CPU time on several microbenchmarks
    • 1505
      • the default for yield_check_frequency is zero which makes MySQL waste much time in calls to thd_wait_yield. A better default is 10. Too many my.cnf options isn't a big deal. Too many options with bad default values is a big deal. The workaround for this is to set yield_check_frequency=10 in my.cnf
    • 1506
      • this is limited to InnoDB in the FB MySQL tree. It is from changes in the concurrency ticket code and reduces CPU-bound table scan performance by ~20%.
    Relative QPS

    The charts and summary statistics that follow use a number that I call the relative QPS (rQPS) where:
    • rQPS is: (QPS for my version) / (QPS for base version)
    • base version is InnoDB from upstream MySQL 8.0.32 (my8032_rel_o2nofp)
    • my version is one of the other versions
    Results

    The microbenchmarks are split into three groups: point queries, range queries, writes. The tables below have summary statistics for InnoDB and MyRocks using the relative QPS (the same data as the charts).

    Results are provided in two formats: charts and summary statistics. The summary statistics tables have the min, max, average and median relative QPS per group (group = point, range and writes).

    The spreadsheets and charts are also here. I don't know how to prevent the microbenchmark names on the x-axis from getting truncated in the png files I use here but they are easier to read on the spreadsheet.

    The charts use NT=1, NT=16 and NT=24 to indicate whether sysbench was run with 1, 16 or 24 threads. The charts and table use the following abbreviations for the DBMS versions:
    • fbinno-nofix
      • InnoDB from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
    • fbinno-somefix
      • InnoDB from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
    • myrocks-nofix
      • MyRocks from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
    • myrocks-somefix
      • MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

    Results: c8r16 with InnoDB

    Summary:

    • The y-axis on the charts doesn't start at 0 to improve readability
    • InnoDB from FB MySQL with the bug fixes listed above is between 4% and 10% slower than InnoDB from upstream
    • The worst regression for InnoDB from FB MySQL occurs on long range scans where FB MySQL is ~30% slower than upstream without the fix for bug 1506

    fbinno-nofix    min     max     average  median
    point           0.89    0.94    0.92     0.93
    range           0.70    0.94    0.83     0.83
    writes          0.84    0.89    0.87     0.87

    fbinno-somefix  min     max     average  median
    point           0.93    0.98    0.96     0.96
    range           0.85    0.98    0.90     0.90
    writes          0.85    0.91    0.90     0.90

    Results: c8r16 with MyRocks

    Summary

    • The y-axis on most of the charts doesn't start at 0 to improve readability. There are two charts for writes and the second truncates outliers to improve readability.
    • MyRocks has more CPU overhead per query than InnoDB, thus CPU-bound QPS is much lower with MyRocks than with InnoDB. Improving this isn't trivial as flamegraphs show that the extra CPU is spread out over many functions.
    • Results for range queries are worse than for point queries because MyRocks uses bloom filters for point queries
    • MyRocks does much better at update-index because it does blind writes rather than RMW for non-unique secondary index maintenance
    myrocks-nofix   min     max     average  median
    point           0.59    0.72    0.67     0.69
    range           0.35    0.68    0.53     0.54
    writes          0.53    1.78    0.79     0.67

    myrocks-somefix min     max     average  median
    point           0.47    0.74    0.66     0.70
    range           0.39    0.68    0.55     0.55
    writes          0.53    1.80    0.80     0.68

    There are two charts for writes. The second chart truncates the y-axis to improve readability for all but the outlier result from update-index.

    Results: c8r32 with InnoDB

    Summary

    • The y-axis on the charts doesn't start at 0 to improve readability.
    • InnoDB from FB MySQL with the bug fixes listed above is between 4% and 10% slower than InnoDB from upstream
    • The worst regression for InnoDB from FB MySQL occurs on long range scans where FB MySQL is ~30% slower than upstream without the fix for bug 1506

    fbinno-nofix    min     max     average  median
    point           0.88    0.95    0.92     0.93
    range           0.68    0.95    0.83     0.84
    writes          0.85    0.95    0.88     0.87

    fbinno-somefix  min     max     average  median
    point           0.93    0.97    0.96     0.96
    range           0.85    0.97    0.90     0.90
    writes          0.89    0.96    0.91     0.90

    Results: c8r32 with MyRocks

    Summary

    • The y-axis on most of the charts doesn't start at 0 to improve readability. There are two charts for writes and the second truncates outliers to improve readability.
    • MyRocks has more CPU overhead per query than InnoDB, thus CPU-bound QPS is much lower with MyRocks than with InnoDB. Improving this isn't trivial as flamegraphs show that the extra CPU is spread out over many functions.
    • Results for range queries are worse than for point queries because MyRocks uses bloom filters for point queries
    • MyRocks does much better at update-index because it does blind writes rather than RMW for non-unique secondary index maintenance
    myrocks-nofix   min     max     average  median
    point           0.45    0.76    0.66     0.70
    range           0.41    0.73    0.59     0.59
    writes          0.65    1.63    0.83     0.73

    myrocks-somefix min     max     average  median
    point           0.48    0.79    0.68     0.72
    range           0.42    0.75    0.61     0.61
    writes          0.66    1.64    0.85     0.75

    There are two charts for writes. The second chart truncates the y-axis to improve readability for all but the outlier result from update-index.


    Vacasa’s migration to Amazon Aurora for a more efficient Property Management System

    Vacasa is North America’s leading vacation rental management platform, revolutionizing the rental experience with advanced technology and expert teams. In the competitive short-term vacation property management industry, efficient systems are critical. To maintain its edge and continue providing top-notch service, Vacasa needed to modernize its primary transactional database to improve performance, provide high availability, and reduce costs. In this post, we share Vacasa’s journey from Amazon Relational Database Service (Amazon RDS) for MariaDB to Amazon RDS for MySQL, and finally to Amazon Aurora, highlighting the technical steps taken and the outcomes achieved.

    Sysbench performance over time for InnoDB and MyRocks: part 1

    I spent much time in 2024 documenting performance regressions from old to new versions of MySQL with InnoDB and MyRocks. More posts will be published in 2025, including this sequence of posts, but my work on that is winding down. Most of the problems are from many small regressions rather than a few big ones and it is very expensive to find and fix regressions long after they arrive.

    Hopefully someone else will emerge to do work like this for MySQL going forward or upstream finds more resources to prevent new small regressions from arriving. The trend over the past decade has been great for Postgres. And if upstream wants to grow Heatwave, then avoiding regressions in the base product is one way to help with that.

    The purpose of this post is to document the regressions from MySQL 5.6 through 8.0 for MyRocks and upstream InnoDB using sysbench with a cached workload.

    tl;dr, v1

    • Modern InnoDB at low concurrency gets about 70% of the QPS relative to older InnoDB (5.6.51). The regressions aren't as bad at high concurrency.
    • Modern MyRocks gets between 80% and 95% of the QPS relative to older MyRocks (5.6.35) in most cases (high-concurrency writes is the exception, where QPS is similar for old and new).

    tl;dr, v2

    • Both upstream MySQL and the FB MySQL project would benefit from changepoint detection (like Nyrkio) using sysbench microbenchmarks to detect regressions
    • Regressions from MySQL 5.6 to 8.0 for InnoDB are worse at low concurrency than at high. MySQL has gotten more efficient at high concurrency workloads, although some of that benefit is lost from code bloat which is more visible at low concurrency. 
    • Regressions from MySQL 5.6 to 8.0 for MyRocks are better than InnoDB at low concurrency, but worse at high concurrency. I suspect that regressions at high concurrency for MyRocks would be better than InnoDB had I enabled the hyper clock cache.

    Builds

    I used InnoDB from upstream MySQL 5.6.51, 5.7.44, 8.0.28 and 8.0.32 and MyRocks from FB MySQL 5.6.35, 8.0.28 and 8.0.32. Everything was compiled with gcc, CMAKE_BUILD_TYPE=Release, -O2 and -fno-omit-frame-pointer.

    For MyRocks, the results in this post use these builds:
    • fbmy5635_rel_o2nofp_210407_f896415f_6190
      • MyRocks 5.6.35 at git hash f896415f (2021/04/07) and RocksDB 6.19.0
    • fbmy5635_rel_o2nofp_231016_4f3a57a1_870
      • MyRocks 5.6.35 at git hash 4f3a57a1 (2023/10/16) and RocksDB 8.7.0
    • fbmy8028_rel_o2nofp_231202_4edf1eec_870
      • MyRocks 8.0.28 at git hash 4edf1eec (2023/12/02) and RocksDB 8.7.0
    • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
      • MyRocks 8.0.32 at git hash ba9709c9 (2024/10/23) and RocksDB 9.7.1
    • fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
      • MyRocks 8.0.32 at git hash ba9709c9 (2024/10/23) and RocksDB 9.7.1 with fixes applied for bugs 1473, 1481, 1482 and 1506. There is also a workaround for bug 1505 via my.cnf changes. But note that bug 1506 is only for InnoDB.
    The bugs mentioned above are specific to FB MySQL and have simple fixes:
    • 1473
      • is_thd_db_read_only_by_name accounts for ~2% of CPU time on CPU-bound and write-heavy microbenchmarks
    • 1481
      • ha_statistic_increment accounts for ~5% of CPU time on CPU-bound table scans
    • 1482
      • calls to clock_gettime in sql/optimizer.cc account for ~4% of CPU time on several microbenchmarks
    • 1505
      • the default for yield_check_frequency is zero which makes MySQL waste much time in calls to thd_wait_yield. A better default is 10. Too many my.cnf options isn't a big deal. Too many options with bad default values is a big deal. The workaround for this is to set yield_check_frequency=10 in my.cnf
    • 1506
      • this is limited to InnoDB in the FB MySQL tree. It is from changes in the concurrency ticket code and reduces CPU-bound table scan performance by ~20%.
    The my.cnf files are in the subdirectories here. For InnoDB with upstream MySQL I used my.cnf.cz11a_c32r128 and these are here for 5.6.51, 5.7.44, 8.0.28 and 8.0.32

    For MyRocks I used the following. None of these enable the hyper clock block cache, which would make modern MyRocks look much better at high concurrency:
    Hardware

    Tests were run on four different servers. Results in this post are only from c32r128 but the posts that follow have results from the other servers. The servers are:
    • c8r16
      • The c8r16 name stands for 8 CPU cores and 16G of RAM. This is a Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
    • c8r32
      • The c8r32 name stands for 8 CPU cores and 32G of RAM. This is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
    • c24r64
      • The c24r64 name stands for 24 CPU cores and 64G of RAM. This is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB,  ext4). The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.
    • c32r128
      • The c32r128 name stands for 32 CPU cores and 128G of RAM. This is a Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4).
    Benchmark

    I used sysbench and my usage is explained here. A full run has 42 microbenchmarks and most test only 1 type of SQL statement. Here I use abbreviated runs with 26 microbenchmarks to save time. The database is cached by InnoDB and MyRocks.

    The benchmark is run with ...
    • c8r16 - 1 thread, 1 table, 30M rows
    • c8r32 - 1 thread, 1 table, 50M rows
    • c24r64 - 1 thread and then 16 threads, 8 tables, 10M rows/table
    • c32r128 - 1 thread and then 24 threads, 8 tables, 10M rows/table
    Each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise. Prepared statements were enabled.

    Relative QPS

    The charts and summary statistics that follow use a number that I call the relative QPS (rQPS) where:
    • rQPS is: (QPS for my version) / (QPS for base version)
    • base version for InnoDB is 5.6.51 and for MyRocks is 5.6.35 (fbmy5635_rel_o2nofp_210407_f896415f_6190)
    • my version is one of the other versions
    Summary statistics

    The microbenchmarks are split into three groups: point queries, range queries, writes. The tables below have summary statistics for InnoDB and MyRocks using the relative QPS of InnoDB from MySQL 8.0.32 and MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506.

    InnoDB with 1 thread
    • based on the median values, InnoDB in 8.0.32 gets ~72% of the QPS relative to 5.6.51

            min     max     avg     median
    point   0.65    0.88    0.73    0.72
    range   0.68    0.99    0.77    0.73
    write   0.50    1.21    0.75    0.70

    InnoDB with 24 threads
    • the results here are better than above. For point queries and writes, upstream made InnoDB better for high-concurrency workloads which counters the performance loss from new CPU overheads. However, that isn't true for range queries. The issue is made worse by some of the refactoring in InnoDB that landed after 8.0.28.

            min     max     avg     median
    point   0.68    1.54    1.09    1.29
    range   0.63    1.13    0.84    0.77
    write   1.05    2.02    1.47    1.45

    MyRocks with 1 thread
    • the regressions here aren't as bad as they are above for InnoDB

            min     max     avg     median
    point   0.79    1.15    0.93    0.93
    range   0.63    0.92    0.84    0.86
    write   0.63    0.87    0.79    0.80

    MyRocks with 24 threads
    • Based on the median values, MyRocks at higher concurrency here does better than at low concurrency above. I am not sure why. Perhaps, had I enabled the hyper clock block cache, the results here would be better than the InnoDB results above at 24 threads.

            min     max     avg     median
    point   0.85    1.22    1.01    0.95
    range   0.68    1.07    0.97    1.01
    write   0.77    0.99    0.91    0.91

    Results 

    The microbenchmarks are split into three groups: point queries, range queries, writes. The charts that follow plot the relative QPS.

    To save space on the charts for MyRocks, the version names on the charts are:
    • 5635.231016
      • fbmy5635_rel_o2nofp_231016_4f3a57a1_870
    • 8028.231202
      • fbmy8028_rel_o2nofp_231202_4edf1eec_870
    • 8032.241023.nofix
      • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
    • 8032.241023.fix
      • fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
    The spreadsheets and charts are here, in the dell.fbin.1 and dell.fbin.24 tabs for InnoDB and the dell.fbrx.1 and dell.fbrx.24 tabs for MyRocks. I don't know how to prevent the microbenchmark names on the x-axis from getting truncated in the png files used here, but they are easier to read on the spreadsheet.

    The chart with NT=1 in the legend has results for 1 thread tests. The chart with NT=24 has results for 24 threads.

    Results: InnoDB and point queries

    Notes for NT=1 (1 thread):
    • the worst regression is from hot-points where 5.6 gets almost 2X more QPS than 8.0
    • for all tests except one, InnoDB in 8.0 gets less than 75% of the throughput relative to 5.6
    • the regression for 8.0.28 in the third group from the right is from bug 102037
    Notes for NT=24 (24 threads)
    • Upstream MySQL has improvements for concurrent workloads in 5.7 and 8.0 which offset the CPU regressions.
      • With 5.7.44, 5 tests are much faster than 5.6.51, 2 tests have similar QPS and 2 are slower
      • With 8.0.32, 5 tests are much faster than 5.6.51 and 4 are slower
    Results: MyRocks and point queries

    Notes for NT=1 (1 thread):
    • Regressions from 5.6 to 8.0 are smaller here for MyRocks than above for InnoDB
    • In 8.0.32, 7 tests are slower than 5.6.35 and 2 are slightly faster
    • The regression for 8.0.28 in the third group from the right is from bug 102037
    Notes for NT=24 (24 threads):
    • The results here are about the same as the results above for MyRocks at 1 thread. Were I to enable the hyper clock cache they might be much better.
    Results: InnoDB and range queries

    Notes for NT=1 (1 thread):
    • The regressions are large for everything but long range scans. And for long range scans, while 5.7.44 is faster than 5.6, that benefit no longer exists in 8.0, especially in 8.0.30+.
    Notes for NT=24 (24 threads):
    • Results here are much better than above for InnoDB at 1 thread. But about half of the tests still have large regressions.
    Results: MyRocks and range queries

    Notes for NT=1 (1 thread):
    • Results here are similar to what I see above for InnoDB at 1 thread. I will explain more elsewhere, but it has been hard to attribute the root causes to anything other than code bloat, as in more cache and TLB activity.
    Notes for NT=24 (24 threads):
    • Results here at high concurrency are much better than above at low concurrency, with the exception of the results for the full scan (the last group on the right).
    Results: InnoDB and writes

    Notes for NT=1 (1 thread):
    • All of the tests have large regressions in 8.0.32 except for update-index
    Notes for NT=24 (24 threads):
    • Results here are much better than above at 1 thread. Modern InnoDB is much better at high-concurrency workloads.
    Results: MyRocks and writes

    Notes for NT=1 (1 thread):
    • MyRocks has large regressions at low concurrency.
    Notes for NT=24 (24 threads):
    • The regressions here at high concurrency are about 10% smaller than the regressions above at low concurrency.

    Database Sharding

    Learn about the database sharding scaling pattern in this interactive blog.

    January 08, 2025

    I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades

    I recently came across the Occult paper (NSDI'17) during my series on "The Use of Time in Distributed Databases." I had high expectations, but my in-depth reading surfaced significant concerns about its contributions and claims. Let me share my analysis, as there are still many valuable lessons to learn from Occult about causality maintenance and distributed systems design.


    The Core Value Proposition

    Occult (Observable Causal Consistency Using Lossy Timestamps) positions itself as a breakthrough in handling causal consistency at scale. The paper's key claim is that it's "the first scalable, geo-replicated data store that provides causal consistency without slowdown cascades."

    The problem they address is illustrated in Figure 1, where a slow/failed shard A (with delayed replication from master to secondary) can create cascading delays across other shards (B and C) due to dependency-waiting during write replication. This is what the paper means by "slowdown cascades". Occult's solution shifts this blocking from writes to reads: it eliminates dependency-blocking during write replication and instead, when serving reads, delays reads from shards that are lagging behind until they reflect what the client has already seen, which is what provides causal consistency.


    Questionable Premises

    The paper presents dependency-blocking across shards for writes as a common problem, yet I struggle to identify any production systems that implement causal consistency (or any stronger consistency) this way. The cited examples are all academic systems like COPS, not real-world databases.

    More importantly, the paper's claim of being "first" overlooks many existing solutions. Spanner (2012) had already demonstrated how to handle this challenge effectively using synchronized clocks, achieving even stronger consistency guarantees (strict-serializability) without slowdown cascades. Spanner already does what Occult proposes to do: it shifts write-blocking to read-blocking, and also uses synchronized clocks to help for reads.

    The paper acknowledges Spanner only briefly in related work, noting its "heavier-weight mechanisms" and, maybe as a result, "aborting more often" - but this comparison feels incomplete given Spanner's superior consistency model and production-proven success.


    The Secondary Reads Trade-off

    A potential justification for Occult's approach is enabling reads from nearby secondaries without client stickiness. However, the paper doesn't analyze this trade-off at all. Why not just read from the masters? When are secondary reads sufficiently beneficial to justify the added complexity? The paper itself notes in the introduction that inconsistency at secondaries is extremely rare - citing Facebook's study showing fewer than six out of every million reads violating causal consistency even in an eventually-consistent data store.

    In examining potential justifications for secondary reads, I see only two viable scenarios. First, secondary reads might help when the primary is CPU-constrained - but this only applies under very limited circumstances, typically with the smallest available VM instances, which is unlikely for production systems. Second, there's the latency consideration: while cross-availability-zone latencies wouldn't justify secondary reads in this case, cross-region latencies might make them worthwhile for single-key operations. However, even this advantage diminishes for transactional reads within general transactions, where reading from masters is more sensible to avoid transaction aborts due to conflicts, which would waste all the work done as part of the transaction.


    The Client Session Problem

    The paper's handling of client sessions (or rather the lack thereof) reveals another limitation. Their example (from the conference presentation) couples unrelated operations wholesale - like linking a social media post to an academic document share - into the same dependency chain. Modern systems like MongoDB's causal consistency tokens (introduced in 2017, the same year) provide a better approach to session management.

    Comparing Occult with Spanner reveals some interesting nuances. While Spanner can read from secondaries, it requires them to catch up to the current time (T_now). Occult takes a different approach by maintaining client causal shardstamps, allowing reads from secondaries without waiting for T_now synchronization. This theoretically enables causal consistency using earlier clock values than Spanner's current-time requirement.

    However, this theoretical advantage comes with significant practical concerns, as we mentioned above in the secondary reads tradeoff. Occult shifts substantial complexity to the client side, but the paper inadequately addresses the overhead and coordination requirements this imposes. The feasibility of expecting clients (which are typically frontend web proxies in the datacenter) to handle such sophisticated coordination with servers remains questionable.


    The Fatal Flaw

    The most concerning aspect of Occult emerges in Section 6 on fault tolerance. The paper reveals that correctness under crash failures requires synchronous replication of the master, using Paxos or similar protocols, before the "asynchronous" replication to secondaries. Say what!? This requirement fundamentally undermines the system's claimed benefits from asynchronous replication.

    Let me quote from the paper. "Occult exhibits a vulnerability window during which writes executed at the master may not yet have been replicated to slaves and may be lost if the master crashes. These missing writes may cause subsequent client requests to fail: if a client c’s write to object o is lost, c cannot read o without violating causality. This scenario is common to all causal systems for which clients do not share fate with the servers to which they write.

    Occult's client-centric approach to causal consistency, however, creates another dangerous scenario: as datacenters are not themselves causally consistent, writes can be replicated out of order. A write y that is dependent on a write x can be replicated to another datacenter despite the loss of x, preventing any subsequent client from reading both x and y.

    Master failures can be handled using well-known techniques: individual machine failures within a datacenter can be handled by replicating the master locally using chain-replication or Paxos, before replicating asynchronously to other replicas."

    Unfortunately, the implementation ignores fault tolerance and the evaluation omits crash scenarios entirely, focusing only on node slowdown. This significant limitation gets just one paragraph in the fault-tolerance section and isn't adequately addressed in the paper's discussion.


    Learnings

    Despite these criticisms, studying this paper has been valuable. It prompts important discussions about:

    • Trade-offs in causality maintenance
    • The role of synchronized clocks in distributed systems
    • The importance of evaluating academic proposals against real-world requirements

    For those interested in learning more, I recommend watching the conference presentation, which provides an excellent explanation of the protocol mechanics.


    The protocol

    At its core, Occult builds upon vector clocks with some key modifications. Servers attach vector-timestamps to objects and track shard states using "shardstamps", and clients also maintain vector-timestamps to keep tabs on their interactions with the servers. As with vector clocks, shardstamp updates occur through pairwise maximum operations between corresponding entries.
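
    To make this bookkeeping concrete, here is a minimal Python sketch of shardstamps as I understand them from the paper; the class, method names, and data shapes are mine, not Occult's code.

        # Minimal sketch of vector-clock-style shardstamps.
        # Names and structure are illustrative, not Occult's implementation.

        def merge(a: dict, b: dict) -> dict:
            """Pairwise maximum over corresponding shard entries."""
            out = dict(a)
            for shard, stamp in b.items():
                out[shard] = max(out.get(shard, 0), stamp)
            return out

        class Client:
            """Tracks what this client has observed, one entry per shard."""
            def __init__(self):
                self.causal = {}  # shard id -> highest shardstamp seen

            def on_write_ack(self, shard: int, shardstamp: int):
                self.causal[shard] = max(self.causal.get(shard, 0), shardstamp)

            def on_read(self, object_timestamp: dict):
                # Fold in the vector timestamp attached to the object read.
                self.causal = merge(self.causal, object_timestamp)

            def can_read_from(self, shard: int, replica_shardstamp: int) -> bool:
                # A lagging secondary is safe to read only if it has caught up
                # to everything this client already observed for that shard;
                # otherwise the client fails over (e.g. to the master).
                return replica_shardstamp >= self.causal.get(shard, 0)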

    This approach works straightforwardly for single-key updates but becomes more complex for transactions, where a commit timestamp must be chosen and applied consistently across all objects involved. Occult implements transactions using Optimistic Concurrency Control (OCC), but with specific validation requirements. The validation phase must verify two critical properties: first, that the transaction's read timestamp represents a consistent cut across the system, and second, that no conflicting updates occurred before commit. Transaction atomicity is preserved by making writes causally dependent, ensuring clients see either all or none of a transaction's writes - all without introducing write delays that could trigger slowdown cascades.
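
    As a rough illustration of that validation step, here is a highly simplified Python sketch of the two checks, as I read them; the data shapes and names are hypothetical and ignore many details (for example, object-level conflict granularity and timestamp compression).

        # Highly simplified sketch of the two OCC validation checks.
        # Data shapes and names are hypothetical, not Occult's code.

        def is_consistent_cut(reads: dict) -> bool:
            """reads: {shard: (stamp_read, deps)} where stamp_read is the
            shardstamp of the version read from that shard and deps is the
            vector timestamp (shard -> shardstamp) attached to that version.
            The cut is consistent if no version read depends on a newer
            version of another shard than the version actually read there."""
            for _, (_, deps) in reads.items():
                for other, needed in deps.items():
                    if other in reads and needed > reads[other][0]:
                        return False
            return True

        def occ_validate(reads: dict, latest: dict) -> bool:
            """latest: {shard: current shardstamp} at validation time.
            The second check (coarsened to shard granularity here) aborts if
            anything read has since been overwritten on its shard."""
            return is_consistent_cut(reads) and all(
                stamp_read == latest.get(shard, stamp_read)
                for shard, (stamp_read, _) in reads.items()
            )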

    The protocol employs several optimizations. To manage timestamp overhead, it uses both structural and temporal compression techniques. It also leverages loosely synchronized clocks to avoid spurious dependencies between shards, as I discuss below. A key contribution is PC-PSI (Per-Client Parallel Snapshot Isolation), a relaxed version of PSI that requires total ordering only within client sessions rather than globally. This modification enables more efficient transaction processing while maintaining essential consistency guarantees.


    Discussion on the use of synchronized clocks

    A critical aspect of Occult is the requirement for monotonic clocks - clock regression would break causality guarantees for reads. While Lamport logical clocks could provide this monotonicity, synchronized physical clocks serve a more sophisticated purpose: they prevent artificial waiting periods caused by structural compression of timestamps. The compression maps multiple shards to the same entry in the compressed shardstamp using modulo n operations.

    The paper illustrates this problem in Section 5 with a clear example: when two shards (i and j) map to the same compressed entry and have vastly different shardstamps (100 and 1000), a client writing to j would fail the consistency check when reading from a slave of i until i has received at least 1000 writes. If shard i never reaches 1000 writes, the client must perpetually failover to reading from i's master shard. Rather than requiring explicit coordination between shards to solve this problem, Occult leverages loosely synchronized physical clocks to bound these differences naturally. This guarantees that shardstamps differ by no more than the relative offset between their clocks, independent of the write rate on different master shards.
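
    The Python sketch below illustrates this structural-compression problem with the numbers from the example above; n, the shard ids, and the function names are mine and only illustrative.

        # Structural compression maps many shards onto n compressed entries
        # via modulo n. Names and values here are illustrative.
        N = 4  # hypothetical number of compressed entries

        def entry(shard_id: int) -> int:
            return shard_id % N

        def observe_write(compressed: list, shard_id: int, shardstamp: int) -> None:
            i = entry(shard_id)
            compressed[i] = max(compressed[i], shardstamp)

        def can_read(compressed: list, shard_id: int, replica_shardstamp: int) -> bool:
            # The check is against the coarser compressed entry.
            return replica_shardstamp >= compressed[entry(shard_id)]

        client = [0] * N
        i, j = 2, 6                       # both map to entry 2 when N = 4
        observe_write(client, j, 1000)    # client writes to shard j (shardstamp 1000)
        # A replica of shard i at shardstamp 100 now fails the check even though
        # the client never depended on shard i: a spurious failover to i's master.
        print(can_read(client, i, 100))   # False
        # Deriving shardstamps from loosely synchronized clocks bounds how far
        # i and j can drift apart, keeping such gaps small.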

    This approach resonates with my earlier work on Hybrid Logical Clocks (2013), where we argued for combining synchronized physical clocks with Lamport clock-based causality tracking. The effectiveness of this strategy is demonstrated dramatically in Occult's results - they achieve compression from 16,000 timestamps down to just 5 timestamps through synchronized clocks.

    Occult's use of synchronized clocks for causality compression ties into our broader theme about the value of shared temporal reference frames in distributed systems. This connection between physical time and distributed system efficiency deserves deeper exploration.

    January 06, 2025

    Read data from dropped columns

    It’s hard to destroy data. Even when a column is dropped, the data is still physically there on the data page. We can use the undocumented/unsupported command DBCC PAGE to look at it. Example: First, create a table based on English values from sys.messages: use tempdb; go /* create table based on english values from […]

    The post Read data from dropped columns first appeared on Michael J. Swart.