January 17, 2025
Logical replication in Postgres: Basics
This is an external post of mine.
January 16, 2025
I rebuilt the Auth0 Activity Page with webhooks and Tinybird
January 14, 2025
Use of Time in Distributed Databases (part 5): Lessons learned
This concludes our series on the use of time in distributed databases, where we explored how the use of time in distributed systems evolved from a simple ordering mechanism into a sophisticated tool for coordination and performance optimization.
A key takeaway is that time serves as a shared reference frame that enables nodes to make consistent decisions without constant communication. While the AI community grapples with alignment challenges, in distributed systems we have long confronted our own fundamental alignment problem. When nodes operate independently, they essentially exist in their own temporal universes. Synchronized time provides the global reference frame that bridges these isolated worlds, allowing nodes to align their events and states coherently.
At its core, synchronized time serves as an alignment mechanism in distributed systems. As explored in Part 1, synchronized clocks enable nodes to establish "common knowledge" through a shared time reference, which is powerful for conveying the presence (not just absence) of information like lease validity without requiring complex two-way communication protocols to account for message delays and asymmetry.
This alignment manifests in several powerful ways. When nodes need to agree on a consistent snapshot of the database, time provides a natural coordination point, so a big benefit of aligning timelines is effortless consistent MVCC snapshots. Commit timestamping establishes transaction ordering, version management uses timestamps to track different versions of data in MVCC systems, and snapshot selection uses timestamps to identify consistent read points at each node independently. Many of the systems we reviewed, including Spanner, CockroachDB, Aurora Limitless, and DSQL, demonstrate the power of consistent global reads using timestamps.
Alignment also enables effective conflict detection. By comparing timestamps, systems can detect concurrent modifications—a foundational element of distributed transactions. This particularly benefits OCC-based systems and those using snapshot isolation, including DynamoDB, MongoDB, Aurora Limitless, and DSQL.
Another benefit of alignment is to serve as a fencing mechanism, effectively creating "one-way doors" that prevent stale operations from causing inconsistencies. A simple realization of this is leases. Aurora Limitless uses consistency leases that expire after a few seconds to prevent stale reads from zombie nodes. This is a classic example of time-based fencing: once a node's lease expires, it cannot serve reads until it obtains a new lease. Aurora DSQL employs time for fencing its adjudicator service. Instead of using Paxos for leader election, adjudicators use time ranges to fence off their authority over key ranges. When an adjudicator takes over responsibility for a key range, it establishes a time-based fence that prevents older adjudicators from making decisions about that range. These applications all stem from time's fundamental role as a shared reference frame, separating the past from the frothiness of the present.
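To make the fencing idea concrete, here is a minimal sketch of a lease check (hypothetical names and parameters, not Aurora's actual code) that refuses to serve reads once the lease has expired, allowing for a bounded clock error:

import time

CLOCK_ERROR_BOUND = 0.005  # assumed worst-case clock error between nodes, in seconds

class LeaseHolder:
    def __init__(self, lease_duration=5.0):
        self.lease_duration = lease_duration
        self.lease_expiry = 0.0

    def renew_lease(self, granted_at=None):
        # The lease is granted relative to synchronized wall-clock time.
        granted_at = time.time() if granted_at is None else granted_at
        self.lease_expiry = granted_at + self.lease_duration

    def can_serve_read(self):
        # Conservative check: add the clock error bound so a slow local clock
        # cannot trick a zombie node into serving reads past lease expiry.
        return time.time() + CLOCK_ERROR_BOUND < self.lease_expiry

holder = LeaseHolder()
holder.renew_lease()
print(holder.can_serve_read())   # True: fresh lease, reads allowed
holder.lease_expiry = 0.0        # simulate an expired lease on a zombie node
print(holder.can_serve_read())   # False: fenced off until a new lease is obtained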
Another benefit of alignment is speculative execution, to achieve better performance in the common case while still maintaining consistency/correctness across all scenarios. Modern systems increasingly embrace time-based speculation. Nezha introduced the Deadline-Ordered Multicast (DOM) primitive, which assigns deadline timestamps to requests and only delivers them after the deadline is reached. This creates a temporal fence ensuring consistent ordering across receivers while providing a speculation window for execution preparation. Aurora DSQL uses predictive commit timing, where it sets commit timestamps slightly in the future. This is a form of speculation that maintains read performance by ensuring reads are never blocked by writes while still preserving consistency guarantees.
The key in both cases is that time serves as a performance improvement without bearing the burden of correctness. Ultimately, we still need information transfer to ensure that the correct scenario played out, and there were no omitted messages, failures, etc. When our speculation fails, a post validation/recovery/cleanup mechanism kicks in. This approach introduces some metastability risk, so careful implementation is required.
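As a side note, the DOM primitive described above is easy to sketch (my own simplification in Python, not Nezha's implementation): the sender stamps each request with a deadline, and every receiver holds requests until the deadline passes and then releases them in timestamp order, so all receivers see the same sequence:

import heapq
import time

class DeadlineOrderedQueue:
    # Simplified Deadline-Ordered Multicast receiver (illustrative only).
    def __init__(self):
        self._heap = []   # entries are (deadline, seq, request)
        self._seq = 0

    def receive(self, deadline, request):
        heapq.heappush(self._heap, (deadline, self._seq, request))
        self._seq += 1

    def deliverable(self, now=None):
        # Release requests whose deadline has passed, in deadline order.
        now = time.time() if now is None else now
        out = []
        while self._heap and self._heap[0][0] <= now:
            out.append(heapq.heappop(self._heap)[2])
        return out

q = DeadlineOrderedQueue()
q.receive(deadline=105.0, request="write x")
q.receive(deadline=103.0, request="write y")
print(q.deliverable(now=110.0))   # ['write y', 'write x'] -- same order at every receiver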
Correctness
That brings us to correctness. Barbara Liskov's 1991 principle remains fundamental: leverage time for performance optimization, not correctness. This separation of concerns appears consistently in modern systems. Google's Spanner uses TrueTime for performance optimization while relying on traditional locking and consensus for correctness. DynamoDB leverages timestamps to streamline transaction processing but doesn't depend on perfect synchronization for correctness. Similarly, Accord and Nezha employ synchronized clocks for fast-path execution while maintaining correctness through fallback mechanisms.
A critical aspect when using time is the requirement for monotonic clocks, as clock regression would break causality guarantees for reads. Hybrid Logical Clocks (HLC) provide good insurance in NTP-based systems: by combining physical clock time with the strengths of Lamport logical clocks, HLC provides this monotonicity. So rather than relying solely on physical time, most contemporary databases use HLC to merge physical and logical time and get the benefits of both.
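For reference, the core HLC update rules can be sketched as follows (a bare-bones Python rendering of the standard HLC algorithm, not any particular database's implementation):

import time

class HLC:
    # Hybrid Logical Clock: physical time plus a logical counter for monotonicity.
    def __init__(self):
        self.l = 0   # highest physical time observed so far (ms)
        self.c = 0   # logical counter, used when physical time does not advance

    def now(self):
        pt = int(time.time() * 1000)   # local physical clock in milliseconds
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1                # clock stalled or regressed: stay monotonic
        return (self.l, self.c)

    def update(self, msg_l, msg_c):
        # On message receipt, advance past both the local clock and the sender's timestamp.
        pt = int(time.time() * 1000)
        new_l = max(self.l, msg_l, pt)
        if new_l == self.l and new_l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

clock = HLC()
t1 = clock.now()
t2 = clock.update(*t1)   # receiving a message stamped t1 never moves the clock backwards
assert t2 > t1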
Modern clocks provide dynamically bounded uncertainty, returning lower and upper bounds on the current time to account for imperfect synchronization. These clockbound APIs also help prevent placing unfounded confidence in a given timestamp. When using these tightly synchronized clocks with clockbound APIs, two things need to go wrong for correctness to be violated (a small sketch of such an interval check follows the list below):
- the clock/timestamp needs to go wrong, and
- the clockbound API must continue to report high confidence (tight bounds) around that wrong clock, rather than widening its bounds.
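As a concrete illustration (a hypothetical interval-returning clock API; the real AWS ClockBound library differs in details), bounded uncertainty lets a node decide whether a timestamp is safely in the past before serving a read at that timestamp:

import time
from dataclasses import dataclass

ERROR_BOUND = 0.0005   # assumed synchronization error bound, in seconds

@dataclass
class TimeInterval:
    earliest: float
    latest: float

def clock_now():
    # Hypothetical clockbound-style API: the true time is guaranteed to lie
    # within [earliest, latest].
    t = time.time()
    return TimeInterval(t - ERROR_BOUND, t + ERROR_BOUND)

def safe_to_read_at(ts):
    # A snapshot read at timestamp ts is safe only if ts is definitely in the
    # past, i.e. earlier than the earliest possible current time.
    return ts < clock_now().earliest

print(safe_to_read_at(time.time() - 1.0))   # True: one second ago is definitely in the past
print(safe_to_read_at(time.time() + 1.0))   # False: not yet safe to serve this timestamp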
OK, let's say these HLC or clockbound safeguards failed: what could go wrong? Obviously, you could read a stale value from an MVCC system if you demand to read at a timestamp in the past. On the other hand, correctness on the write path of these distributed databases is well preserved by locks. A full study of correctness is needed for a more comprehensive understanding, but my understanding is that all of these systems maintain data integrity even when clock synchronization falters. Performance decreases, but write-path correctness persists.
Future trends
Several patterns are emerging. There is definitely an accelerating trend of adoption of synchronized clocks in distributed databases, as our survey shows. We also see that distributed systems and databases are becoming more sophisticated about their use of time. Early systems leveraged synchronized clocks primarily for read-only transactions and snapshots, reaping low-hanging fruit. Over time, these systems evolved to tackle read-write transactions and employ more advanced techniques, and synchronized clocks take on increasingly critical roles in database design. By aligning nodes to a shared reference frame, synchronized time eliminates the need for some coordination message exchanges across nodes/partitions, and helps cut down latency and boost throughput in distributed multi-key operations.
We also see time-based speculation gaining traction in distributed databases. Maybe as another level of improvement, we will see systems automatically adjusting their use of time-based speculation depending on the smoothness of the operating conditions. The more a system relies on synchronized time for common-case optimization, the more vulnerable it becomes to failure of the common-case scenario or the degradation of clock synchronization. So we need smoother transitions between modes for speculation-based systems.
More research is needed on the tradeoffs between time synchronization precision and isolation guarantees. Stricter isolation levels like external consistency require tighter clock synchronization and longer commit wait times. This raises important questions about the value proposition of strict-serializability. Snapshot isolation hits a really nice tradeoff in the isolation space, and it would be interesting to put more research into isolation-performance tradeoffs using synchronized clocks.
On the infrastructure front, cloud time services like AWS TimeSync grow increasingly crucial as databases migrate to cloud environments. These services offer better precision and availability to support the use of time in distributed systems.
Automating cherry-picks between OSS and private forks
January 13, 2025
Evaluating vector indexes in MariaDB and pgvector: part 2
This post has results from the ann-benchmarks with the fashion-mnist-784-euclidean dataset for MariaDB and Postgres (pgvector) with concurrent queries (--batch). My previous post has results when not using concurrent queries. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.
Performance for MariaDB is excellent:
- peak QPS for a given recall target is much better than pgvector
- time to create the index to achieve a given recall target is almost always much better than pgvector
Part 1 has all of the details.
I am using the ann-benchmarks via my fork of a fork of a fork at this commit. I added --batch to the command lines listed in part 1 to repeat the benchmark using concurrent queries. Note that parallel index create was disabled for Postgres and isn't (yet) supported by MariaDB.
With --batch there is one concurrent query per CPU core and my server has 8 cores.
The results below for MariaDB are excellent. It gets more QPS than pgvector at a given recall target.
- Indexes have a similar size with MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
- Time to create the index for Postgres is similar with and without halfvec
- Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index for a similar point on the QPS/recall graph (see the last section of this blog post)
With ann-benchmarks the constraint is typically recall: you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).
For all cases below the best QPS at a given recall target is much better, sometimes 3X better, with MariaDB than with pgvector.
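For anyone reproducing this, the selection step is simple: from all (configuration, recall, QPS) result points, keep the ones that meet the recall target and report the best QPS among them. A small sketch with made-up numbers (not the actual results):

# Each tuple: (engine, M, ef_construction, ef_search, recall, qps) -- made-up numbers
results = [
    ("mariadb",          16, 200, 10, 0.962, 4100),
    ("mariadb",          16, 200, 20, 0.981, 3300),
    ("pgvector_float32", 16, 200, 20, 0.958, 1200),
    ("pgvector_halfvec", 16, 200, 20, 0.961, 1450),
]

def best_qps(results, engine, recall_target):
    # Keep configurations that meet the recall target, then take the fastest one.
    qualifying = [r for r in results if r[0] == engine and r[4] >= recall_target]
    return max(qualifying, key=lambda r: r[5], default=None)

for engine in ("mariadb", "pgvector_float32", "pgvector_halfvec"):
    print(engine, best_qps(results, engine, recall_target=0.96))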
How fast do you ship? Measure your deployment velocity with Vercel and Tinybird
January 12, 2025
Evaluating vector indexes in MariaDB and pgvector: part 1
This post has results for vector index support in MariaDB and Postgres. I am new to vector indexes so I will start small and over time add more results. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.
tl;dr
- Performance for MariaDB is excellent
- peak QPS for a given recall target is much better than pgvector
- time to create the index to achieve a given recall target is almost always much better than pgvector
- MariaDB is easier to tune than pgvector
- MariaDB needs more documentation
- this blog post on benchmarks
- this overview
- this syntax description
- the algorithm is described as a variation of HNSW (MHNSW) and the community will benefit from more details on what has been changed. For example, I can't set ef_construction and it always uses float16 in the index. But with pgvector I can use either float32 (default) or float16 (via halfvec).
- What transaction isolation level is supported? I get that workloads will be read-heavy but I assume that some won't be read-only so I want to know whether repeatable read and read committed are provided.
- What concurrent operations are supported? Obviously, reads can be concurrent with other reads. But can writes be concurrent with reads or with other writes on the same table?
While I already installed some dependencies on this server long ago, to run this benchmark I did:
sudo apt install libmariadb3 libmariadb-dev
pip3 install mariadb
pip3 install pgvector psycopg
For MariaDB I compiled version 11.7.1 from source. Files that I used include:
I am using the ann-benchmarks via my fork of a fork of a fork at this commit. These forks have changes to run the benchmark for MariaDB and pgvector without using Docker containers. A request I have for anyone writing a benchmark client is to limit dependencies, or at least make them optional. I just want to point the benchmark client at an existing installation.
In this post I use the fashion-mnist-784-euclidean dataset in non-batch mode where non-batch mode doesn't run concurrent queries and batch mode does. Note that I don't set --count when running the benchmark which means that all of the queries use LIMIT 10.
Files related to these tests are archived here.
The command lines for non-batch mode are:
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
python3 -u run.py --algorithm pgvector --dataset fashion-mnist-784-euclidean --local
POSTGRES_CONN_ARGS=root:pw:127.0.0.1:5432 POSTGRES_DB_NAME=ib \
python3 -u run.py --algorithm pgvector_halfvec --dataset fashion-mnist-784-euclidean --local
MARIADB_CONN_ARGS=root:pw:127.0.0.1:3306 MARIADB_DB_NAME=test \
python3 -u run.py --algorithm mariadb --dataset fashion-mnist-784-euclidean --local
Results: QPS vs recall graphs
- Indexes have a similar size with MariaDB and Postgres with halfvec. The Postgres index without halfvec is about 2X larger.
- Time to create the index for Postgres is similar with and without halfvec
- Time to create the index for MariaDB is less than for pgvector. Perhaps the best way to compare this is the time to create the index for a similar point on the QPS/recall graph (see the last section of this blog post)
With ann-benchmarks the constraint is typically recall: you seek the best QPS that satisfies a recall target. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16).
For all cases below the best QPS at a given recall target is much better, sometimes 3X better, with MariaDB than with pgvector.
January 11, 2025
SQLite: ANALYZE
January 10, 2025
Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases
This is part 4 of our "Use of Time in Distributed Databases" series. In this post, we explore how synchronized physical clocks enhance production database systems.
Spanner
Google's Spanner (OSDI'12) implemented a novel approach to handling time in distributed database systems through its TrueTime API. TrueTime provides time as an interval that is guaranteed to contain the actual time, maintained within about 6ms of uncertainty (this is the 2012 published number, which has improved significantly since then) using GPS receivers and atomic clocks. This explicit handling of time uncertainty allows Spanner to provide strong consistency guarantees while operating at a global scale.
Spanner uses multi-version concurrency control (MVCC) and achieves external consistency (linearizability) for current transactions through techniques like "commit wait," where transactions wait out the uncertainty in their commit timestamps before making their writes visible. Spanner uses a Paxos-groups approach to shard (partition) data across up to hundreds of servers, but thanks to TrueTime clocks, the system is able to provide linearizability to transactions without having to coordinate across all shards.
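The commit-wait rule itself is easy to state in code. A minimal sketch using a TrueTime-style interval API (names and the uncertainty value are illustrative, not Spanner's implementation):

import time

EPSILON = 0.003   # assumed half-width of the TrueTime uncertainty interval, in seconds

def tt_now():
    # TrueTime-style API: returns (earliest, latest), an interval guaranteed to
    # contain the true time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    # Choose a commit timestamp no earlier than the latest possible current time...
    commit_ts = tt_now()[1]
    apply_writes(commit_ts)
    # ...then wait until commit_ts is guaranteed to be in the past before
    # acknowledging, so any transaction that starts afterwards sees a larger timestamp.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.0001)
    return commit_ts

commit(lambda ts: print("applying writes at", ts))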
The real power of TrueTime becomes apparent in its handling of lock-free snapshot reads (reads of past database states). Without TrueTime, capturing consistent snapshots across a distributed database would require extensively tracking and storing causality relationships between different versions of data across the system. By using precisely synchronized time as the basis for versioning, Spanner can identify a globally consistent read of the database at any point in the past simply by specifying a timestamp, in a lock-free manner.
CockroachDB
CockroachDB builds on concepts from Google's Spanner, using a distributed key-value store organized into ranges (approximately 64MB chunks)--with each range forming its own Raft consensus group.
The key innovation of CockroachDB lies in achieving strong consistency and geo-distribution without relying on tightly-synchronized atomic clocks. Instead, it uses NTP-synchronized clocks alongside hybrid logical clocks (HLCs) and intent locks to handle time uncertainty. Unlike Spanner, which uses TrueTime and "commit-wait" to handle clock uncertainty, CockroachDB refrains from commit-waits due to longer NTP uncertainty intervals. CockroachDB dynamically adjusts transaction timestamps upward when conflicts occur within the uncertainty window. This approach eliminates the need for tightly-synchronized clocks, but leads to more transaction aborts/restarts under high contention.
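A simplified illustration of the uncertainty-window rule (my sketch, not CockroachDB code): if a read encounters a value whose commit timestamp falls inside the reader's uncertainty interval, the reader cannot tell whether that write really happened before it started, so it pushes its timestamp forward and retries:

MAX_CLOCK_OFFSET_MS = 250   # assumed maximum clock offset between nodes

class UncertaintyRestart(Exception):
    def __init__(self, new_ts):
        self.new_ts = new_ts

def read_key(txn_ts, value_versions):
    # value_versions: list of (commit_ts_ms, value), sorted ascending by timestamp.
    uncertainty_limit = txn_ts + MAX_CLOCK_OFFSET_MS
    visible = None
    for commit_ts, value in value_versions:
        if commit_ts <= txn_ts:
            visible = value                        # definitely in the past: readable
        elif commit_ts <= uncertainty_limit:
            # Ambiguous: the write may have committed before this transaction
            # started, on a node whose clock runs ahead. Push and retry.
            raise UncertaintyRestart(new_ts=commit_ts)
        else:
            break                                  # definitely in the future: ignore
    return visible

print(read_key(txn_ts=150, value_versions=[(100, "v1"), (450, "v2")]))   # v1
try:
    read_key(txn_ts=150, value_versions=[(100, "v1"), (300, "v2")])
except UncertaintyRestart as e:
    print("restart with pushed timestamp", e.new_ts)                     # 300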
CockroachDB implements serializable isolation through a MVCC system. Serializable isolation guarantees transactions appear to execute in isolation, but strict serializability adds a real-time ordering constraint: if transaction T1 commits before T2 begins, T1's commit timestamp must precede T2's. With NTP-synchronized clocks, CockroachDB achieves serializability but may violate strict serializability due to clock skew between nodes.
This limitation is illustrated through a scenario with three transactions across two keys: when transaction T1 reads both keys, it may see T3's update to y but not T2's earlier update to x, violating strict serializability. While this execution is serializable (as the transactions could be ordered T3, T1, T2), it breaks the real-time ordering guarantee that strict serializability requires. CockroachDB's reliance on looser clock synchronization makes such guarantees impossible without significant performance penalties. Spanner avoids these issues through its TrueTime API and tightly synchronized clocks, as it can afford to perform commit-wait (delaying transaction commits until the timestamp's upper bound has passed).
A recent paper investigated integrating high-precision clock synchronization into CockroachDB and the resulting impact on performance. This is a good read if you are interested in understanding CockroachDB's handling of time uncertainty in transaction processing.
YugabyteDB, which has a similar Raft-groups based shared nothing architecture, recently posted a great analysis of how to make AWS Time Sync Service with Precision Hardware Clock (with 50 microsecond clockbound) work with their database.
MongoDB
MongoDB's implementation of causal consistency also relies on Hybrid Logical Clocks (HLC). The system tracks ClusterTime as a tuple of Unix epoch seconds and an increment counter, allowing for fine-grained ordering within the same second.
This combines the benefits of physical and logical time, incrementing the logical clock only when operations are written to the primary node's replication log (op log).
Causal consistency is implemented through sessions, where each operation carries the highest known operationTime, and nodes wait for their operation log to catch up to this time before processing requests. This approach enables cross-session and cross-client causal consistency while maintaining system availability during network partitions.
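The session mechanics can be sketched roughly as follows (an illustrative Python rendering, not MongoDB's implementation): every reply carries the node's ClusterTime, the session remembers the highest one it has seen, and a node waits for its oplog to catch up to that time before answering:

import time

class CausalSession:
    # Tracks the highest ClusterTime observed, as (epoch_seconds, counter).
    def __init__(self):
        self.operation_time = (0, 0)

    def observe(self, cluster_time):
        self.operation_time = max(self.operation_time, cluster_time)

class Node:
    def __init__(self):
        self.applied_up_to = (0, 0)   # highest oplog entry applied on this node

    def wait_until(self, target_time, poll=0.01):
        # Block until replication has applied everything up to target_time.
        while self.applied_up_to < target_time:
            time.sleep(poll)

    def read(self, session, key, store):
        self.wait_until(session.operation_time)   # enforce "read your causal past"
        session.observe(self.applied_up_to)       # advance the session's causal frontier
        return store.get(key)

node = Node()
node.applied_up_to = (1736500000, 3)       # pretend the oplog is applied up to here
session = CausalSession()
session.observe((1736499999, 7))           # causal past learned from an earlier reply
print(node.read(session, "k", {"k": 42}))  # no wait needed; returns 42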
MongoDB also uses ClusterTime for running OCC general-purpose transactions across multiple shards and provides a snapshot isolation guarantee for these transactions. This takes longer to explain, so I will cover MongoDB's interactive/ACID distributed transaction protocol in another blog post.
DynamoDB
DynamoDB's implementation of transactions uses the timestamp-based Bernstein-Goldman optimistic concurrency control (TSO) protocol from 1980, where transactions are assigned timestamps at the start to define their position in the serial order. Correctness does not depend on perfect clock synchronization: the timestamps can be treated as monotonically increasing logical timestamps since the protocol rejects smaller timestamps. Synchronized clocks help improve the performance of transactions by reducing spurious conflicts.
To maintain its promise of predictable performance at scale, DynamoDB implements one-shot transactions rather than interactive transactions, allowing it to avoid the complexities of lock management and recovery. For write transactions, the system uses a two-phase protocol where the transaction coordinator checks timestamps against item timestamps before committing. While better synchronized clocks result in more successful transactions (as real-time ordering aligns better with serialization order), the system includes safeguards against clocks that are too far ahead and isolates transaction managers with faulty clocks.
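In spirit, the timestamp check works like the sketch below (a simplified rendering of Bernstein-Goldman timestamp ordering, not DynamoDB's code): each item remembers the timestamp of the last transaction that touched it, and a write transaction is rejected if its own timestamp is older:

class Item:
    def __init__(self, value=None):
        self.value = value
        self.last_write_ts = 0
        self.last_read_ts = 0

def try_commit_write(txn_ts, writes):
    # writes: dict mapping Item -> new value. Returns True if the transaction commits.
    # Validation phase: reject if any item was already read or written at a later
    # timestamp -- the transaction would appear out of serial order.
    for item in writes:
        if txn_ts < item.last_write_ts or txn_ts < item.last_read_ts:
            return False
    # Commit phase: apply the writes and record the timestamp.
    for item, value in writes.items():
        item.value = value
        item.last_write_ts = txn_ts
    return True

x = Item()
print(try_commit_write(100, {x: "a"}))   # True: commits
print(try_commit_write(90, {x: "b"}))    # False: an older timestamp arrives late and is rejected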
For read-only transactions, DynamoDB employs a two-phase protocol that avoids maintaining read timestamps on items, which would turn every read into a costly write operation. Instead, it uses log sequence numbers (LSNs) to detect concurrent modifications between the two phases.
Accord
The Accord protocol (2023) aims to implement general-purpose transactions in Cassandra. The key innovation is the use of synchronized clocks and hybrid logical clocks (HLCs) to achieve fast, consistent transactions across distributed systems.
Accord is a leaderless consensus protocol that aims to achieve optimal latency (one WAN round-trip) under normal conditions while maintaining strict-serializable isolation. It comes from the same family of solutions as EPaxos, Caesar, and Tempo. Improving on Tempo (which we had reviewed in detail here), Accord handles timestamp conflicts more efficiently by allowing inconsistent but valid sets of dependencies between transactions, requiring only that all dependency sets include those transactions that may be committed with a lower timestamp.
This time-based approach, combined with the Reorder Buffer and Fast Path Electorate concepts, helps improve performance. Accord's reorder buffer uses knowledge of the maximum clock skew between nodes and point-to-point latencies to ensure messages are processed in the correct timestamp order. The system buffers timestamp proposals for a period equal to the clock skew plus the longest point-to-point latency, ensuring that potentially conflicting messages have time to arrive before processing begins. Accord uses HLCs that augment timestamps with unique replica IDs to achieve both temporal ordering and global uniqueness of transactions.
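The reorder buffer can be sketched simply (my illustration, not Accord's implementation): hold each timestamp proposal for max clock skew plus max point-to-point latency, so any earlier-stamped proposal still in flight has time to arrive and be ordered first:

import heapq

MAX_CLOCK_SKEW_MS = 1.0
MAX_P2P_LATENCY_MS = 4.0
BUFFER_MS = MAX_CLOCK_SKEW_MS + MAX_P2P_LATENCY_MS   # how long to hold each proposal

class ReorderBuffer:
    def __init__(self):
        self._heap = []   # entries are (proposal_ts, arrival_ts, txn)

    def receive(self, proposal_ts, arrival_ts, txn):
        heapq.heappush(self._heap, (proposal_ts, arrival_ts, txn))

    def ready(self, now_ms):
        # Release proposals in timestamp order once their hold period has elapsed;
        # by then, anything with a smaller timestamp must already have arrived.
        out = []
        while self._heap and self._heap[0][1] + BUFFER_MS <= now_ms:
            out.append(heapq.heappop(self._heap)[2])
        return out

rb = ReorderBuffer()
rb.receive(proposal_ts=100.0, arrival_ts=2.0, txn="T2")
rb.receive(proposal_ts=99.0, arrival_ts=3.0, txn="T1")   # lower timestamp, arrived later
print(rb.ready(now_ms=9.0))   # ['T1', 'T2'] -- released in timestamp order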
If you just want to focus on the abstract idea, Detock (SIGMOD 2023) presents a similar idea for serializing transactions using time. So yeah there is a hint of deterministic databases here.
TiDB: snapshot isolation (Timestamp Oracle)
TiDB employs a centralized timestamp management system through its Placement Driver (PD), which hosts the timestamp oracle (TSO). The TSO generates globally unique timestamps that combine physical time (with millisecond accuracy) and logical time (using 18 bits), serving as transaction IDs. This hybrid logical clock approach ensures strict ordering of transactions while maintaining a close relationship with real time.
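The timestamp layout is straightforward: physical milliseconds in the high bits and an 18-bit logical counter in the low bits. A sketch of how such timestamps could be generated (illustrative only, not PD's actual code):

import time

LOGICAL_BITS = 18
LOGICAL_MAX = 1 << LOGICAL_BITS   # 262,144 timestamps available per millisecond

class TimestampOracle:
    # Centralized TSO: timestamp = (physical ms << 18) | logical counter.
    def __init__(self):
        self.physical = int(time.time() * 1000)
        self.logical = 0

    def next_ts(self):
        now = int(time.time() * 1000)
        if now > self.physical:
            self.physical, self.logical = now, 0
        else:
            self.logical += 1
            if self.logical >= LOGICAL_MAX:
                # Exhausted the logical space within one millisecond: wait for
                # the physical clock to advance.
                while int(time.time() * 1000) <= self.physical:
                    pass
                self.physical, self.logical = int(time.time() * 1000), 0
        return (self.physical << LOGICAL_BITS) | self.logical

tso = TimestampOracle()
ts1, ts2 = tso.next_ts(), tso.next_ts()
assert ts2 > ts1   # strictly increasing, globally unique transaction IDs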
TiDB uses these timestamps to implement snapshot isolation (SI) through multi-version concurrency control (MVCC). Unlike CockroachDB which provides serializability, TiDB focuses on snapshot isolation and read committed isolation levels. The timestamps play a crucial role in transaction management, allowing the system to maintain consistent views of the data and protect against write-write conflicts. The system supports both optimistic and pessimistic concurrency control methods, with the optimistic approach providing better throughput in low-conflict situations.
An important limitation of TiDB's time-based approach is that its centralized TSO makes multi-region deployment challenging. While the system uses Raft for replication within a single region (organizing data into 96MB range shards they call "regions"), geo-distribution would require modifications to both the TSO and Placement Driver to handle cross-region timestamp coordination effectively.
Aurora Limitless
Aurora Limitless is not a different engine, but rather a managed sharding option built on top of Aurora (which we reviewed here). Instead of a single writer it allows for multiple writers by leveraging precise time synchronization to achieve scalable transactions. The system uses Amazon TimeSync, which provides microsecond-accurate time synchronization with bounded uncertainty (under 50 microseconds), combined with Hybrid Logical Clocks (HLC) to ensure consistent ordering of events across the distributed system.
Aurora Limitless modifies PostgreSQL's traditional snapshot isolation mechanism to work in a distributed environment by implementing snapshots "as of then" rather than "as of now." To handle potential issues with transaction visibility and commit timing, Aurora Limitless uses clever techniques like predicting commit completion times and setting commit timestamps slightly in the future. This approach helps maintain read performance by ensuring reads are never blocked by writes while still preserving consistency guarantees.
The architecture employs a two-phase commit (2PC) protocol with an interesting twist: transaction routers initially coordinate commits but hand off supervision to local shard coordinators for high availability. This is because we don't want to require high availability at the router and want to keep it soft-state. Shards already have high availability built in, so the router hands off supervision of 2PC completion to a shard leader, which is already replicated and highly available. The system also addresses the challenge of "zombie shards" (nodes that don't know they've been failed over) through consistency leases that expire after a few seconds to prevent stale reads.
Aurora DSQL
Aurora DSQL (2024) uses time as a fundamental mechanism for managing transaction consistency and isolation. Unlike previous Aurora databases, it employs a different, truly serverless architecture with two key services: an adjudicator service for concurrency control and a journal service for update durability.
The system uses timestamps in two crucial ways: first, when a transaction starts, it chooses a start time (Ts) for performing all reads against the multiversion concurrency control (MVCC) storage system. Second, when committing, it selects a commit time (Tc). These timestamps help maintain snapshot isolation by ensuring that conflicts are properly detected between transactions that overlap in time.
DSQL's commit process is built around temporal coordination. When a transaction wants to commit, it must verify that no other transaction has written to the same keys between its start time (Ts) and commit time (Tc). This verification is handled by the adjudicator service, which makes guarantees about transaction ordering based on timestamps. The adjudicator promises to never commit a transaction with an earlier timestamp once it has allowed a transaction to commit at a given time, establishing a temporal ordering of transactions. Each adjudicator has jurisdiction over specific key-ranges. Adjudicators don't use Paxos for leader-election/fault-tolerance, but rather they again leverage time/fencing during handovers of key-range responsibility.
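In outline, the adjudicator's job looks like the following sketch (a hedged reconstruction from the description above, not AWS code): track the highest commit time already granted for its key range, and accept a commit only if no conflicting write landed between the transaction's start time Ts and commit time Tc:

class Adjudicator:
    # Adjudicator for one key range: orders commits by timestamp and detects conflicts.
    def __init__(self):
        self.last_commit_ts = 0      # promise: never commit below this again
        self.committed_writes = []   # list of (commit_ts, set_of_keys)

    def try_commit(self, ts_start, ts_commit, write_set):
        # Respect the ordering promise: once a commit at time T is allowed,
        # nothing with an earlier timestamp may commit.
        if ts_commit <= self.last_commit_ts:
            return False
        # OCC check: reject if another transaction wrote an overlapping key
        # between this transaction's start and commit times.
        for commit_ts, keys in self.committed_writes:
            if ts_start < commit_ts < ts_commit and keys & write_set:
                return False
        self.committed_writes.append((ts_commit, set(write_set)))
        self.last_commit_ts = ts_commit
        return True

adj = Adjudicator()
print(adj.try_commit(ts_start=10, ts_commit=20, write_set={"k1"}))   # True
print(adj.try_commit(ts_start=15, ts_commit=25, write_set={"k1"}))   # False: conflict on k1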
The storage system uses timestamps to determine when it has a complete view of the data at a given point in time. To handle cases where the write rate might be low, the system implements a heartbeat protocol where adjudicators commit to moving their commit points forward in sync with the physical clock. This ensures that storage nodes can confidently serve reads as of a particular timestamp by knowing they have seen all transactions up to that point.
This time-based approach allows Aurora DSQL to achieve strong consistency reads and snapshot isolated transactions while maintaining high performance through optimistic concurrency control (OCC). By moving all coordination to commit time and using timestamps for ordering, the system minimizes the necessary coordination between machines, availability zones, and regions.
Outgrowing Postgres: How to identify scale problems
Sysbench performance over time for InnoDB and MyRocks: part 4
This is part 4 in my (possibly) final series on performance regressions in MySQL using cached sysbench as the workload. For previous posts, see part 1, part 2 and part 3. This post covers performance differences between InnoDB in upstream MySQL 8.0.32, InnoDB in FB MySQL 8.0.32 and MyRocks in FB MySQL 8.0.32 using a server with 32 cores and 128G of RAM.
I don't claim that the MyRocks CPU overhead isn't relevant, but this workload (CPU-bound, database is cached) is a worst-case for it.
tl;dr
- InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
- Fixing bug 1506 is important for InnoDB in FB MySQL
- MyRocks is ~30% slower than upstream InnoDB at low concurrency and ~45% slower at high, as it uses ~1.5X more CPU/query
- For writes, MyRocks does worse at high concurrency than at low
I looked at vmstat metrics for the update-nonindex benchmark and the number of context switches per update is about 1.2X larger for MyRocks vs InnoDB at high concurrency.
Then I looked at PMP stacks and MyRocks has more samples for commit processing. The top stacks are here. This should not be a big surprise because MyRocks does more work at commit time (pushes changes from a per-session buffer into the memtable). But I need to look at this more closely.
I browsed the code in Commit_stage_manager::enroll_for, which is on the call stack for the mutex contention, and it is kind of complicated. I am trying to figure out how many mutexes are locked in there and figuring that out will take some time.
Much more detail on the benchmark and hardware is here. I am trying to avoid repeating that information in the posts that follow.
Results here are from the c32r128 server with 32 CPU cores and 128G of RAM. The benchmarks were repeated for 1 and 24 threads. On the charts below that is indicated by NT=1 and NT=24.
Builds
The previous post has more detail on the builds, my.cnf files and bug fixes.
- my8032_rel_o2nofp
- InnoDB from upstream MySQL 8.0.32
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
- FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1. This supports InnoDB and MyRocks.
- fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
- FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1 with patches applied for bugs 1473, 1481, 1482 and 1506. This supports InnoDB and MyRocks.
The results below use relative QPS (rQPS) where:
- rQPS is: (QPS for my version) / (QPS for base version)
- base version is InnoDB from upstream MySQL 8.0.32 (my8032_rel_o2nofp)
- my version is one of the other versions
- fbinno-nofix
- InnoDB from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
- fbinno-somefix
- InnoDB from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
- myrocks-nofix
- MyRocks from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
- myrocks-somefix
- MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
Summary statistics: InnoDB
Summary:
- InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
- Fixing bug 1506 is important for InnoDB in FB MySQL
Summary statistics: MyRocks
Summary:
- MyRocks does better at low concurrency than at high. The fix might be as simple as enabling the hyper clock block cache
- MyRocks is ~30% slower than upstream InnoDB at low concurrency and ~45% slower at high
- For writes, MyRocks does worse at high concurrency than at low
- InnoDB from FB MySQL is no worse than 10% slower than upstream
Results: c32r128 with MyRocks and point queries
- at low concurrency the worst case for MyRocks is the tests that do point lookups on secondary indexes, because that uses a range scan rather than a point lookup on the LSM tree, which means that bloom filters cannot be used
- at high concurrency the difference between primary and secondary index queries is less significant, perhaps this is dominated by mutex contention from the LRU block cache and solved by using hyper clock
Results: c32r128 with InnoDB and range queries
Summary
- the worst case for InnoDB from FB MySQL is the long range scans, and fixing bug 1506 will be a big deal
Results: c32r128 with MyRocks and range queries
Summary
- while long range scans are the worst case here, bug 1506 is not an issue as that is InnoDB-only
Results: c32r128 with InnoDB and writes
Summary
- results are stable here, InnoDB from FB MySQL is no worse than ~10% slower than upstream but results at high concurrency are a bit worse than at low
Results: c32r128 with MyRocks and writes
Summary
- MyRocks does much better than InnoDB for update-index because it does blind writes rather than RMW for non-unique secondary index maintenance
- MyRocks does worse at high concurrency than at low