January 09, 2025
Sysbench performance over time for InnoDB and MyRocks: part 1
I spent much time in 2024 documenting performance regressions from old to new versions of MySQL with InnoDB and MyRocks. More posts will be published in 2025, including this sequence of posts, but my work on that is winding down. Most of the problems are from many small regressions rather than a few big ones and it is very expensive to find and fix regressions long after they arrive.
Hopefully someone else will emerge to do work like this for MySQL going forward, or upstream will find more resources to prevent new small regressions from arriving. The trend over the past decade has been great for Postgres. And if upstream wants to grow Heatwave, then avoiding regressions in the base product is one way to help with that.
The purpose of this post is to document the regressions from MySQL 5.6 through 8.0 for MyRocks and upstream InnoDB using sysbench with a cached workload.
tl;dr, v1
- Modern InnoDB at low concurrency gets about 70% of the QPS relative to older InnoDB (5.6.51). The regressions aren't as bad at high concurrency.
- Modern MyRocks gets between 80% and 95% of the QPS relative to older MyRocks (5.6.35) in most cases (high-concurrency writes are the exception, where QPS is similar for old and new).
tl;dr, v2
- Both upstream MySQL and the FB MySQL project would benefit from changepoint detection (like Nyrkio) using sysbench microbenchmarks to detect regressions
- Regressions from MySQL 5.6 to 8.0 for InnoDB are worse at low concurrency than at high. MySQL has gotten more efficient at high-concurrency workloads, although some of that benefit is lost to code bloat, which is more visible at low concurrency.
- Regressions from MySQL 5.6 to 8.0 are smaller for MyRocks than for InnoDB at low concurrency, but larger at high concurrency. I suspect the high-concurrency regressions for MyRocks would be smaller than InnoDB's had I enabled the hyper clock cache.
Builds
- fbmy5635_rel_o2nofp_210407_f896415f_6190
- MyRocks 5.6.35 at git hash f896415f (2021/04/07) and RocksDB 6.19.0
- fbmy5635_rel_o2nofp_231016_4f3a57a1_870
- MyRocks 5.6.35 at git hash 4f3a57a1 (2023/10/16) and RocksDB 8.7.0
- fbmy8028_rel_o2nofp_231202_4edf1eec_870
- MyRocks 8.0.28 at git hash 4edf1eec (2023/12/02) and RocksDB 8.7.0
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
- MyRocks 8.0.32 at git hash ba9709c9 (2024/10/23) and RocksDB 9.7.1
- fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
- MyRocks 8.0.32 at git hash ba9709c9 (2024/10/23) and RocksDB 9.7.1 with fixes applied for bugs 1473, 1481, 1482 and 1506. There is also a workaround for bug 1505 via my.cnf changes. But note that bug 1506 is only for InnoDB.
- 1473
- is_thd_db_read_only_by_name accounts for ~2% of CPU time on CPU-bound and write-heavy microbenchmarks
- 1481
- ha_statistic_increment accounts for ~5% of CPU time on CPU-bound table scans
- 1482
- calls to clock_gettime in sql/optimizer.cc account for ~4% of CPU time on several microbenchmarks
- 1505
- the default for yield_check_frequency is zero, which makes MySQL waste much time in calls to thd_wait_yield. A better default is 10. Too many my.cnf options isn't a big deal; too many options with bad default values is a big deal. The workaround for this is to set yield_check_frequency=10 in my.cnf.
- 1506
- this is limited to InnoDB in the FB MySQL tree. It is from changes in the concurrency ticket code and reduces CPU-bound table scan performance by ~20%.
- fbmy5635_rel_o2nofp_210407_f896415f_6190
- fbmy5635_rel_o2nofp_231016_4f3a57a1_870
- fbmy8028_rel_o2nofp_231202_4edf1eec_870
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
- my.cnf.cza2_c32r128 (this adds yield_check_frequency=10)
- fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
- my.cnf.cza2_c32r128 (this adds yield_check_frequency=10)
- c8r16
- The c8r16 name stands for 8 CPU cores and 16G of RAM. This is a Beelink SER4 with an AMD Ryzen 7 4700 CPU with SMT disabled, 8 cores, 16G of RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
- c8r32
- The c8r32 name stands for 8 CPU cores and 32G of RAM. This is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, with SMT disabled, 8 cores, 32G RAM, Ubuntu 22.04 and ext4 on 1 NVMe device.
- c24r64
- The c24r64 name stands for 24 CPU cores and 64G of RAM. This is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, ext4). The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.
- c32r128
- The c32r128 name stands for 32 CPU cores and 128G of RAM. This is a Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4).
- c8r16 - 1 thread, 1 table, 30M rows
- c8r32 - 1 thread, 1 table, 50M rows
- c24r64 - 1 thread and then 16 threads, 8 tables, 10M rows/table
- c32r128 - 1 thread and then 24 threads, 8 tables, 10M rows/table
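As an illustrative sketch (not the actual benchmark scripts used for these posts), the per-server settings above could be turned into sysbench command lines like this; the test name, connection options and run time below are assumptions for illustration:

```python
# Illustrative sketch only: builds sysbench command lines from the per-server
# settings listed above. The actual benchmark harness (helper scripts, my.cnf
# handling, the full list of microbenchmarks) is not shown in this post; the
# test name, connection options and run time here are assumptions.
SERVERS = {
    "c8r16":   {"threads": [1],     "tables": 1, "rows": 30_000_000},
    "c8r32":   {"threads": [1],     "tables": 1, "rows": 50_000_000},
    "c24r64":  {"threads": [1, 16], "tables": 8, "rows": 10_000_000},
    "c32r128": {"threads": [1, 24], "tables": 8, "rows": 10_000_000},
}

def sysbench_cmds(server, test="oltp_point_select"):
    cfg = SERVERS[server]
    common = (f"--tables={cfg['tables']} --table-size={cfg['rows']} "
              "--mysql-user=root --mysql-socket=/tmp/mysql.sock")
    cmds = [f"sysbench {test} {common} prepare"]
    for nt in cfg["threads"]:
        cmds.append(f"sysbench {test} {common} --threads={nt} --time=300 run")
    return cmds

for cmd in sysbench_cmds("c32r128"):
    print(cmd)
```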
Relative QPS
- rQPS is: (QPS for my version) / (QPS for base version)
- base version for InnoDB is 5.6.51 and for MyRocks is 5.6.35 (fbmy5635_rel_o2nofp_210407_f896415f_6190)
- my version is one of the other versions
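- for example (hypothetical numbers for illustration): if 5.6.51 gets 10,000 QPS on a test and 8.0.32 gets 7,200, then the rQPS for 8.0.32 on that test is 7,200 / 10,000 = 0.72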
- based on the median values, InnoDB in 8.0.32 gets ~72% of the QPS relative to 5.6.51
- the results here are better than above. For point queries, upstream made InnoDB better at high-concurrency workloads, which counters the performance loss from new CPU overheads. However, that isn't true for writes, and the issue is made worse by some of the refactoring in InnoDB that landed after 8.0.28.
- the regressions here aren't as bad as they are above for InnoDB
- Based on the median values, MyRocks at higher concurrency here does better than at low concurrency above. I am not sure why. Perhaps, had I enabled the hyper clock block cache, the results here would be better than the InnoDB results above at 24 threads.
- 5635.231016
- fbmy5635_rel_o2nofp_231016_4f3a57a1_870
- 8028.231202
- fbmy8028_rel_o2nofp_231202_4edf1eec_870
- 8032.241023.nofix
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971
- 8032.241023.fix
- fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506
- the worst regression is from hot-points where 5.6 gets almost 2X more QPS than 8.0
- for all tests except one, InnoDB in 8.0 gets less than 75% of the throughput relative to 5.6
- the regression for 8.0.28 in the third group from the right is from bug 102037
- Upstream MySQL has improvements for concurrent workloads in 5.7 and 8.0 which offset the CPU regressions.
- With 5.7.44, 5 tests are much faster than 5.6.51, 2 tests have similar QPS and 2 are slower
- With 8.0.32, 5 tests are much faster than 5.6.51 and 4 are slower
- Regressions from 5.6 to 8.0 are smaller here for MyRocks than above for InnoDB
- In 8.0.32, 7 tests are slower than 5.6.35 and 2 are slightly faster
- The regression for 8.0.28 in the third group from the right is from bug 102037
- The results here are about the same as the results above for MyRocks at 1 thread. Were I to enable the hyper clock cache they might be much better.
Notes for NT=1 (1 thread):
- The regressions are large for everything but long range scans. And for long range scans, while 5.7.44 is faster than 5.6, that benefit no longer exists in 8.0, especially in 8.0.30+.
- Results here are much better than above for InnoDB at 1 thread. But about half of the tests still have large regressions.
Notes for NT=1 (1 thread):
- Results here are similar to what I see above for InnoDB at 1 thread. I will explain more elsewhere, but it has been hard to attribute the root causes to anything other than code bloat (more cache and TLB activity).
- Results here at high concurrency are much better than above at low concurrency, with the exception of the results for the full scan (the last group on the right).
Notes for NT=1 (1 thread):
- All of the tests have large regressions in 8.0.32 except for update-index
- Results here are much better than above at 1 thread. Modern InnoDB is much better at high-concurrency workloads.
Notes for NT=1 (1 thread):
- MyRocks has large regressions at low concurrency.
- The regressions here at high concurrency are about 10% smaller than the regressions above at low concurrency.
Database Sharding
January 08, 2025
I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades
I recently came across the Occult paper (NSDI'17) during my series on "The Use of Time in Distributed Databases." I had high expectations, but my in-depth reading surfaced significant concerns about its contributions and claims. Let me share my analysis, as there are still many valuable lessons to learn from Occult about causality maintenance and distributed systems design.
The Core Value Proposition
Occult (Observable Causal Consistency Using Lossy Timestamps) positions itself as a breakthrough in handling causal consistency at scale. The paper's key claim is that it's "the first scalable, geo-replicated data store that provides causal consistency without slowdown cascades."
The problem they address is illustrated in Figure 1, where a slow or failed shard A (with delayed replication from master to secondary) can create cascading delays across other shards (B and C) due to dependency-waiting during write replication. This is what the paper means by "slowdown cascades". Occult's solution shifts this blocking from writes to reads: it eliminates dependency-blocking during write replication and instead delays read operations served by lagging shards, so that what a client reads stays consistent with what it has already seen, preserving causal consistency.
Questionable Premises
The paper presents dependency-blocking across shards for writes as a common problem, yet I struggle to identify any production systems that implement causal consistency (or any stronger consistency) this way. The cited examples are all academic systems like COPS, not real-world databases.
More importantly, the paper's claim of being "first" overlooks many existing solutions. Spanner (2012) had already demonstrated how to handle this challenge effectively using synchronized clocks, achieving even stronger consistency guarantees (strict serializability) without slowdown cascades. Spanner already does what Occult proposes to do: it shifts write-blocking to read-blocking, and it also uses synchronized clocks to help with reads.
The paper acknowledges Spanner only briefly in related work, noting its "heavier-weight mechanisms" and, perhaps as a result, that it ends up "aborting more often" - but this comparison feels incomplete given Spanner's superior consistency model and production-proven success.
The Secondary Reads Trade-off
A potential justification for Occult's approach is enabling reads from nearby secondaries without client stickiness. However, the paper doesn't analyze this trade-off at all. Why not just read from the masters? When are secondary reads sufficiently beneficial to justify the added complexity? The paper itself notes in the introduction that inconsistency at secondaries is extremely rare - citing Facebook's study showing fewer than six out of every million reads violating causal consistency even in an eventually-consistent data store.
In examining potential justifications for secondary reads, I see only two viable scenarios. First, secondary reads might help when the primary is CPU-constrained - but this only applies under very limited circumstances, typically with the smallest available VM instances, which is unlikely for production systems. Second, there's the latency consideration: while cross-availability-zone latencies wouldn't justify secondary reads in this case, cross-region latencies might make them worthwhile for single-key operations. However, even this advantage diminishes for transactional reads within general transactions, where reading from masters is more sensible to avoid transaction aborts due to conflicts, which would waste all the work done as part of the transaction.
The Client Session Problem
The paper's handling of client sessions (or rather the lack thereof) reveals another limitation. Their example (from the conference presentation) couples unrelated operations wholesale - like linking a social media post to an academic document share - into the same dependency chain. Modern systems like MongoDB, with its causal consistency tokens (introduced in 2017, the same year), provide a better approach to session management.
Comparing Occult with Spanner reveals some interesting nuances. While Spanner can read from secondaries, it requires them to catch up to the current time (T_now). Occult takes a different approach by maintaining client causal shardstamps, allowing reads from secondaries without waiting for T_now synchronization. This theoretically enables causal consistency using earlier clock values than Spanner's current-time requirement.
However, this theoretical advantage comes with significant practical concerns, as discussed above in the secondary reads trade-off. Occult shifts substantial complexity to the client side, but the paper inadequately addresses the overhead and coordination requirements this imposes. The feasibility of expecting clients (which are typically frontend web proxies in the datacenter) to handle such sophisticated coordination with servers remains questionable.
The Fatal Flaw
The most concerning aspect of Occult emerges in Section 6 on fault tolerance. The paper reveals that correctness under crash failures requires synchronous replication of the master using Paxos or similar protocols - before the "asynchronous" replication to secondaries. Say what!? This requirement fundamentally undermines the system's claimed benefits from asynchronous replication.
Let me quote from the paper. "Occult exhibits a vulnerability window during which writes executed at the master may not yet have been replicated to slaves and may be lost if the master crashes. These missing writes may cause subsequent client requests to fail: if a client c’s write to object o is lost, c cannot read o without violating causality. This scenario is common to all causal systems for which clients do not share fate with the servers to which they write.
Occult's client-centric approach to causal consistency, however, creates another dangerous scenario: as datacenters are not themselves causally consistent, writes can be replicated out of order. A write y that is dependent on a write x can be replicated to another datacenter despite the loss of x, preventing any subsequent client from reading both x and y.
Master failures can be handled using well-known techniques: individual machine failures within a datacenter can be handled by replicating the master locally using chain-replication or Paxos, before replicating asynchronously to other replicas."
Unfortunately, the implementation ignores fault tolerance and the evaluation omits crash scenarios entirely, focusing only on node slowdown. This significant limitation isn't adequately addressed in the paper's discussion; it is mentioned in only one paragraph of the fault-tolerance section.
Learnings
Despite these criticisms, studying this paper has been valuable. It prompts important discussions about:
- Trade-offs in causality maintenance
- The role of synchronized clocks in distributed systems
- The importance of evaluating academic proposals against real-world requirements
For those interested in learning more, I recommend watching the conference presentation, which provides an excellent explanation of the protocol mechanics.
The protocol
At its core, Occult builds upon vector clocks with some key modifications. Servers attach vector-timestamps to objects and track shard states using "shardstamps", and clients also maintain vector-timestamps to keep tabs on their interactions with the servers. Like vector clocks, shardstamp updates occur through pairwise maximum operations between corresponding entries.
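A minimal sketch of this bookkeeping, with one uncompressed entry per shard and hypothetical names (this is not the paper's implementation), might look like:

```python
# Minimal sketch: causal timestamps and shardstamps merged by pairwise maximum.
# One uncompressed entry per shard; names and structure are hypothetical and
# not taken from the Occult implementation.
from collections import defaultdict

def merge(a, b):
    """Pairwise maximum of two sparse vector timestamps (dict: shard -> stamp)."""
    out = defaultdict(int, a)
    for shard, stamp in b.items():
        out[shard] = max(out[shard], stamp)
    return dict(out)

class Client:
    def __init__(self):
        self.causal = {}  # highest shardstamp this client has observed, per shard

    def on_write_ack(self, shard, shardstamp):
        # After a write is accepted by the shard's master, remember its shardstamp.
        self.causal = merge(self.causal, {shard: shardstamp})

    def read_is_consistent(self, shard, replica_shardstamp):
        # A read from a (possibly lagging) secondary is consistent only if the
        # replica has applied at least everything this client depends on for
        # that shard; otherwise the client retries or fails over to the master.
        return replica_shardstamp >= self.causal.get(shard, 0)

    def on_read(self, object_deps):
        # Reads fold the object's causal timestamp into the client's timestamp,
        # so later reads remain consistent with what was just observed.
        self.causal = merge(self.causal, object_deps)
```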
This approach works straightforwardly for single-key updates but becomes more complex for transactions, where a commit timestamp must be chosen and applied consistently across all objects involved. Occult implements transactions using Optimistic Concurrency Control (OCC), but with specific validation requirements. The validation phase must verify two critical properties: first, that the transaction's read timestamp represents a consistent cut across the system, and second, that no conflicting updates occurred before commit. Transaction atomicity is preserved by making writes causally dependent, ensuring clients see either all or none of a transaction's writes - all without introducing write delays that could trigger slowdown cascades.
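As a rough, simplified sketch of those two validation checks (generic OCC-style validation under assumed data structures, not Occult's exact commit protocol):

```python
# Simplified sketch of the two validation checks described above (generic
# OCC-style validation with assumed data structures, not Occult's exact
# commit protocol).

def consistent_cut(read_set):
    """read_set maps key -> (version_read, deps), where deps maps other keys to
    the versions that the read object's creating write depended on. The reads
    form a consistent cut if no object read depends on a newer version of
    another read object than the version the transaction actually observed."""
    for _key, (_version, deps) in read_set.items():
        for other_key, needed_version in deps.items():
            if other_key in read_set and read_set[other_key][0] < needed_version:
                return False
    return True

def no_conflicting_updates(read_set, current_versions):
    """At commit time, each key that was read must still be at the version that
    was read, i.e. no conflicting update happened before the commit."""
    return all(current_versions[k] == v for k, (v, _deps) in read_set.items())

def validate(read_set, current_versions):
    return consistent_cut(read_set) and no_conflicting_updates(read_set, current_versions)
```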
The protocol employs several optimizations. To manage timestamp overhead, it employs both structural and temporal compression techniques. It also leverages loosely synchronized clocks to avoid spurious dependencies between shards, as I discuss below. A key contribution is PC-PSI (Per-Client Parallel Snapshot Isolation), a relaxed version of PSI that requires total ordering only within client sessions rather than globally. This modification enables more efficient transaction processing while maintaining essential consistency guarantees.
Discussion on the use of synchronized clocks
A critical aspect of Occult is the requirement for monotonic clocks - clock regression would break causality guarantees for reads. While Lamport logical clocks could provide this monotonicity, synchronized physical clocks serve a more sophisticated purpose: they prevent artificial waiting periods caused by structural compression of timestamps, which maps multiple shards to the same entry in the compressed shardstamp using a modulo-n operation.
The paper illustrates this problem in Section 5 with a clear example: when two shards (i and j) map to the same compressed entry and have vastly different shardstamps (100 and 1000), a client writing to j would fail the consistency check when reading from a slave of i until i has received at least 1000 writes. If shard i never reaches 1000 writes, the client must perpetually failover to reading from i's master shard. Rather than requiring explicit coordination between shards to solve this problem, Occult leverages loosely synchronized physical clocks to bound these differences naturally. This guarantees that shardstamps differ by no more than the relative offset between their clocks, independent of the write rate on different master shards.
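A small sketch of the compression problem, assuming a simple modulo mapping from shard id to compressed entry and reusing the paper's 100-vs-1000 example:

```python
# Sketch of structural compression: shard ids map to n compressed entries via
# modulo, so unrelated shards can share an entry. The numbers reuse the paper's
# 100-vs-1000 example; everything else here is illustrative.
N = 4  # number of compressed timestamp entries

def entry(shard_id):
    return shard_id % N

i, j = 2, 6                      # shards i and j collide: both map to entry 2
assert entry(i) == entry(j)

shardstamps = {i: 100, j: 1000}  # per-shard progress (uncompressed view)

# The client writes to j, so its compressed causal timestamp stores 1000 in the
# shared entry...
client_compressed = {entry(j): 1000}

# ...and a later read from a slave of i fails the consistency check until i's
# shardstamp also reaches 1000, even though the client never depended on i.
print(shardstamps[i] >= client_compressed[entry(i)])  # False -> fail over to i's master

# With loosely synchronized clocks advancing all shardstamps, the gap between
# colliding entries is bounded by the clock offset rather than by write rates.
```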
This approach resonates with my earlier work on Hybrid Logical Clocks (2013), where we argued for combining synchronized physical clocks with Lamport clock-based causality tracking. The effectiveness of this strategy is demonstrated dramatically in Occult's results - they achieve compression from 16,000 timestamps down to just 5 timestamps through synchronized clocks.
Occult's use of synchronized clocks for causality compression ties into our broader theme about the value of shared temporal reference frames in distributed systems. This connection between physical time and distributed system efficiency deserves deeper exploration.
January 07, 2025
Convex Cookbook: Dynamic Query Builders
January 06, 2025
Read data from dropped columns
It’s hard to destroy data. Even when a column is dropped, the data is still physically there on the data page. We can use the undocumented/unsupported command DBCC PAGE to look at it. Example: first, create a table based on English values from sys.messages: use tempdb; go /* create table based on english values from […]
A New Postgres Block Storage Layout for Full Text Search
January 04, 2025
Monitoring your Amazon Aurora PostgreSQL-Compatible and Amazon RDS PostgreSQL from integer sequence overflow
January 01, 2025
Databases in 2024: A Year in Review
Andy rises from the ashes of his dead startup and discusses what happened in 2024 in the database game.
December 31, 2024
How I run a coffee club
I started the NYC Systems Coffee Club in December of 2023. It's gone pretty well! I regularly get around 20 people each month. You bring a drink if you feel like it and you hang out with people for an hour or two.
There is no agenda, there is no speaker, there is no structure. The only "structure" is that when the circle of people talking to each other seems to get too big, I break the circle up into two smaller circles so we can get more conversations going.
People tend to talk in a little circle and then move around over time. It's basically no different than a happy hour except it is over a non-alcoholic drink and it's in the morning.
All I have to do as the organizer is periodically tell people about the Google Form to fill out. I got people to sign up to the list by posting about this on Twitter and LinkedIn. And then once a month I send an email bcc-ing everyone on the list and ask them to respond for an invite.
The first 20 people to respond get a calendar invite.
I mention all of this because people ask how they can start a coffee club in their city. They ask how it works. But it's very simple! It's one of the least-effortful ways to bring people together in your city.
If your city does not have indoor public spaces, you could use a food court, or a cafe, or a park during months where it is warm.
For example, the Cobble Hill Computer Coffee Club is one that meets outdoors at a park.
Good luck! :)