April 20, 2025
Transactions are a protocol
Transactions are not an intrinsic part of a storage system. Any storage system can be made transactional: Redis, S3, the filesystem, etc. Delta Lake and Orleans demonstrated techniques to make S3 (or cloud storage in general) transactional. Epoxy demonstrated techniques to make Redis (and any other system) transactional. And of course there's always good old Two-Phase Commit.
If you don't want to read those papers, I wrote about a simplified implementation of Delta Lake and also wrote about a simplified MVCC implementation over a generic key-value storage layer.
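The core idea of MVCC over a generic key-value layer can be sketched in a few lines. This is a toy illustration, not the implementation from either post: the class and field names are invented, versions are kept as an append-only list per key, and conflict detection on commit is elided.

```python
# Toy snapshot MVCC over a plain dict standing in for any key-value store.
# Names and structure are invented for illustration; real MVCC also needs
# write-write conflict detection and garbage collection of old versions.
import itertools

class MVCCStore:
    def __init__(self):
        self.versions = {}              # key -> list of (txid, value), append-only
        self.committed = set()          # txids whose writes are visible
        self.next_txid = itertools.count(1)

    def begin(self):
        # A transaction sees only versions committed before it began.
        return {"id": next(self.next_txid),
                "snapshot": set(self.committed),
                "writes": {}}

    def get(self, tx, key):
        if key in tx["writes"]:         # read-your-own-writes
            return tx["writes"][key]
        for txid, value in reversed(self.versions.get(key, [])):
            if txid in tx["snapshot"]:  # newest version visible to the snapshot
                return value
        return None

    def put(self, tx, key, value):
        tx["writes"][key] = value       # buffered until commit

    def commit(self, tx):
        for key, value in tx["writes"].items():
            self.versions.setdefault(key, []).append((tx["id"], value))
        self.committed.add(tx["id"])
```

A transaction that began before another's commit keeps reading its own snapshot, which is the isolation property the underlying store knows nothing about.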
It is both the beauty and the burden of transactions that they are not intrinsic to a storage system. Postgres, MySQL, and SQLite have transactions, but you don't need to use them; it isn't even possible for a database to require that you use them. Many developers (myself included, a few years ago) do not know why you should use them. (Hint: read Designing Data-Intensive Applications.)
And you can take it even further by ignoring the transaction layer of an existing transactional database and implementing your own transaction layer on top, as Convex has done (the Epoxy paper above also does this). You may not have much to lose by doing so, since the indexes you'd want on the version field of a value would only be as expensive or slow as any other secondary index in a transactional database. Why you'd want to, though, isn't entirely clear (I would like to read about this from Convex some time).
It's useful to see transaction protocols as another tool in your system design tool chest when you care about consistency, atomicity, and isolation. Especially as you build systems that span data systems. Maybe, as Ben Hindman hinted at the last NYC Systems, even proprietary APIs will eventually provide something like two-phase commit so physical systems outside our control can become transactional too.
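The two-phase commit protocol mentioned above is small enough to sketch. The Participant interface here is invented for illustration and elides everything that makes 2PC hard in practice (timeouts, retries, durable logging, coordinator crash recovery): a coordinator asks every participant to prepare, and only if all vote yes does it tell them all to commit.

```python
# Minimal two-phase commit sketch. The Participant class is a stand-in
# for any system (Redis, S3, a filesystem) wrapped in a prepare/commit API.
class Participant:
    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare
        self.state = "idle"

    def prepare(self):
        # Phase 1: durably stage the writes, then vote yes or no.
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.will_prepare

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes from everyone.
    if all(p.prepare() for p in participants):
        # Phase 2: unanimous yes, so the decision is commit.
        for p in participants:
            p.commit()
        return "committed"
    # Any no vote aborts the whole transaction everywhere.
    for p in participants:
        p.abort()
    return "aborted"
```

The point of the protocol is exactly the one in the post: neither participant needs to be transactional across systems on its own; atomicity comes from the protocol layered over them.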
— Phil Eaton (@eatonphil) April 20, 2025
April 19, 2025
Battle of the Mallocators: part 2
This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:
- MALLOC_ARENA_MAX with glibc malloc
- see here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (8 X number_of_CPU_cores) so I tested it with 1, 8, 48 and 96 arenas.
- compiling RocksDB and MyRocks with jemalloc specific code enabled
- In my previous results I just set malloc-lib in my.cnf, which uses LD_PRELOAD to load your favorite malloc library implementation.
- For mysqld with jemalloc enabled via malloc-lib (LD_PRELOAD) versus mysqld with jemalloc-specific code enabled at compile time
- performance, VSZ and RSS were similar
- After also setting rocksdb_cache_dump=0 for the binary compiled with jemalloc-specific code
- performance is slightly better (excluding the outlier, the benefit is up to 3%)
- peak VSZ is cut in half
- peak RSS is reduced by ~9%
- With 1 arena performance is lousy but the RSS bloat is mostly solved
- With 8, 48 or 96 arenas the RSS bloat is still there
- With 48 arenas there are still significant (5% to 10%) performance drops
- With 96 arenas the performance drop was mostly ~2%
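The peak VSZ and peak RSS numbers above come from the kernel's per-process accounting; on Linux they appear in /proc/&lt;pid&gt;/status as VmPeak and VmHWM (both in kB). A small sketch of reading them, where the helper functions are my own illustration, not the post's tooling:

```python
# Read peak virtual size (VmPeak) and peak resident set size (VmHWM)
# from /proc/<pid>/status on Linux. Values are reported in kB.
def parse_peaks(status_text):
    peaks = {}
    for line in status_text.splitlines():
        if line.startswith(("VmPeak:", "VmHWM:")):
            field, value, unit = line.split()
            peaks[field.rstrip(":")] = int(value)   # kB
    return peaks

def process_peaks(pid="self"):
    # pid may be a numeric pid or "self" for the current process.
    with open(f"/proc/{pid}/status") as f:
        return parse_peaks(f.read())
```

Sampling these for mysqld over a benchmark run is enough to reproduce the "peak VSZ cut in half" style comparisons in this post.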
The jemalloc-specific code is enabled at compile time via the CMake options: -DHAVE_JEMALLOC=1 -DWITH_JEMALLOC=1
182a183,184
> cmd="MALLOC_ARENA_MAX=1 $cmd"
> echo Run :: $cmd
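The diff above shows the benchmark script prefixing the server command with MALLOC_ARENA_MAX=1, which sets the variable only for that child process. The same pattern from Python, where the echoed command is just a stand-in for mysqld:

```python
# Launch a command with MALLOC_ARENA_MAX set only in the child's
# environment, mirroring the shell prefix in the benchmark script above.
import os
import subprocess
import sys

def run_with_arenas(cmd, arenas):
    env = dict(os.environ, MALLOC_ARENA_MAX=str(arenas))
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Stand-in child process that just reports what it inherited.
result = run_with_arenas(
    [sys.executable, "-c", "import os; print(os.environ['MALLOC_ARENA_MAX'])"],
    arenas=8,
)
```

Because the variable is set per-process rather than exported globally, runs with different arena counts can't contaminate each other.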
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
- This is the base case in the table below
- this is what I used in my previous post: jemalloc is enabled by setting malloc-lib in my.cnf, which uses LD_PRELOAD
- fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
- This is col-1 in the table below
- MySQL with jemalloc specific code enabled at compile time
- fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
- This is col-2 in the table below
- MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf
Each entry in the table below is relative QPS: (QPS with $allocator) / (QPS with glibc malloc)
- results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
- results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.
- jemalloc.1
- base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
- jemalloc.2
- col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
- This has little impact on VSZ and RSS
- jemalloc.3
- col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
- This cuts peak VSZ in half and reduces peak RSS by 9%
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
- base case in the table below
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
- col-1 in the table below
- uses MALLOC_ARENA_MAX=1
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
- col-2 in the table below
- uses MALLOC_ARENA_MAX=8
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
- col-3 in the table below
- uses MALLOC_ARENA_MAX=48
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
- col-4 in the table below
- uses MALLOC_ARENA_MAX=96
Each entry in the table below is relative QPS: (QPS with $allocator) / (QPS with glibc malloc)
- performance with 1 or 8 arenas is lousy
- performance drops some (often 5% to 10%) with 48 arenas
- performance drops ~2% with 96 arenas
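The relative-QPS metric used in both tables is simple to reproduce. As a worked example with made-up QPS numbers (these are not the post's results):

```python
# Compute (QPS with config) / (QPS with base config) per benchmark step,
# the metric shown in both tables above. The numbers here are invented.
base_qps = {"hot-points": 10000, "point-query": 20000}
arena96_qps = {"hot-points": 9800, "point-query": 19600}

relative = {step: round(arena96_qps[step] / base_qps[step], 2)
            for step in base_qps}
# A value below 1.00 means the config is slower than the base case;
# 0.98 corresponds to the "~2% drop" described for 96 arenas.
```

Reading the tables then reduces to scanning for entries well below 1.00.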