April 24, 2025
What It Takes to Be PostgreSQL Compatible
We, and many other database enthusiasts, are big fans of PostgreSQL. Even though we built a database system from scratch, we believe there are many good reasons for using PostgreSQL. In fact, we like PostgreSQL so much that we made sure to build CedarDB to be compatible with PostgreSQL.
Because of PostgreSQL’s popularity, we were not the first to develop a PostgreSQL-compatible database system. CMU’s “Database of Databases” lists over 40 database systems that claim to be PostgreSQL compatible. Among them you can find database systems from large cloud vendors, such as AlloyDB from Google and Aurora from AWS.
April 23, 2025
Does FCV Have Any Impact on MongoDB Performance?
April 22, 2025
Multi-Grained Specifications for Distributed System Model Checking and Verification
This EuroSys 2025 paper wrestles with the messy interface between formal specification and implementation reality in distributed systems. The case study is ZooKeeper. The trouble with verifying something big like ZooKeeper is that the spec and the code don’t match. Spec wants to be succinct and abstract; code has to be performant and dirty.
For instance, a spec might say, “this happens atomically.” But the actual system says, “yeah, buddy, right.” Take the case of FollowerProcessNEWLEADER: the spec bundles updating the epoch and accepting the leader’s history into a single atomic step. But in the real system, those steps are split across threads and separated by queuing, I/O, and asynchronous execution. Modeling them as atomic would miss real, observable intermediate states, and real bugs.
To bridge this model-code gap, the authors use modularization and targeted abstraction. Don’t write one spec, write multi-grained specs. Different parts of the system are modeled at different granularities, depending on what you're verifying. Some parts are modeled in fine detail; others are coarse and blurry. Like reading a novel and skimming the boring parts. These modules, written at different levels of abstraction, are composed into a mixed-grained spec, tailored to the verification task at hand.
Central to the method is the interaction-preserving principle, a key idea borrowed from a 2022 paper by some of the same authors. Coarsening is allowed, but only if you preserve the interaction variables: the shared state that crosses module boundaries. A module may lie to itself internally in order to abstract details away, but it may not lie to its peers. Other modules should not be able to distinguish whether they are interacting with the original or the coarsened module. The paper formalizes this using dependency relations over actions and their enabling conditions. This sounds a lot like rely-guarantee, but encoded for TLA+.
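As a rough schematic of the idea (my notation, not the paper's exact formalization): a coarsened module $M^{\#}$ may substitute for $M$ only if, composed with any peer module $N$, the behaviors projected onto the interaction variables $I$ are unchanged:

$$ \mathrm{Traces}(M \parallel N)\,|_I \;=\; \mathrm{Traces}(M^{\#} \parallel N)\,|_I $$

Everything hidden behind the projection is fair game for abstraction; everything visible through $I$ must be preserved exactly.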
What makes this paper stand out is its application. This is a full-on, hard-core TLA+ verification effort against a real-world messy codebase. It’s a good guide for how to do formal work practically, without having to rewrite the system in a verification-oriented language. This paper is not quite a tutorial as it skips a lot of the details, but the engineering is real, and the bugs are real too.
Technical details
The system decomposition follows the Zab protocol phases: Election, Discovery, Synchronization, Broadcast. This is a clever hack. Rather than inventing new module boundaries (say through roles or layers), the authors simply split along the natural phase boundaries in the protocol. The next-state action in the TLA+ spec is already a disjunction across phase actions; this decomposition just formalizes and exploits that.
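Schematically (the action names here are mine, following the paper's description), the top-level next-state relation is already phase-structured, so the module boundaries fall out for free:

$$ \mathit{Next} \;\triangleq\; \mathit{ElectionNext} \,\lor\, \mathit{DiscoveryNext} \,\lor\, \mathit{SyncNext} \,\lor\, \mathit{BroadcastNext} $$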
They focus verification on log replication involving the Synchronization and Broadcast phases, while leaving Election and Discovery mostly coarse. They say that's where the bugs are. They also mention that the leader election is a tangle of nondeterminism and votes-in-flight, and models take days to check. Log replication, by contrast, is where local concurrency hits hard, and where many past bugs and failed fixes appear.
Table 1 summarizes the spec configurations: baseline (system spec), coarsened, and fine-grained. The fine-grained specs are where the action is: they capture atomicity violations, thread interleavings, and missing transitions. These bugs don’t show up when model checking the baseline spec. It all comes down to how fine a granularity you are willing to model to capture the implementation, and targeted fine-grained checking is what keeps this practical.
Table 5 shows model-checking performance. Coarse specs are fast but miss bugs. Fine specs find bugs but blow up. Mixed specs get the best of both. For instance, mSpec-3 finds deep bugs in seconds, while the full system spec doesn’t terminate in 24 hours. The bottleneck is the leader election spec, which shows again that targeted coarsening pays off when it preserves interaction.
Bug detection is driven by model checking with TLC, which reports invariant violations as counterexample traces. The violations found are then confirmed via code-level deterministic replay. The authors built a system (Remix) that instruments ZooKeeper, injects RPC hooks via AspectJ, and replays TLA-level traces using a centralized coordinator. The coordinator schedules and controls the interleaving of code-level actions using a developer-provided mapping from model actions to code events. This deterministic replay lets them validate that a model-level bug can actually manifest in real runs.
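To make the coordinator idea concrete, here is a minimal sketch in the spirit of that replay loop (names and structure are mine, not Remix's actual code): each instrumented code-level event blocks until the model-level trace says it is its turn.

```python
import threading

class ReplayCoordinator:
    """Replays a model-level trace by releasing code-level events in order.
    `trace` is the action sequence from a TLC counterexample; `mapping` is
    the developer-provided map from model actions to the code-level events
    that instrumentation hooks (e.g. AspectJ advice) report. Illustrative
    sketch only."""

    def __init__(self, trace, mapping):
        self.schedule = [mapping[action] for action in trace]
        self.next_idx = 0
        self.cv = threading.Condition()

    def await_turn(self, event):
        # Called from an instrumentation hook: block the calling thread
        # until `event` is the next event the model-level trace allows.
        with self.cv:
            while (self.next_idx < len(self.schedule)
                   and self.schedule[self.next_idx] != event):
                self.cv.wait()
            self.next_idx += 1
            self.cv.notify_all()
```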
Figure 8 ties back to the core atomicity problem in FollowerProcessNEWLEADER. Many of these bugs arise when the update of the epoch and the logging of the history are not properly sequenced or observed. The spec treats them as one unit, but the implementation spreads them across threads and queues. By splitting the spec action into three, the authors model each thread’s contribution separately and capture the inter-thread handoff. With this, they catch data loss, inconsistency, and state corruption bugs that had previously escaped detection, or worse, had been "fixed" and re-broken.
The authors didn’t stop at finding bugs. They submitted fixes upstream to ZooKeeper, validating them with the same fine-grained specifications. To fix the existing bugs and make correct implementation easier, they dropped the Zab protocol’s requirement that the two updates be atomic and instead required an order: the follower updates its history before updating its epoch. The patched version passed their model checks, and the PR was merged. This tight loop (spec, bug, fix, verify) is what makes this method stand out: formal methods not as theory, but as a workflow.
Conclusions
The results back the thesis: fine-grained modeling is essential to catch real bugs. Coarsening is essential to make model checking scale. Modularity and compositionality are a feasible way to manage verification of a complex, concurrent, evolving system. It's not push-button verification. But it's doable and useful.
The work opens several directions. Could this technique support test generation from the mixed model? If you already have deterministic replay and instrumentation hooks, generating inputs for fault-injection testing seems within reach.
More speculatively, are there heuristics to suggest good modular cuts? The current modularization is protocol-phase aligned, but could role-based, data-centric, or thread-boundary modularization give better results? That seems like a good area for exploration.
Behind the Scenes: How Percona Support Diagnosed a MongoDB FTDC Freeze
April 21, 2025
Speeding Up Percona XtraDB Cluster State Transfers with Kubernetes Volume Snapshots
April 20, 2025
Transactions are a protocol
Transactions are not an intrinsic part of a storage system. Any storage system can be made transactional: Redis, S3, the filesystem, etc. Delta Lake and Orleans demonstrated techniques to make S3 (or cloud storage in general) transactional. Epoxy demonstrated techniques to make Redis (and any other system) transactional. And of course there's always good old Two-Phase Commit.
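To make "protocol" concrete, here is a deliberately minimal two-phase commit sketch in Python (toy names; it omits the durable coordinator log and crash recovery that real 2PC requires). Any store that can stage, commit, and discard writes can participate.

```python
class Participant:
    """A toy resource manager: any store that can stage writes (prepare),
    then commit or discard them, can join the protocol."""

    def __init__(self):
        self.data = {}     # committed state
        self.staged = {}   # txid -> pending writes

    def prepare(self, txid, writes):
        # Phase 1: durably stage the writes and vote yes.
        self.staged[txid] = dict(writes)
        return True

    def commit(self, txid):
        # Phase 2: make the staged writes visible.
        self.data.update(self.staged.pop(txid))

    def abort(self, txid):
        # Phase 2 (failure path): discard staged writes, if any.
        self.staged.pop(txid, None)


def two_phase_commit(txid, participants, writes):
    # Phase 1: every participant must vote yes before anyone commits.
    if all(p.prepare(txid, writes) for p in participants):
        for p in participants:
            p.commit(txid)
        return True
    for p in participants:
        p.abort(txid)
    return False


stores = [Participant(), Participant()]
assert two_phase_commit("tx1", stores, {"k": "v"})
```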
If you don't want to read those papers, I wrote about a simplified implementation of Delta Lake and also wrote about a simplified MVCC implementation over a generic key-value storage layer.
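In the same spirit, a toy sketch of MVCC over a generic key-value layer (much simplified from those posts: snapshot reads only, no conflict detection or commit protocol). Every write creates a new version; readers see the newest version at or below their snapshot.

```python
import itertools

class MVCCStore:
    """Toy multi-version store over a plain dict: writes never overwrite,
    they append a new (key, version) entry. Illustrative sketch only."""

    def __init__(self):
        self.kv = {}                      # (key, version) -> value
        self.clock = itertools.count(1)   # global version counter

    def begin(self):
        # A transaction's snapshot is just the current clock value.
        return next(self.clock)

    def write(self, key, value):
        self.kv[(key, next(self.clock))] = value

    def read(self, key, snapshot):
        # Newest version at or below the snapshot wins.
        versions = [v for (k, v) in self.kv if k == key and v <= snapshot]
        return self.kv[(key, max(versions))] if versions else None


store = MVCCStore()
store.write("x", "old")
snap = store.begin()
store.write("x", "new")
print(store.read("x", snap))   # "old": writes after the snapshot are invisible
```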
It is both the beauty and the burden of transactions that they are not intrinsic to a storage system. Postgres and MySQL and SQLite have transactions. But you don't need to use them. It isn't possible to require you to use transactions. Many developers, myself a few years ago included, do not know why you should use them. (Hint: read Designing Data Intensive Applications.)
And you can take it even further by ignoring the transaction layer of an existing transactional database and implementing your own transaction layer on top, as Convex has done (the Epoxy paper above also does this). It isn't clear you lose much by doing so, since the indexes you'd want on the version field of a value would only be as expensive or slow as any other secondary index in a transactional database. Why you'd do this, though, is less obvious (I would like to read about this from Convex some time).
It's useful to see transaction protocols as another tool in your system design tool chest when you care about consistency, atomicity, and isolation. Especially as you build systems that span data systems. Maybe, as Ben Hindman hinted at the last NYC Systems, even proprietary APIs will eventually provide something like two-phase commit so physical systems outside our control can become transactional too.
April 19, 2025
Battle of the Mallocators: part 2
This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:
- MALLOC_ARENA_MAX with glibc malloc
  - see here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (8 × the number of CPU cores), so I tested it with 1, 8, 48 and 96 arenas.
- compiling RocksDB and MyRocks with jemalloc-specific code enabled
  - In my previous results I just set malloc-lib in my.cnf, which uses LD_LIBRARY_PATH to link with your favorite malloc library implementation.
- For mysqld with jemalloc enabled via malloc-lib (LD_LIBRARY_PATH) versus mysqld with jemalloc-specific code enabled:
  - performance, VSZ and RSS were similar
- After setting rocksdb_cache_dump=0 in the binary with jemalloc-specific code:
  - performance is slightly better (excluding the outlier, the benefit is up to 3%)
  - peak VSZ is cut in half
  - peak RSS is reduced by ~9%
- With 1 arena, performance is lousy but the RSS bloat is mostly solved
- With 8, 48 or 96 arenas, the RSS bloat is still there
- With 48 arenas there are still significant (5% to 10%) performance drops
- With 96 arenas the performance drop was mostly ~2%
The builds with jemalloc-specific code enabled add these CMake flags: -DHAVE_JEMALLOC=1 -DWITH_JEMALLOC=1
The MALLOC_ARENA_MAX tests apply this diff to the test script, setting the environment variable on the command being run:

182a183,184
> cmd="MALLOC_ARENA_MAX=1 $cmd"
> echo Run :: $cmd
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
  - the base case in the table below
  - what I used in my previous post: jemalloc is enabled via setting malloc-lib in my.cnf, which uses LD_LIBRARY_PATH
- fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
  - col-1 in the table below
  - MySQL with jemalloc-specific code enabled at compile time
- fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
  - col-2 in the table below
  - MySQL with jemalloc-specific code enabled at compile time, plus rocksdb_cache_dump=0 added to my.cnf
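For concreteness, here is roughly where those two knobs live in my.cnf (a sketch; the jemalloc library path varies by platform):

```
[mysqld_safe]
# base case: jemalloc loaded at runtime via malloc-lib
malloc-lib=/usr/lib64/libjemalloc.so.2

[mysqld]
# only meaningful once jemalloc-specific code is compiled in (col-2)
rocksdb_cache_dump=0
```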
Results are reported as relative QPS: (QPS with $allocator) / (QPS with glibc malloc)
- results in col-1 are similar to the base case, so compiling in the jemalloc-specific code didn't help performance
- results in col-2 are slightly better than the base case, with one outlier (hot-points), so consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc-specific code
- jemalloc.1
  - the base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
- jemalloc.2
  - col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
  - this has little impact on VSZ and RSS
- jemalloc.3
  - col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
  - this cuts peak VSZ in half and reduces peak RSS by ~9%
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
  - the base case in the table below
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
  - col-1 in the table below
  - uses MALLOC_ARENA_MAX=1
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
  - col-2 in the table below
  - uses MALLOC_ARENA_MAX=8
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
  - col-3 in the table below
  - uses MALLOC_ARENA_MAX=48
- fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
  - col-4 in the table below
  - uses MALLOC_ARENA_MAX=96
As above, results are relative QPS: (QPS with $allocator) / (QPS with glibc malloc)
- performance with 1 or 8 arenas is lousy
- performance drops some (often 5% to 10%) with 48 arenas
- performance drops ~2% with 96 arenas