a curated list of database news from authoritative sources

May 20, 2025

Achieve up to 1.7 times higher write throughput and 1.38 times better price performance with Amazon Aurora PostgreSQL on AWS Graviton4-based R8g instances

In this post, we demonstrate how upgrading to Graviton4-based R8g instances running Aurora PostgreSQL-Compatible 17.4 on the Aurora I/O-Optimized cluster configuration can deliver significant price-performance gains: up to 1.7 times higher write throughput, 1.38 times better price-performance, and commit latency reduced by up to 46% on r8g.16xlarge instances and 38% on r8g.2xlarge instances, compared to Graviton2-based R6g instances.

InnoDB Cluster: Set Up Router and Validate Failover

Setting up an InnoDB Cluster requires three key components: Group Replication, MySQL Shell, and MySQL Router. In the previous post, we covered the process of building a 3-node InnoDB Cluster. In this post, we shift our focus to configuring MySQL Router and validating failover functionality. Environment overview We are using three InnoDB Cluster nodes along […]

May 19, 2025

An Introduction to Dictionary Operations in Data Masking Component

In this blog post, we will describe typical usage scenarios for dictionary operations in the Data Masking Component, which is available in Percona Server for MySQL as an open source alternative to Oracle’s enterprise version. In particular, we will consider the following functions. gen_dictionary() – a function that returns a random term from a dictionary. gen_blocklist() – […]

May 18, 2025

RocksDB 10.2 benchmarks: large & small servers with cached workload

I previously shared benchmark results for RocksDB using the largest server that I have. In this post I share more results from two other large servers and one small server. The classification is arbitrary, but I mean >= 20 cores for large, 10 to 19 cores for medium, and fewer than 10 cores for small.

tl;dr

  • There are several big improvements
  • There might be a small regression in fillseq performance; I will revisit this
  • For the block cache, hyperclock does much better than LRU on CPU-bound tests

Software

I used RocksDB versions 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3 and 10.2.1. Everything was compiled with gcc 11.4.0.

For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument (see the sketch after this list):

  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+
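
A minimal sketch (mine, not the actual benchmark scripts) of how the --cache_type value maps to a RocksDB version per the list above. Since 8.5 is named in both ranges above, I resolve it to auto_hyper_clock_cache here.

    # Sketch: choose the db_bench --cache_type value for a RocksDB version.
    def cache_type_for(version: str) -> str:
        major, minor = (int(x) for x in version.split(".")[:2])
        if (major, minor) <= (7, 6):
            return "lru_cache"              # hyperclock not available yet
        if (major, minor) < (8, 5):
            return "hyper_clock_cache"      # 7.7 through 8.5 (older name)
        return "auto_hyper_clock_cache"     # 8.5+ (newer name)

    for v in ("6.29.5", "7.10.2", "8.11.4", "10.2.1"):
        print(f"db_bench --cache_type={cache_type_for(v)}  # RocksDB {v}")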

Hardware

My servers are described here. From that list I used:

  • The small server is a Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM. It is v5 in the blog post.
  • The first large server has 24 cores with 64G of RAM. It is v6 in the blog post.
  • The other large server has 32 cores and 128G of RAM. It is v7 in the blog post.

Benchmark

Overviews on how I use db_bench are here and here.

Tests were run for a workload with the database cached by RocksDB that I call byrx in my scripts.

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys using many threads

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for base version)

The base version varies and is listed below. When the relative QPS is > 1.0, my version is faster than the base version. When it is < 1.0, there might be a performance regression, or there might just be noise.
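
As a tiny worked example (the QPS numbers here are invented for illustration), the relative QPS values in the tables below are computed like this:

    # Relative QPS = (QPS for my version) / (QPS for the base version).
    base_qps = {"fillseq": 100_000, "readwhilewriting": 250_000}
    my_qps = {"fillseq": 98_000, "readwhilewriting": 300_000}

    for test in base_qps:
        rel = my_qps[test] / base_qps[test]
        # > 1.0: my version is faster; < 1.0: regression or just noise.
        print(f"{rel:.2f}\t{test}")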

Small server

The benchmark was run using 1 client thread and 20M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

For the byrx (cached database) workload with the LRU block cache:

  • see relative and absolute performance summaries; the base version is RocksDB 6.29.5
  • fillseq is ~14% faster in 10.2 vs 6.29, with improvements in 7.x and 9.x
  • revrangeww and fwdrangeww are ~6% slower in 10.2 vs 6.29; I might revisit this
  • readww has similar performance from 6.29 through 10.2
  • overwrite is ~14% faster in 10.2 vs 6.29, with most of the improvement in 7.x

For the byrx (cached database) workload with the Hyper Clock block cache

  • see relative and absolute performance summaries, the base version is RocksDB 8.11.4
  • there might be a small regression (~3%), or it might just be noise in the results

Results from RocksDB 10.2.1 show the relative QPS for 10.2 with the Hyper Clock block cache versus 10.2 with the LRU block cache. Here the QPS for revrangeww, fwdrangeww and readww is ~10% better with Hyper Clock.

relQPS  test
0.99    fillseq.wal_disabled.v400
1.09    revrangewhilewriting.t1
1.13    fwdrangewhilewriting.t1
1.15    readwhilewriting.t1
0.96    overwriteandwait.t1.s0

Large server (24 cores)

The benchmark was run using 16 client threads and 40M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

For the byrx (cached database) workload with the LRU block cache

  • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
  • fillseq might have a new regression of ~4% in 10.2.1, or that might be noise; I will revisit this
  • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

For the byrx (cached database) workload with the Hyper Clock block cache

  • see relative and absolute performance summaries, the base version is RocksDB 8.11.4
  • fillseq might have a new regression of ~8% in 10.2.1, or that might be noise; I will revisit this
  • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

Results from RocksDB 10.2.1 show the relative QPS for 10.2 with the Hyper Clock block cache versus 10.2 with the LRU block cache. Hyper Clock is much better for workloads that frequently access the block cache from multiple threads.

relQPS  test
0.97    fillseq.wal_disabled.v400
1.35    revrangewhilewriting.t16
1.43    fwdrangewhilewriting.t16
1.69    readwhilewriting.t16
0.97    overwriteandwait.t16.s0

Large server (32 cores)

The benchmark was run using 24 client threads and 50M KV pairs. Each benchmark step was run for 1800 seconds. Performance summaries are here.

For the byrx (cached database) workload with the LRU block cache

  • see relative and absolute performance summaries, the base version is RocksDB 6.29.5
  • fillseq might have a new regression of ~10% from 7.10 through 10.2; I will revisit this
  • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

For the byrx (cached database) workload with the Hyper Clock block cache

  • see relative and absolute performance summaries, the base version is RocksDB 7.10.2
  • fillseq might have a new regression of ~10% from 7.10 through 10.2; I will revisit this
  • revrangeww, fwdrangeww, readww and overwrite are mostly unchanged since 8.x

Results from RocksDB 10.2.1 show the relative QPS for 10.2 with the Hyper Clock block cache versus 10.2 with the LRU block cache. Hyper Clock is much better for workloads that frequently access the block cache from multiple threads.

relQPS  test
1.02    fillseq.wal_disabled.v400
1.39    revrangewhilewriting.t24
1.55    fwdrangewhilewriting.t24
1.77    readwhilewriting.t24
1.00    overwriteandwait.t24.s0

Chapter 3: Two Phase Locking (Concurrency Control Book)

Chapter 3 presents two-phase locking (2PL). Remember I told you in Chapter 2: Serializability Theory that the discussion is very scheduler-centric? Well, this is a deeper dive into the scheduler, using 2PL as the concurrency control mechanism. The chapter examines the design trade-offs in scheduler behavior, proves the correctness of basic 2PL, dissects how deadlocks arise and are handled, and discusses many variations and implementation issues.

Here are the section headings in Chapter 3.

  1. Aggressive and Conservative Schedulers
  2. Basic Two Phase Locking
  3. Correctness of Basic Two Phase Locking
  4. Deadlocks
  5. Variations of Two Phase Locking
  6. Implementation Issues
  7. The Phantom Problem
  8. Locking Additional Operators
  9. Multigranularity Locking
  10. Distributed Two Phase Locking
  11. Distributed Deadlocks
  12. Locking Performance
  13. Tree Locking

Yep, this is a long chapter: 65 pages. 


3.1 Aggressive and Conservative Schedulers

The chapter opens by asking: how aggressive or conservative should a scheduler be? An aggressive scheduler tries to schedule operations immediately, risking aborts later. A conservative scheduler delays operations to avoid future conflicts, often needlessly.

This choice affects throughput, latency, and starvation. If conflicts are rare, aggressive wins. If conflicts are frequent or costly, conservative may be preferable.

The book says: "A very conservative version of any type of scheduler can usually be built if transactions predeclare their readsets and writesets. This means that the TM begins processing a transaction by giving the scheduler the transaction’s readset and writeset. Predeclaration is more easily and efficiently done if transactions are analyzed by a preprocessor, such as a compiler, before being submitted to the system, rather than being interpreted on the fly."

This reminded me of Chardonnay, OSDI'23, where Bernstein is an author. (Here is my review/summary of it.) Isn't that an interesting follow-up to the above paragraph, 36 years later?


3.2 Basic Two Phase Locking

Three rules define 2PL: don’t grant conflicting locks, don’t release until the data manager (DM) acknowledges, and once you release a lock, you’re done acquiring. These encode the "growing" and "shrinking" phases that give 2PL its name.

Here is the notation the book uses: `rl_i[x]` denotes a read lock by transaction Ti on item x, `wl_i[x]` a write lock, and `wu_i[x]` means Ti releases the write lock. These are distinct from the operations themselves (like `r_i[x]` or `w_i[x]`), and the system must track both.
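
Here is a minimal sketch (mine, not the book's) of the three rules in Python, where lock() plays the role of rl_i[x]/wl_i[x] and unlock() the role of wu_i[x]. Blocking, queuing, and the DM acknowledgment handshake of rule 2 are elided.

    class TwoPhaseError(Exception):
        pass

    class Scheduler:
        def __init__(self):
            self.locks = {}         # item x -> {txn i: "r" or "w"}
            self.shrinking = set()  # txns that have released some lock

        def lock(self, i, x, mode):
            if i in self.shrinking:
                raise TwoPhaseError(f"rule 3: T{i} already released a lock")
            holders = self.locks.setdefault(x, {})
            for j, held in holders.items():
                if j != i and "w" in (mode, held):
                    raise TwoPhaseError(f"rule 1: conflicting lock on {x}")
            holders[i] = "w" if "w" in (mode, holders.get(i, "r")) else "r"

        def unlock(self, i, x):
            # Rule 2: a real scheduler releases only after the DM acks the op.
            del self.locks[x][i]
            self.shrinking.add(i)   # T_i's shrinking phase has begun

    s = Scheduler()
    s.lock(1, "x", "w")       # wl_1[x]
    s.unlock(1, "x")          # wu_1[x]
    try:
        s.lock(1, "y", "r")   # rl_1[y] after a release: violates rule 3
    except TwoPhaseError as e:
        print(e)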

The figure below shows an example of what would go wrong if rule 3 is violated.


3.3 Correctness of Basic Two Phase Locking

In this section, the authors prove that histories produced by 2PL schedulers are serializable. The proof hinges on showing that the serialization graph (SG) is acyclic. They derive this from the locking orderings induced by the scheduler's behavior. The key idea is that if T1 precedes T2 in SG, then T1 must have released a lock before T2 acquired a conflicting one. Since T1 does not acquire any locks after releasing that one, this precludes cycles in the SG.


3.4 Deadlocks

2PL enables serializability, but at the cost of deadlocks. The section opens with classic 2PL-induced deadlocks: mutual lock upgrades. Then it dives into detection via Waits-For Graphs (WFGs), where the idea is to add an edge when one transaction waits on another, and detect deadlocks by finding cycles.

Timeouts and explicit detection are then compared. Timeouts are crude but simple (deadlocks don’t vanish on their own, so detection can safely be delayed). WFG cycle detection is precise but expensive. The section discusses how often to run detection, how to tune timeouts, and how to pick a victim.
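
A sketch (mine) of WFG-based detection: add an edge when one transaction waits on another, then look for a cycle. Since deadlocks don’t vanish on their own, it is safe to run this periodically rather than on every block; a victim is then picked from the returned cycle.

    def find_cycle(wfg):
        """Return a list of transactions forming a cycle in the WFG, or None.
        wfg maps a transaction to the set of transactions it waits for."""
        visited, on_stack = set(), []

        def dfs(t):
            if t in on_stack:
                return on_stack[on_stack.index(t):]   # found a cycle
            if t in visited:
                return None
            visited.add(t)
            on_stack.append(t)
            for u in wfg.get(t, ()):
                cycle = dfs(u)
                if cycle:
                    return cycle
            on_stack.pop()
            return None

        for t in list(wfg):
            cycle = dfs(t)
            if cycle:
                return cycle
        return None

    # T1 waits for T2 and T2 waits for T1: the classic mutual lock upgrade.
    print(find_cycle({1: {2}, 2: {1}}))  # -> [1, 2]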


3.5 Variations of Two Phase Locking

Conservative 2PL requires that a transaction predeclare all the data it may access and acquire all of its locks up front, before execution begins. If even one lock is unavailable, the transaction acquires none of them and waits. This avoids deadlocks entirely but sacrifices concurrency and assumes data access is predictable.

Strict 2PL, used in nearly all real systems, releases locks only at commit time. This is stricter than the basic 2PL we discussed in 3.2, where the shrinking phase could happen gradually. But in return, this guarantees strictness and recoverability. To avoid cascading aborts, write locks are held until after commit; read locks can be released a bit earlier.


3.6 Implementation Issues

This section gives some guidelines (as of 1987) for implementing a lock manager, handling blocking and aborting transactions, and ensuring atomicity of reads and writes. 


3.7 The Phantom Problem

The phantom problem is set up nicely. Most real databases can dynamically grow and shrink. In addition to Read and Write, they usually support Insert and Delete. Index locking is introduced as a way to lock ranges instead of records. This helps detect conflicts for inserts into a scanned range.

Two locks conflict if there could be a record that satisfies both predicates, i.e., the predicates are mutually satisfiable. This is predicate locking. While more general than index locking, it's more expensive, as it requires the lock manager (LM) to reason about predicates. Predicate locking is rare in practice.

We had discussed index locking and predicate locking in Chapter 7 of the Gray/Reuters book, so I defer to that discussion.


3.8 Locking Additional Operators

Similar to how Chapter 2 extended serializability theory beyond read/write to include increment/decrement, Chapter 3 does the same for locking. I really like this treatment and generalization opportunity!

To add new operation types, follow these rules (a sketch follows the list):

  • Make each new operation atomic w.r.t. all others.
  • Define a lock type for each operation.
  • Build a compatibility matrix where two lock types conflict iff their corresponding operations don’t commute.
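
A small sketch (mine) of the third rule for the book's Read/Write/Increment/Decrement example: derive the compatibility matrix from commutativity. Which pairs commute is my reading of the text (Inc and Dec return no value, so they commute with each other and with themselves).

    OPS = ["Read", "Write", "Inc", "Dec"]
    # Pairs of operations that commute; everything else conflicts.
    COMMUTES = {
        frozenset(["Read"]),          # Read/Read
        frozenset(["Inc"]),           # Inc/Inc
        frozenset(["Dec"]),           # Dec/Dec
        frozenset(["Inc", "Dec"]),    # Inc/Dec
    }

    def compatible(a, b):
        # Two lock types conflict iff their operations don't commute.
        return frozenset([a, b]) in COMMUTES

    # Print the matrix: "+" means compatible, "-" means conflict.
    print("     " + " ".join(f"{op:>5}" for op in OPS))
    for a in OPS:
        cells = " ".join(f"{'+' if compatible(a, b) else '-':>5}" for b in OPS)
        print(f"{a:>5} {cells}")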


3.9 Multigranularity Locking

Multi-granularity locking (MGL) allows each transaction to use granule sizes most appropriate to its behavior. Long transactions can lock coarser items like files; short transactions can lock individual records. In this way, long transactions don’t waste time setting too many locks, and short transactions don’t block others by locking large portions of the database that they don’t access. This balances overhead and concurrency.

We assume the lock instance graph is a tree. A lock on a coarse granule x implicitly locks all its descendants. For instance, a read lock on an area implicitly read-locks its files and records.

To make fine-grained locks compatible with this, we use intention locks. Before locking a record x, the scheduler sets intention locks (`ir`, `iw`) on x’s ancestors (database, area, file). This prevents conflicting coarse-grain locks from being acquired concurrently.
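
A sketch (mine, using the standard MGL compatibility rules) of how intention locks are set top-down before locking a record:

    # Which lock pairs may coexist on the same node (standard MGL matrix,
    # without the riw mode): w conflicts with everything.
    COMPAT = {
        ("ir", "ir"), ("ir", "iw"), ("ir", "r"),
        ("iw", "ir"), ("iw", "iw"),
        ("r", "ir"), ("r", "r"),
    }

    def acquire(locks, path, mode):
        """Lock the last node of `path` in `mode` ("r" or "w"), setting
        intention locks on every ancestor. `locks`: node -> held modes."""
        intention = "ir" if mode == "r" else "iw"
        for node in path[:-1]:                    # ancestors, root first
            for held in locks.setdefault(node, []):
                if (held, intention) not in COMPAT:
                    raise RuntimeError(f"blocked at {node}: {held} vs {intention}")
            locks[node].append(intention)
        target = path[-1]
        for held in locks.setdefault(target, []):
            if (held, mode) not in COMPAT:
                raise RuntimeError(f"blocked at {target}: {held} vs {mode}")
        locks[target].append(mode)

    locks = {}
    acquire(locks, ["db", "area1", "file3", "record42"], "w")
    print(locks)  # iw on db, area1, file3; w on record42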


3.10 Distributed Two Phase Locking

Section 3.10 explains how Strict 2PL makes distributed concurrency control simpler. With Basic 2PL, one site might release locks while another is still acquiring them, which breaks the two-phase rule unless schedulers coordinate. But with Strict 2PL, locks are only released after commit, so once a scheduler sees the commit, it knows the transaction is done acquiring locks. No need to talk to other sites. The TM only sends commit after all operations are done, so this setup guarantees that each site’s local two-phase rule lines up with the global one. Strict 2PL gives you correctness without extra coordination.


3.11 Distributed Deadlocks

Distributed deadlocks arise when the global Waits-For Graph (WFG), formed by uniting local WFGs, contains a cycle even if each local WFG is acyclic. Detecting global cycles requires message exchanges among sites. A site may initiate detection when it sees a potential cycle, collecting dependency information and detecting loops.

A deadlock prevention method is to use timestamps. When a transaction T1 is about to wait for T2, the system checks their timestamps. If the wait would lead to a possible deadlock, it aborts one of them.

  • In wait-die, older transactions wait for younger ones; younger ones abort if they encounter older ones.
  • In wound-wait, older transactions preempt younger ones by aborting them; younger transactions are allowed to wait for older ones.

Wait-die favors younger transactions but can cause repeated aborts; wound-wait favors older transactions and avoids starvation.
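
As a sketch (mine), both policies reduce to a timestamp comparison when a requester finds a lock held (smaller timestamp = older transaction):

    def wait_die(requester_ts, holder_ts):
        # Older transactions may wait; younger ones die (abort, restart later).
        return "wait" if requester_ts < holder_ts else "abort requester"

    def wound_wait(requester_ts, holder_ts):
        # Older transactions wound (abort) younger holders; younger ones wait.
        return "abort holder" if requester_ts < holder_ts else "wait"

    # T1 (ts=10, older) meets T2 (ts=20, younger):
    print(wait_die(10, 20))    # wait
    print(wait_die(20, 10))    # abort requester
    print(wound_wait(10, 20))  # abort holder
    print(wound_wait(20, 10))  # wait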


3.13 Tree Locking

Suppose data items are structured as nodes in a tree, and transactions always follow tree paths. The scheduler can exploit this predictability using Tree Locking (TL), rather than requiring the grow/shrink discipline of 2PL. That is, a transaction can release a lock and later acquire another, so long as it follows the tree order (root to leaf).

TL resembles resource ordering schemes in OSes for deadlock avoidance. It ensures deadlock freedom. Once a transaction locks all children of a node it needs, it can safely release the lock on the parent. This can improve performance by holding locks for shorter durations. But it only makes sense if transactions follow predictable root-to-leaf access patterns. Otherwise, TL can reduce concurrency among transactions.
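
Here is a minimal sketch (mine) of the simplest case, lock crabbing down a single root-to-leaf path; real TL may hold a parent until all needed children are locked:

    import threading

    node_locks = {}  # node name -> threading.Lock

    def lock_path(path):
        """Lock nodes along `path` in tree order, releasing each parent
        once its child is locked; returns the still-held leaf lock."""
        prev = None
        for node in path:
            lk = node_locks.setdefault(node, threading.Lock())
            lk.acquire()            # lock child while parent is still held
            if prev is not None:
                prev.release()      # parent can go: child is already locked
            prev = lk
        return prev

    leaf_lock = lock_path(["root", "interior", "leaf"])
    # ... operate on the leaf ...
    leaf_lock.release()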

TL must also be extended if we want recoverability, strictness, or to avoid cascading aborts. For example, writes should be held until commit. In practice, most updates are on leaf nodes, making this workable.

This is a niche idea, but when applicable, it's elegant. I like it! The section closes by discussing locking in B-trees as an application.


May 16, 2025

Freedom and Flexibility: Rethinking Your MongoDB Cloud Strategy Beyond Atlas

Let’s be honest: Getting MongoDB up and running quickly in the cloud sounds fantastic. Services like MongoDB Atlas promise easy deployment, automated scaling, and hands-off management on AWS, Azure, and GCP. For teams looking to shed operational burdens, the appeal is tempting. Click a few buttons, get a database… what’s not to like? However, as […]

May 15, 2025

New in Percona Everest 1.6.0: Easily Deploy PMM with a Single Helm Command

Monitoring your databases is critical, especially in Kubernetes environments where visibility and automation are key. That’s why, in Percona Everest 1.6.0, we introduced a highly requested feature: the ability to automatically deploy Percona Monitoring and Management (PMM) as part of the Everest Helm chart using just one flag. This simplifies the process for teams who […]

Query Response Time

Context

This is why you need to learn MySQL query response time:

Performance is query response time.

This book explores that idea from various angles with a single intent: to help you achieve remarkable MySQL performance. Efficient MySQL performance means focusing on the best practices and techniques that directly affect MySQL performance—no superfluous details or deep internals required by DBAs and experts. I presume that you’re a busy professional who is using MySQL, not managing it, and that you need the most results for the least effort. That’s not laziness, that’s efficiency. To that end, this book is direct and to the point. And by the end, you will be able to achieve remarkable MySQL performance.

Indexes and Indexing

Context

This is why you need to learn MySQL indexes and indexing:

Many factors determine MySQL performance, but indexes are special because performance cannot be achieved without them. You can remove other factors—queries, schemas, data, and so on—and still achieve performance, but removing indexes limits performance to brute force: relying on the speed and capacity of hardware. If this book were titled Brute Force MySQL Performance, the contents would be as long as the title: "Buy better, faster hardware." You laugh but just a few days ago I met with a team of developers who had been improving performance in the cloud by purchasing faster hardware until stratospheric costs compelled them to ask, "How else can we improve performance?"

MySQL leverages hardware, optimizations, and indexes to achieve performance when accessing data. Hardware is an obvious leverage because MySQL runs on hardware: the faster the hardware, the better the performance. Less obvious and perhaps more surprising is that hardware provides the least leverage. I explain why in a moment. Optimizations refer to the numerous techniques, algorithms, and data structures that enable MySQL to utilize hardware efficiently. Optimizations bring the power of hardware into focus. And focus is the difference between a light bulb and a laser. Consequently, optimizations provide more leverage than hardware. If databases were small, hardware and optimizations would be sufficient. But increasing data size deleverages the benefits of hardware and optimizations. Without indexes, performance is severely limited.

Data and Access Patterns

Context

This is why you need to learn about MySQL data storage and access patterns:

Even when you master indexes and indexing, you will encounter queries that are simple and properly indexed but still slow. That’s when you begin to optimize around the query, starting with the data that it accesses. To understand why, let’s think about rocks.

Imagine that your job is to move rocks, and you have three piles of different sized rocks. The first pile contains pebbles: very light, no larger than your thumbnail. The second pile contains cobbles: heavy but light enough to pick up, no larger than your head. The third pile contains boulders: too large and heavy to pick up; you need leverage or a machine to move them. Your job is to move one pile from the bottom of a hill to the top. Which pile do you choose?

Sharding

Context

This is why you need to learn about sharding MySQL:

On a single instance of MySQL, performance depends on queries, data, access patterns, and hardware. When direct and indirect query optimization—assiduously applied—no longer deliver acceptable performance, you have reached the relative limit of single-instance MySQL performance for the application workload. To surpass that relative limit, you must divide the application workload across multiple instances of MySQL to achieve MySQL at scale.

Sharding a database is the common and widely used technique of scaling out (or, horizontal scaling): increasing performance by distributing the workload across multiple databases. (By contrast, scaling up, or vertical scaling, increases performance by increasing hardware capacity.) Sharding divides one database into many databases. Each database is a shard, and each shard is typically stored on a separate MySQL instance running on separate hardware. Shards are physically separate but logically the same (very large) database.

Server Metrics and InnoDB

Context

This is why you need to learn about MySQL server metrics, especially InnoDB metrics:

MySQL metrics are closely related to MySQL performance—that’s obvious. After all, the purpose of metrics in any system is to measure and report how the system is operating. What’s not obvious is how they are related. It’s not unreasonable if you currently see MySQL metrics as a black box with metrics inside that, in some way, indicate something about MySQL.

That view is not unreasonable (or uncommon) because MySQL metrics are often discussed but never taught. Even in my career with MySQL, I have never read or heard an exposition of MySQL metrics—and I have worked with people who created them. The lack of pedagogy for MySQL metrics is due to a false presumption that metrics do not require understanding or interpretation because their meaning is self-evident. That presumption has a semblance of truth when considering a single metric in isolation, like Threads_running: the number of threads running—what more is there to know? But isolation is the fallacy: MySQL performance is revealed through a spectrum of MySQL metrics.

Replication

Context

This is why you need to learn the basics of MySQL replication, especially the cause of replication lag:

Replication lag is the delay between the time when a write occurs on a source MySQL instance and the time when that write is applied on a replica MySQL instance. Replication lag is inherent to all database servers because replication across a network incurs network latency.

Replication lag is data loss. Seriously. Do not dismiss replication lag.

Transactions and Data Locks

Context

This is why you need to learn about MySQL transactions and data locks, which are closely related:

MySQL has non-transactional storage engines, but InnoDB is the default and the presumptive norm. Therefore, practically speaking, every MySQL query executes in a transaction by default, even a single SELECT statement.

From our point of view as programmers, transactions appear conceptual: BEGIN, execute queries, and COMMIT. Then we trust MySQL (and InnoDB) to uphold the ACID properties: atomicity, consistency, isolation, and durability. When the application workload—queries, indexes, data, and access patterns—is well optimized, transactions are a nonissue with respect to performance. (Most database topics are a nonissue when the workload is well optimized.) But behind the scenes, transactions invoke a whole new world of considerations because upholding ACID properties while maintaining performance is not an easy feat. Fortunately, MySQL shines at executing transactions.

Common Challenges

Context

This is why you need to learn about common challenges when using MySQL:

This chapter is a short but important laundry list of common MySQL challenges and how to mitigate them. These challenges don’t fit into other chapters because most are not directly related to performance. But don’t underestimate them: the first two challenges, for example, can ruin a database. More importantly, these challenges are not special cases that only happen when the stars align and The Fates conspire to ruin your day. These are common challenges. Take them seriously, and expect to face them.

Excerpt from Efficient MySQL Performance, Chapter 9
Copyright 2021 Daniel Nichter
No reproduction of this excerpt without permission

Key Points

  • Split-brain occurs when two or more MySQL instances in the same replication topology are written to
  • Split-brain is a detriment to data integrity—the data can no longer be trusted
  • Data drift occurs when a replica becomes out of sync with the source
  • Data drift is real but the origin of the drift is virtually impossible to pinpoint
  • Data drift can be detected with pt-table-checksum
  • ORMs can generate very poor queries and hurt overall performance
  • Schemas always change, so an online schema change (OSC) tool is a must-have
  • There are three popular OSC tools: pt-online-schema-change, gh-ost, and Spirit
  • MySQL has non-standard SQL statements, options, and clauses
  • Applications do not fail gracefully by default; it takes effort to fail gracefully
  • High performance MySQL is difficult

Pitfalls

  • Not taking into account the key points above

Hack MySQL Articles

Additional Resources

  • InnoDB and Online DDL (MySQL manual): Since schema changes are routine for developers, it’s important to understand how MySQL (and InnoDB) handle them because there are many edge cases depending on the specific type of change.
  • Spirit (open source tool): The fastest, most advanced online schema change (OSC) tool available.
  • gh-ost (open source tool): The predecessor of Spirit. Both are solid tools, but use Spirit if possible.
  • pt-table-checksum (open source tool): Still the only tool that can detect data drift on replicas.
  • pt-online-schema-change (open source tool): Before gh-ost and Spirit, pt-online-schema-change (pt-osc) was the de facto standard tool. It's still widely used and referenced today, but as the author of pt-osc I strongly advise that you use Spirit instead. The only reason to use pt-osc is for rare (and risky) edge cases that Spirit (and usually gh-ost) do not support for good reason.

Cloud Performance

Context

This is why you need to learn about the reality of using MySQL in the cloud:

MySQL in the cloud is fundamentally the same MySQL that you know and love (or know and tolerate). In the cloud, best practices and techniques are not only true but eminently true because cloud providers charge for every byte and millisecond of work. Performance is money in the cloud.

I wish it were as simple as “optimize the workload and you’re done”, but MySQL in the cloud raises unique considerations. The goal is to know and mitigate these cloud considerations so that you can focus on MySQL, not the cloud. After all, the cloud is nothing special: behind the proverbial curtain, it’s physical servers in a data center running programs like MySQL.

May 14, 2025

Create a unit testing framework for PostgreSQL using the pgTAP extension

pgTAP (PostgreSQL Test Anything Protocol) is a unit testing framework that empowers developers to write and run tests directly within the database. In this post, we explore how to leverage the pgTAP extension for unit testing on Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition database, helping you build robust and reliable database applications.

Chapter 2: Serializability Theory (Concurrency Control Book)

Chapter 2 of Concurrency Control and Recovery in Database Systems (1987) by Bernstein, Hadzilacos, and Goodman is a foundational treatment of serializability theory. It is precise, formal, yet simple and elegant, a rare combination for foundational theory in a systems domain. Databases got lucky here: serializability theory is both powerful and clean.

The chapter builds up the theory step by step, introducing:

  1. Histories
  2. Serializable histories
  3. The Serializability Theorem
  4. Recoverability and its variants
  5. Generalized operations beyond reads/writes
  6. View equivalence

Each section motivates definitions clearly, presents tight formalism, and illustrates ideas with well-chosen examples.


2.1 Histories

This section lays the groundwork. It starts slow, and doesn't do anything fancy. It first defines what it means for the operations within a transaction to form a well-founded partial order. This intra-transaction ordering extends naturally to inter-transaction operations, forming a superset relation. Building on this, the book then defines a history as a partial order over operations from a set of transactions. Operations include reads (r₁[x]), writes (w₁[x]), commits (c₁), and aborts (a₁). Here the subscripts denote that these operations all belong to transaction T₁. A history models the interleaving of these operations from different transactions, respecting the per-transaction order and ensuring all conflicting operations are ordered.

The abstraction here is elegant. The model omits values, conditionals, assignments, anything not visible to the scheduler. Only the structure of dependencies matters. This minimalism is a strength: it gives just enough to reason about correctness. The section is also careful to handle incomplete histories (i.e., prefixes), which is crucial for modeling crashes and recovery.

The discussion is very scheduler-centric. Since the scheduler is the component responsible for enforcing serializability, the model is tailored to what the scheduler can observe and act on. But this leads to missed opportunities as I mention below in Section 2.6 View equivalence.


2.2 Serializable Histories

This is where things get real. The goal is to define when a concurrent execution (history) is “as good as” some serial execution. The section has a simple game plan: define equivalence between histories (conflict equivalence), define serial histories, and finally say a history is serializable if it is equivalent to some serial history.

The chapter introduces the serialization graph (SG). Nodes are committed transactions. Edges capture conflicts: if T_i writes x before T_j reads or writes x, we add an edge T_i → T_j.

The formalism is tight, but not heavy. A good example is Fig 2-1 (p. 30), which compares equivalent and non-equivalent histories. (Minor typo: second w₁[x] in H₃ should be w₁[y].)


Theorem 2.1 (Serializability Theorem): A history is serializable iff its serialization graph is acyclic.

That's a beautiful punchline: serializability reduces to a cycle check in the conflict graph.

Credit where it is due, and in this case it is Jim Gray again! At the end of the section there is an acknowledgment stating that the definition of equivalence and serializability used here and the Serializability Theorem are from "Gray, J.N., Lorie, R.A., Putzulo, G.R., Traiger, I.L. Granularity of Locks and Degrees of Consistency in a Shared Database. Research Report, IBM, September, 1975."


2.3 The Serializability Theorem

This section proves the big theorem about equivalence to a serial execution.

Intuitively, if SG(H) is acyclic, we can topologically sort it into a serial history equivalent to H. And if SG(H) has a cycle, then H cannot be serialized.

The figure shows an example. Edges in SG(H) constrain any valid serial order. A cycle means contradictory constraints, and hence, no valid serial order.
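
The theorem turns into a few lines of code. Here is a sketch (mine) that builds SG(H) from read/write conflicts and either topologically sorts it into an equivalent serial order or reports a cycle; the history encoding is an assumption for illustration:

    from graphlib import CycleError, TopologicalSorter

    def serialization_graph(history):
        """history: ordered list of (txn, op, item), op in {"r", "w"}.
        Edge ti -> tj if an op of ti conflicts with a later op of tj."""
        edges = {}
        for i, (ti, op1, x) in enumerate(history):
            for tj, op2, y in history[i + 1:]:
                if ti != tj and x == y and "w" in (op1, op2):
                    edges.setdefault(ti, set()).add(tj)
        return edges

    def serial_order(history):
        sg = serialization_graph(history)
        ts = TopologicalSorter()
        for t in {t for t, _, _ in history}:
            ts.add(t, *(p for p, succs in sg.items() if t in succs))
        try:
            return list(ts.static_order())  # an equivalent serial order
        except CycleError:
            return None                     # cycle in SG: not serializable

    # w1[x] r2[x] w2[y]: T1 -> T2, serializable as T1 T2.
    print(serial_order([(1, "w", "x"), (2, "r", "x"), (2, "w", "y")]))
    # r1[x] w2[x] r2[y] w1[y]: T1 -> T2 and T2 -> T1, a cycle.
    print(serial_order([(1, "r", "x"), (2, "w", "x"), (2, "r", "y"), (1, "w", "y")]))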


2.4 Recoverable Histories

Serializability ensures correctness in a crash-free world. But in practice, systems crash, transactions abort, and partial executions matter. To account for faults, this section defines increasingly strict notions:

Recoverable (RC): If T_j reads from T_i, then T_i must commit before T_j.

Avoids Cascading Aborts (ACA): T_j can only read from committed transactions.

Strict (ST): Reads and overwrites can happen only after the previous writer commits or aborts.

Theorem 2.2: ST ⊂ ACA ⊂ RC

The inclusion hierarchy is illustrated in Fig 2-2. The intersection of these properties with SR defines realistic schedules that can tolerate failures and support recovery.

Note that RC, ACA, and ST are prefix commit-closed properties; that is, if they hold for history H, they also hold for any prefix of H. Nice and clean.


2.5 Operations Beyond Reads and Writes

I was happy to see this section. The authors show how the theory generalizes to operations beyond reads and writes, as long as we redefine the notion of conflict appropriately.

Two operations conflict if the order in which they execute affects: the final state of the database, or the values returned by operations. Fig. 2-3 gives the compatibility matrix to capture conflict relations between Read/Write/Inc/Dec. Increment(x) adds 1 to data item x and Decrement(x) subtracts 1 from x. An Inc or Dec does not return a value to the transaction that issues it.

The example history H₁₁ shows how to construct SG(H) even with these new operations. The theory lifts cleanly. Since SG(H₁₁) is acyclic, the generalized Serializability Theorem says that H₁₁ is SR. It is equivalent to the serial history T₁ T₃ T₂ T₄, which can be obtained by topologically sorting SG(H₁₁). DAGs FTW!

Still, this is a limited discussion. It's not expressive enough to model more general commutativity. But the seeds of CRDTs are here. In CRDTs, you model operations based on whether they commute under all interleavings. There is an ongoing research direction trying to build a nice theory around more general operations and monotonicity concepts.


2.6 View Equivalence

This section introduces view serializability (VSR), a more permissive alternative to conflict serializability. Two histories are view equivalent if:

  • Every read in one history reads from the same write as in the other.
  • The final write to each data item is the same in both histories.

VSR allows more schedules than CSR. Consider the example below. T3 overwriting both x and y saves the day and masks the conflicts in T1 and T2. Interesting special case!

The book says that testing for VSR is NP-complete, and that kills its usefulness for schedulers. The authors argue that all practical schedulers use conflict serializability, and that view serializability is too expensive to enforce. Yet they acknowledge it’s conceptually useful, especially for reasoning about multicopy databases (Chapters 5 and 8).
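
Note that checking whether two given histories are view equivalent is easy; the NP-hardness comes from searching over all serial histories. A sketch (mine) of the two conditions, assuming each transaction reads an item at most once and reads of the initial value come from a pseudo-transaction 0:

    def reads_from(history):
        """Map (reader, item) -> the writer whose value that read observes."""
        last_writer, rf = {}, {}
        for t, op, x in history:
            if op == "r":
                rf[(t, x)] = last_writer.get(x, 0)
            else:
                last_writer[x] = t
        return rf

    def final_writes(history):
        """Map item -> the transaction that writes it last."""
        return {x: t for t, op, x in history if op == "w"}

    def view_equivalent(h1, h2):
        return (reads_from(h1) == reads_from(h2)
                and final_writes(h1) == final_writes(h2))

    h1 = [(1, "w", "x"), (1, "w", "y"), (2, "r", "x"), (2, "w", "x")]
    h2 = [(1, "w", "x"), (2, "r", "x"), (1, "w", "y"), (2, "w", "x")]
    print(view_equivalent(h1, h2))  # True: same reads-from, same final writes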

Reading this section, I had an interesting question. Is view serializability the same as client-centric consistency? They seem very similar. Client-centric consistency, as discussed in Crooks et al., PODC 2017, defines correctness in terms of what reads observe. VSR is similarly observational.

So while VSR wasn’t practical for scheduling, it found a second life in defining observational consistency in distributed systems. I think there's more fruit on this tree.

Configuring PgBouncer auth_type with Trust and HBA: Examples and Known Issues

PgBouncer is a lightweight external connection pooler that can be introduced between an application and a PostgreSQL database. It performs its own user authentication, maintains its own database of users, and uses auth_type options to authenticate them. This blog post explains the PgBouncer auth_type trust and hba use cases with configuration examples and known issues. […]

Announcing the CedarDB Community Edition

Almost a year ago, we launched our company with our “Ode to Postgres”, with the goal of making the fruits of the highly successful Umbra research project available to you and everyone else. Umbra was created to incorporate all the lessons learned from the first 40 years of database systems into a new system built from the ground up to take advantage of modern hardware such as multi-core CPUs, fast SSDs, and cloud storage.