a curated list of database news from authoritative sources

January 23, 2025

GaussDB-Global: A Geographically Distributed Database System

This paper, presented in the industry track of ICDE 2024, introduces GaussDB-Global (GlobalDB), Huawei's geographically distributed database system. GlobalDB replaces the centralized transaction management (GTM) of GaussDB with a decentralized system based on synchronized global clocks (GClock). This approach mirrors Google Spanner's TrueTime approach and its commit-wait technique, which provides externally serializable transactions by waiting out the uncertainty interval. However, GlobalDB claims compatibility with commodity hardware, avoiding the need for specialized networking infrastructure for synchronized clock distribution.

The GClock system uses GPS receivers and atomic clocks as the global time source device at each regional cluster. Each node synchronizes its clock with the global time source over TCP every 1 millisecond. Clock deviation is kept low because synchronization is achieved within 60 microseconds over a TCP round trip, and the CPU's clock drift is bounded within 200 parts per million. The GClock timestamp includes a clock time and an error bound ($T_{err}$), which accounts for network latency and clock drift. Timestamp generation in GlobalDB uses the formula $TS_{GClock} = T_{clock} + T_{err}$, where $T_{err} = T_{sync} + T_{drift}$.
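To make the arithmetic concrete, here is a toy sketch of how the error bound might be derived from the numbers quoted in the paper. The function names and units are my own, not the paper's API:

```python
# Illustrative sketch of GClock timestamp generation based on the figures
# quoted in the paper. All names here are assumptions for illustration,
# not GlobalDB's actual interface.

SYNC_RTT_US = 60          # synchronization achieved within 60 us (TCP round trip)
DRIFT_PPM = 200           # CPU clock drift bounded at 200 parts per million
SYNC_INTERVAL_US = 1_000  # each node resynchronizes every 1 ms

def t_err_us(us_since_last_sync: int) -> float:
    """T_err = T_sync + T_drift: sync uncertainty plus drift accumulated
    since the last synchronization."""
    t_sync = SYNC_RTT_US
    t_drift = us_since_last_sync * DRIFT_PPM / 1_000_000
    return t_sync + t_drift

def ts_gclock(t_clock_us: int, us_since_last_sync: int) -> float:
    """TS_GClock = T_clock + T_err."""
    return t_clock_us + t_err_us(us_since_last_sync)
```

With a 1 ms sync interval, the drift term stays tiny (0.2 us at 200 ppm), so the error bound is dominated by the sync round trip.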

Synchronizing over TCP every millisecond seems extreme, and the paper does not go into detail about how this tight synchronization is achieved or how reliable $T_{err}$ is. They make a passing reference to the FaRM paper (SIGMOD 2019), and they seem to have adapted their synchronization mechanism from that work.

Transactions in GlobalDB use the following protocol to obtain timestamps:  

  • Invocation: Wait until \( T_{clock} > TS_{GClock} \) and begin the transaction. (Single-shard queries bypass this wait by using the node’s last committed transaction timestamp.)  
  • Commit: Wait until \( T_{clock} > TS_{GClock} \) and commit.
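The wait rule in both steps is a commit-wait-style uncertainty wait. A minimal sketch, with a clock abstraction that is my own (not the paper's API):

```python
import time

def wait_out_uncertainty(ts_gclock_us: float, now_us) -> None:
    """Commit-wait sketch: spin until the local clock passes the
    timestamp's upper bound, so the chosen point is definitely in the
    past for every node. Illustrative only."""
    while now_us() <= ts_gclock_us:
        time.sleep(1e-6)  # back off briefly before re-checking

class FakeClock:
    """Deterministic clock for demonstration: advances 10 us per read."""
    def __init__(self):
        self.t = 0
    def now(self):
        self.t += 10
        return self.t
```

The fake clock makes the behavior easy to see: the wait returns only once the clock reading exceeds the timestamp's upper bound.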

GlobalDB introduces two major algorithmic components:  

  1. Seamless Transition Between Centralized and Decentralized Modes:  GlobalDB supports DUAL mode, enabling zero-downtime transitions between centralized (GTM) and decentralized (GClock) transaction management. 
  2. Flexible Asynchronous Replication: GlobalDB supports asynchronous replication with strong consistency guarantees for reads on replicas. This is achieved through a Replica Consistency Point (RCP), which ensures that all replicas provide a consistent snapshot of the database, even if they are not fully up-to-date.  

It seems like point one, transitioning between the centralized and decentralized modes, is also heavily inspired by the FaRM paper's protocol for this. I will explain this below.

The second point, the asynchronous replication scheme, raises questions about durability. If the primary crashes before logs are sent to replicas, data loss could occur. The paper does not fully address this issue, leaving it unclear how GlobalDB ensures durability in such scenarios.


So what do we get with GlobalDB? GlobalDB's use of synchronized clocks gives us decentralized transaction management, removing the need for a centralized service to order transactions. This improves throughput, especially in geo-distributed deployments. Another key benefit is that synchronized clocks enhance read performance on asynchronous local replicas by returning consistent responses from a global snapshot of the database at a given time.

OK, that was the overview. Now we can discuss the architecture and protocol details. 


Architecture

GaussDB is a **shared-nothing distributed database** consisting of:  

  • Computing Nodes (CNs): Stateless nodes that handle query parsing, planning, and coordination.  
  • Data Nodes (DNs): Host portions of tables based on hash or range partitioning. Replica DNs are placed remotely for high availability.  
  • Global Transaction Manager (GTM): A lightweight centralized service that provides timestamps for transaction invocation and commit.  

GlobalDB replaces the GTM with decentralized transaction management using GClock, improving scalability and performance, especially in geo-distributed deployments. Primary DNs continuously transmit updates to replica nodes in the form of **Redo logs**. The paper argues that asynchronous replication avoids the performance degradation of waiting for remote replicas, but it does not discuss how to deal with potential durability gaps, as the primary may crash before logs are replicated.


Clock Transition Between Centralized and Decentralized Modes

As we mentioned above, a key innovation in GlobalDB is the DUAL mode, which enables seamless transitions (in both directions) between centralized (GTM) and decentralized (GClock) transaction management without downtime. This is critical for maintaining system availability during upgrades or failures, such as when global clock synchronization fails. The DUAL mode uses a hybrid timestamp mechanism:  \( TS_{DUAL} = \max(TS_{GTM}, TS_{GClock}) + 1 \)  

A DUAL mode timestamp \( TS_{DUAL} \) is guaranteed to be larger than both the most recent GTM timestamp and the clock upper bound, which ensures monotonicity: new timestamps are always larger than any timestamp either mode has issued. Transitions require waiting for \( 2 \times T_{err} \) to prevent anomalies like stale reads due to timestamp inversion. By acting as a bridge between the two modes, DUAL mode allows the system to remain fully operational during transitions, supporting live upgrades and fault recovery while meeting enterprise SLAs.
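For concreteness, the DUAL timestamp rule and the transition wait can be sketched as follows (function names are my own, for illustration):

```python
def ts_dual(ts_gtm: int, ts_gclock: int) -> int:
    """TS_DUAL = max(TS_GTM, TS_GClock) + 1: strictly larger than the
    latest timestamp either mode could have issued, so monotonicity
    holds across the transition in either direction."""
    return max(ts_gtm, ts_gclock) + 1

def transition_wait_us(t_err_us: float) -> float:
    """Before completing a mode switch, wait 2 * T_err so that no
    in-flight timestamp from the old mode can invert ordering with
    timestamps issued in the new mode. Returns the wait duration."""
    return 2 * t_err_us
```

Whichever mode produced the larger timestamp, the +1 keeps the DUAL timestamp strictly ahead of both sources.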



Asynchronous Replication with Strong Consistency

GlobalDB employs asynchronous replication with strong consistency guarantees for reads on replicas, achieved through a Replica Consistency Point (RCP). In some ways, this reminds me of the Volume Complete LSN calculation in Amazon Aurora. The RCP is a globally consistent snapshot calculated as the minimum of the maximum commit timestamps across all replicas. This ensures that any transaction committed before the RCP is visible and consistent across the system, even if replicas are not fully up-to-date.  To maintain progress, heartbeat transactions prevent RCP stagnation on idle replicas. This approach allows GlobalDB to offer strong consistency for read-only queries on replicas, even in an asynchronous replication setup. Users can query nearby replicas for faster response times, without worrying about getting inconsistent or incorrect results.
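As a toy illustration of the RCP calculation (my own naming, not the paper's code):

```python
def replica_consistency_point(max_commit_ts_per_replica: list[int]) -> int:
    """RCP = min over replicas of each replica's maximum applied commit
    timestamp. Every transaction with commit timestamp <= RCP has been
    applied on every replica, so reads at the RCP see a consistent
    snapshot. Idle replicas advance their max commit timestamp via
    heartbeat transactions so the RCP keeps making progress."""
    return min(max_commit_ts_per_replica)
```

The minimum is what makes the snapshot safe: a replica that lags drags the RCP back rather than exposing an inconsistent read point.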

Data Definition Language (DDL) statements (such as CREATE TABLE or DROP INDEX) impose extra restrictions on using the RCP for reads. GlobalDB ensures consistency for read-only queries (ROR) by requiring that the RCP be greater than the largest DDL timestamp of each table involved in the query, ensuring compatibility with schema changes.

I think several challenges still remain: calculating RCP across regions introduces latency proportional to replica count, frequent updates can widen gaps between primary and replica timestamps, and asynchronous replication risks data loss if the primary fails before logs are replicated.


Evaluation

The paper evaluates GlobalDB on both single-region and geo-distributed clusters using TPC-C and Sysbench. The results show 14x higher read throughput and 50% higher TPC-C throughput compared to the baseline system. Geo-distributed setups achieve 91% of the throughput of a co-located cluster, despite added network latency. 

However, the evaluation has significant limitations. Tests use Linux `tc` to simulate delays, which does not capture real-world variable latency and packet loss. Moreover, the baseline is Huawei's prior centralized system, with no comparisons to other distributed databases like Spanner, CockroachDB, or YugabyteDB.



January 22, 2025

Diving deep into the new Amazon Aurora Global Database writer endpoint

On October 22, 2024, we announced the availability of the Aurora Global Database writer endpoint, a highly available and fully managed endpoint for your global database. Aurora automatically updates this endpoint to point to the current writer instance in your global cluster after a cross-Region switchover or failover, alleviating the need for application changes and simplifying routing requests to the writer instance. In this post, we dive deep into the new Global Database writer endpoint, covering its benefits and key considerations for using it with your applications.

January 20, 2025

Building a Resend analytics dashboard

Resend is a developer platform for sending transactional and marketing emails. If you don’t do much with email, well done, you’ve won in life. But if you do, Resend is probably going to save you a bunch of headaches. Once you’re set up to send with Resend, how do you know you’re “doing email right”? Resend captures events about the status of your emails - when they're sent, delivered, bounced, etc. - and offers webhooks so you can push these events elsewhere. Sending these webhooks to Tinybird’s E

January 17, 2025

Outgrowing Postgres: Handling growing data volumes

Managing terabyte-scale data in Postgres? From basic maintenance to advanced techniques like partitioning and materialized views, learn how to scale your database effectively. Get practical advice on optimizing performance and knowing when it's time to explore other options.

January 14, 2025

Use of Time in Distributed Databases (part 5): Lessons learned

This concludes our series on the use of time in distributed databases, where we explored how use of time in distributed systems evolved from a simple ordering mechanism to a sophisticated tool for coordination and performance optimization.

A key takeaway is that time serves as a shared reference frame that enables nodes to make consistent decisions without constant communication. While the AI community grapples with alignment challenges, in distributed systems we have long confronted our own fundamental alignment problem. When nodes operate independently, they essentially exist in their own temporal universes. Synchronized time provides the global reference frame that bridges these isolated worlds, allowing nodes to align their events and states coherently.

At its core, synchronized time serves as an alignment mechanism in distributed systems. As explored in Part 1, synchronized clocks enable nodes to establish "common knowledge" through a shared time reference, which is powerful for conveying the presence (not just absence) of information like lease validity without requiring complex two-way communication protocols to account for message delays and asymmetry.

This alignment manifests in several powerful ways. When nodes need to agree on a consistent snapshot of the database, time provides a natural coordination point. So a big benefit of aligning timelines is effortless MVCC consistent snapshots. Commit timestamping creates transaction ordering, and version management uses timestamps to track different versions of data in MVCC systems. Snapshot selection uses timestamps to identify consistent read points at each node independently. Many of the systems we reviewed, including Spanner, CockroachDB, Aurora Limitless and DSQL, demonstrate the power of consistent global reads using timestamps.
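A minimal sketch of a timestamp-based MVCC snapshot read, which each node can answer independently given a shared read timestamp (illustrative, not any particular system's implementation):

```python
def snapshot_read(versions: list[tuple[int, str]], read_ts: int):
    """Given (commit_ts, value) versions of a key, return the newest
    value whose commit timestamp is <= read_ts. With a globally agreed
    read_ts, every node answers consistently with no coordination."""
    visible = [(ts, v) for ts, v in versions if ts <= read_ts]
    return max(visible)[1] if visible else None
```

This is the sense in which aligned timelines make consistent snapshots effortless: the read timestamp alone picks out the snapshot at every node.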

Alignment also enables effective conflict detection. By comparing timestamps, systems can detect concurrent modifications—a foundational element of distributed transactions. This particularly benefits OCC-based systems and those using snapshot isolation, including DynamoDB, MongoDB, Aurora Limitless, and DSQL.
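A toy version of timestamp-based OCC validation looks like this (my own sketch, not any specific system's code):

```python
def occ_validate(read_set: dict[str, int], current_versions: dict[str, int]) -> bool:
    """Commit-time validation for optimistic concurrency control: the
    transaction recorded the commit timestamp of every key it read; it
    may commit only if none of those keys has been rewritten (i.e. its
    current commit timestamp is unchanged) since the read."""
    return all(current_versions.get(k) == ts for k, ts in read_set.items())
```

Comparing timestamps rather than values is what makes this cheap: a changed timestamp is proof of a concurrent modification.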

Another benefit of alignment is to serve as a fencing mechanism, effectively creating "one-way doors" that prevent stale operations from causing inconsistencies. A simple realization is leases. Aurora Limitless uses consistency leases that expire after a few seconds to prevent stale reads from zombie nodes. This is a classic example of time-based fencing: once a node's lease expires, it cannot serve reads until it obtains a new lease. Aurora DSQL employs time for fencing its adjudicator service. Instead of using Paxos for leader election, adjudicators use time ranges to fence off their authority over key ranges. When an adjudicator takes over responsibility for a key range, it establishes a time-based fence that prevents older adjudicators from making decisions about that range. These applications all stem from time's fundamental role as a shared reference frame, separating the past from the frothiness of the present.
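Time-based lease fencing boils down to a check like the following sketch; the padding by the clock error bound is the key point (names are mine, for illustration):

```python
def can_serve_read(lease_expiry_us: int, now_us: int, max_clock_err_us: int) -> bool:
    """A node serves reads only while its lease is safely valid. Padding
    the check by the maximum clock error means even a zombie node whose
    clock runs slow stops serving before the lease has truly expired."""
    return now_us + max_clock_err_us < lease_expiry_us
```

The one-way-door property comes from time's monotonicity: once the check fails, it keeps failing until a new lease is granted.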

Another benefit of alignment is speculative execution, which achieves better performance in the common case while still maintaining consistency/correctness across all scenarios. Modern systems increasingly embrace time-based speculation. Nezha introduced the Deadline-Ordered Multicast (DOM) primitive, which assigns deadline timestamps to requests and only delivers them after the deadline is reached. This creates a temporal fence ensuring consistent ordering across receivers while providing a speculation window for execution preparation. Aurora DSQL uses predictive commit timing, where it sets commit timestamps slightly in the future. This is a form of speculation that maintains read performance by ensuring reads are never blocked by writes while still preserving consistency guarantees.
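The DOM idea can be sketched in a few lines (a toy model of the ordering rule, not Nezha's implementation):

```python
import heapq

class DeadlineOrderedMulticast:
    """Toy DOM sketch: messages carry deadline timestamps; a receiver
    buffers them and releases, in timestamp order, only those whose
    deadline has passed. All synchronized receivers that apply this
    rule deliver the same messages in the same order."""
    def __init__(self):
        self.buf = []  # min-heap keyed by deadline timestamp
    def receive(self, deadline_ts: int, msg: str) -> None:
        heapq.heappush(self.buf, (deadline_ts, msg))
    def deliverable(self, now_ts: int) -> list[str]:
        out = []
        while self.buf and self.buf[0][0] <= now_ts:
            out.append(heapq.heappop(self.buf)[1])
        return out
```

The gap between arrival and deadline is the speculation window: receivers can prepare execution while being certain no earlier-deadline message can still sneak in front.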

The key in both cases is that time serves as a performance improvement without bearing the burden of correctness. Ultimately, we still need information transfer to ensure that the correct scenario played out, and there were no omitted messages, failures, etc. When our speculation fails, a post validation/recovery/cleanup mechanism kicks in. This approach introduces some metastability risk, so careful implementation is required. 


Correctness

That brings us to correctness. Barbara Liskov's 1991 principle remains fundamental: leverage time for performance optimization, not correctness. This separation of concerns appears consistently in modern systems. Google's Spanner uses TrueTime for performance optimization while relying on traditional locking and consensus for correctness. DynamoDB leverages timestamps to streamline transaction processing but doesn't depend on perfect synchronization for correctness. Similarly, Accord and Nezha employ synchronized clocks for fast-path execution while maintaining correctness through fallback mechanisms.

A critical aspect when using time is the requirement for monotonic clocks, as clock regression would break causality guarantees for reads. Hybrid Logical Clocks (HLC) provide good insurance in NTP-based systems. By combining physical clock time with the strengths of Lamport logical clocks, HLC provides this monotonicity. So rather than relying solely on physical time, most contemporary databases use HLC to merge physical and logical time and get the benefits of both.
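For reference, the HLC update rule can be written as follows (a sketch following the Kulkarni et al. HLC paper; real implementations typically pack l and c into a single 64-bit timestamp):

```python
def hlc_update(l: int, c: int, pt: int, m_l: int = -1, m_c: int = -1):
    """One HLC step. (l, c) is the local HLC: l tracks the maximum
    physical time seen so far, c breaks ties logically. pt is the local
    physical clock reading; (m_l, m_c) is an incoming message's HLC,
    or the (-1, -1) defaults for a local/send event. The returned
    timestamp never regresses, even if pt does."""
    new_l = max(l, pt, m_l)
    if new_l == l == m_l:
        new_c = max(c, m_c) + 1  # tie with both: advance past both counters
    elif new_l == l:
        new_c = c + 1            # physical time stalled: bump logical part
    elif new_l == m_l:
        new_c = m_c + 1          # message is ahead: step past its counter
    else:
        new_c = 0                # physical clock moved ahead: reset counter
    return new_l, new_c
```

The key property is visible in the cases: l never decreases, and whenever l cannot advance, c does, so (l, c) ordering always respects causality.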

Modern clocks provide dynamically bounded uncertainty, exposing lower and upper time intervals to account for imperfect synchronization. These clockbound APIs also help prevent placing unfounded confidence in a given timestamp. When using these tightly synchronized clocks with clockbound APIs, two things need to go wrong for correctness to be violated:

  • the clock/timestamp needs to go wrong, and
  • the clockbound API must still report high confidence in that wrong clock, rather than widening its bounds.
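When the bounds are honest, safety checks reduce to comparing interval edges. An illustrative sketch (names are mine):

```python
def safe_to_read(commit_latest_us: int, reader_earliest_us: int) -> bool:
    """With a clockbound API returning [earliest, latest] for 'now', a
    reader can treat a commit as definitely in the past only once its
    own earliest bound exceeds the commit timestamp's latest bound.
    This check is fooled only if the clock is wrong AND the bounds stay
    narrow -- the two-failure scenario described above."""
    return reader_earliest_us > commit_latest_us
```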

OK, let's say these HLC or clockbound safeguards failed; what could go wrong? Obviously, you could read a stale value from an MVCC system if you demand to read with a timestamp in the past. On the other hand, correctness on the write path of these distributed databases is well-preserved by locks. A full study of correctness is needed for a more comprehensive understanding, but my understanding is that all of these systems maintain data integrity even when clock synchronization falters. Performance decreases, but write-path correctness persists.


Future trends

Several patterns are emerging. There is definitely an accelerating trend of adoption of synchronized clocks in distributed databases, as our survey shows. We also see that distributed systems and databases are becoming more sophisticated about their use of time. Early systems leveraged synchronized clocks primarily for read-only transactions and snapshots, reaping low-hanging fruit. Over time, these systems evolved to tackle read-write transactions and employ more advanced techniques, and synchronized clocks take on increasingly critical roles in database design. By aligning nodes to a shared reference frame, synchronized time eliminates the need for some coordination message exchanges across nodes/partitions, and helps cut down latency and boost throughput in distributed multi-key operations.

We also see time-based speculation gaining traction in distributed databases. Maybe as another level of improvement, we will see systems automatically adjusting their use of time-based speculation depending on the smoothness of the operating conditions. The more a system relies on synchronized time for common-case optimization, the more vulnerable it becomes to failure of the common-case scenario or the degradation of clock synchronization. So we need smoother transitions between modes for speculation-based systems. 

More research is needed on the tradeoffs between time synchronization precision and isolation guarantees. Stricter isolation levels like external consistency require tighter clock synchronization and longer commit wait times. This raises important questions about the value proposition of strict-serializability. Snapshot isolation hits a really nice tradeoff in the isolation space, and it would be interesting to put more research into isolation-performance tradeoffs using synchronized clocks. 

On the infrastructure front, cloud time services like AWS TimeSync grow increasingly crucial as databases migrate to cloud environments. These services offer better precision and availability to support the use of time in distributed systems.