a curated list of database news from authoritative sources

May 11, 2026

Amazon Aurora DSQL connections: Drivers, strings, and best practices

Connecting to Amazon Aurora DSQL requires a different approach than traditional PostgreSQL databases. Instead of long-lived passwords, you use short-lived IAM authentication tokens. Instead of static endpoints, you work with distributed cluster endpoints that route connections across Availability Zones. In this post, you learn how to configure connection strings, set up drivers in Python, Java, and Node.js, and implement best practices for authentication, connection pooling, and lifecycle management with Amazon Aurora DSQL.

May 09, 2026

You need foreign keys and surrogate keys because you broke your relationships

Many developers believe they need a relational database because their data has relationships. The reasoning usually goes: "My entities are related, therefore I need foreign keys, therefore I need an RDBMS."

But the causality is actually reversed. Normalization creates the need for foreign keys—not the other way around.

This misunderstanding persists even when the builders are agents, because LLMs operate on high-dimensional similarity (vectors), which may capture correlation better than causality.

Normalisation and relations

When you normalize your domain model into Normal Forms, you decompose aggregates—cohesive groups of entities that form a natural consistency boundary—into separate, independent tables. Each table represents a single relation in the mathematical sense: a set of tuples (rows) constrained by functional dependencies on a key. That is what "relational" means in "relational model"—it refers to these mathematical relations (tables), not to the relationships between entities.

In effect, this normalization is a decomposition that removes the structural associations between entities. Co-location, direct references, and pointers that once made related data physically and logically cohesive are eliminated. The connections between entities exist only through shared attribute values, and are reconstructed at query time via joins on those columns.

In other words, normalization doesn't strengthen relationships—it dissolves them from the schema and defers their reconstruction to the query, in order to have an application-agnostic representation of data.

An example: the Order aggregate

Consider a typical e-commerce domain. An Order is a natural aggregate: it contains line items, a shipping address, and payment details. In your domain model, these are not independent things—a line item has no meaning outside its order, a shipping address belongs to that specific purchase, and the payment is authorized for that exact total.

In a document database, this aggregate stays whole:

{
  "_id": ObjectId("order_123"),
  "customer": "Alice",
  "status": "confirmed",
  "shippingAddress": {
    "street": "123 Main St",
    "city": "Springfield",
    "zip": "62704"
  },
  "items": [
    { "sku": "WIDGET-A", "name": "Blue Widget", "qty": 2, "price": 9.99 },
    { "sku": "GADGET-B", "name": "Red Gadget", "qty": 1, "price": 24.99 }
  ],
  "payment": {
    "method": "credit_card",
    "last4": "4242",
    "authorized": 44.97
  }
}

The consistency boundary is the document. A single write replaces or updates the entire order atomically. No line item can be orphaned. No shipping address can drift out of sync. The structure is the integrity.

Now normalize this into Third Normal Form:

orders (order_id PK, customer, status)
order_items (item_id PK, order_id FK, sku, name, qty, price)
shipping_addresses (address_id PK, order_id FK, street, city, zip)
payments (payment_id PK, order_id FK, method, last4, authorized)
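
Spelled out as executable DDL, the same decomposition might look like this (a PostgreSQL-flavored sketch; the column types and identity syntax are assumptions, not part of the compact schema above):

CREATE TABLE orders (
  order_id  bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  customer  text NOT NULL,
  status    text NOT NULL
);

CREATE TABLE order_items (
  item_id   bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, -- surrogate key
  order_id  bigint NOT NULL REFERENCES orders (order_id),        -- foreign key re-expressing the embedding
  sku       text NOT NULL,
  name      text NOT NULL,
  qty       integer NOT NULL,
  price     numeric(10,2) NOT NULL
);

CREATE TABLE shipping_addresses (
  address_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, -- surrogate key for a value object
  order_id   bigint NOT NULL REFERENCES orders (order_id),
  street     text NOT NULL,
  city       text NOT NULL,
  zip        text NOT NULL
);

CREATE TABLE payments (
  payment_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  order_id   bigint NOT NULL REFERENCES orders (order_id),
  method     text NOT NULL,
  last4      text NOT NULL,
  authorized numeric(10,2) NOT NULL
);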

Notice what happened:

  1. The aggregate was shattered into four independent tables with no structural cohesion.
  2. Foreign keys had to be added (order_id FK) to re-express the parent-child relationship that was previously implicit through embedding.
  3. Surrogate keys had to be invented (address_id) because value objects don't have a natural identity—they were identified by their values.
  4. Joins are now required to reconstitute what was a single read—SELECT ... FROM orders JOIN order_items JOIN shipping_addresses JOIN payments (sketched after this list).
  5. Referential integrity constraints must be declared to prevent orphan rows—a problem that could not exist when the data was embedded.
  6. Transaction scope must be explicitly managed because inserting an order now means inserting into four tables. You need BEGIN/COMMIT to ensure atomicity that was previously guaranteed by the single-document write.
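
To make points 4 and 6 concrete, here is a sketch against the schema above (order_id 123 and the literal values are purely illustrative):

-- Point 4: reconstituting the aggregate is now a multi-way join
SELECT o.customer, o.status,
       i.sku, i.name, i.qty, i.price,
       s.street, s.city, s.zip,
       p.method, p.authorized
FROM orders o
JOIN order_items i        ON i.order_id = o.order_id
JOIN shipping_addresses s ON s.order_id = o.order_id
JOIN payments p           ON p.order_id = o.order_id
WHERE o.order_id = 123;

-- Point 6: writing the aggregate is now an explicit multi-table transaction
BEGIN;
INSERT INTO orders (order_id, customer, status) VALUES (123, 'Alice', 'confirmed');
INSERT INTO order_items (order_id, sku, name, qty, price)
VALUES (123, 'WIDGET-A', 'Blue Widget', 2, 9.99),
       (123, 'GADGET-B', 'Red Gadget', 1, 24.99);
INSERT INTO shipping_addresses (order_id, street, city, zip)
VALUES (123, '123 Main St', 'Springfield', '62704');
INSERT INTO payments (order_id, method, last4, authorized)
VALUES (123, 'credit_card', '4242', 44.97);
COMMIT;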

Surrogate keys: identity invented for decomposition

This point deserves emphasis. In Domain-Driven Design, a Value Object has no identity—it is defined entirely by its attributes. A shipping address is a value object. A line item, depending on your domain, may be a dependent entity whose identity is meaningful only within its parent aggregate.

But a relational table requires every row to be uniquely identifiable by a primary key. When you extract a value object or dependent entity into its own table, you must assign it an artificial surrogate key—an auto-incremented integer or UUID that has no meaning in your domain. This surrogate exists solely because normalization demanded that the value object live independently.

The surrogate key is not modeling your domain. It is compensating for a storage decision.

This normalization has consequences for consistency and concurrency

In most RDBMS implementations, the physical model mirrors the logical one: the default unit of atomicity and locking is the row, not the original multi-entity aggregate from your domain model. Since normalization has scattered what was once a single aggregate across multiple tables, the database must provide additional mechanisms to restore consistency across those rows:

  • Referential integrity constraints (foreign keys) exist to re-enforce cross-row consistency that was inherent before decomposition, when related entities were embedded under a common root (the aggregate root).
  • Isolation levels and explicit locks exist because the natural consistency boundary (the aggregate, in Domain-Driven Design terms) no longer maps to a single lockable unit. The database must escalate from row-level locks to approximate the transactional boundary of the original aggregate root. This is implicit with the serializable isolation level (often implemented with range or predicate locks) or explicit in lower isolation levels (SELECT FOR UPDATE on the aggregate root, as sketched below).
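
A minimal sketch of that second mechanism, locking the aggregate root before touching its child rows (same illustrative schema and values as above):

BEGIN;
-- Serialize concurrent writers on the aggregate by locking its root row first
SELECT status FROM orders WHERE order_id = 123 FOR UPDATE;
UPDATE order_items SET qty = 3 WHERE order_id = 123 AND sku = 'WIDGET-A';
UPDATE payments SET authorized = 54.96 WHERE order_id = 123;
COMMIT;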

What normalization costs you—and what it buys

To be fair, normalization is not without benefits. It eliminates data redundancy, which simplifies updates to shared reference data and reduces storage when the same entity participates in many aggregates (e.g., a product referenced by thousands of order lines). It provides a flexible query model where any column can become a join path, enabling ad-hoc analytics that weren't anticipated at design time. The normalized model is also the best choice when you don't know your use cases and data distribution, because it is an abstraction that doesn't depend on access patterns and cardinalities.

But these benefits come at a structural cost:

What normalization removes      What must be added to compensate
--------------------------      --------------------------------
Embedding (co-location)         Joins at query time
Structural association          Foreign key constraints
Aggregate-level atomicity       Explicit transactions across tables
Identity through containment    Surrogate keys
Single-unit locking             Isolation levels and lock escalation

Many foreign keys in your schema are evidence of a relationship that was structurally present and had to be re-declared as a constraint after decomposition.

The relational model as a middle ground

This application-agnostic quality is not accidental—it is the relational model's primary design goal. And it positions the relational model as an intermediate representation between two extremes:

  • Document model (full business aggregates): Optimized for OLTP, where the unit of read and write aligns with the domain's consistency boundary. You read what you need in one operation, you write what belongs together atomically. The structure serves the application.
  • Columnar model (individual attributes): Optimized for analytics and ad-hoc queries, where any combination of columns across millions of rows can be scanned, filtered, and aggregated without regard to entity boundaries. The structure serves the query engine.

The relational model sits between these two. It doesn't fully optimize for either transactional aggregates or analytical scans, but it provides a general-purpose representation that can serve both reasonably well when the database engine adds physical optimizations. Rows are decomposed enough to avoid redundancy and enable flexible joins, but not so decomposed that individual attributes lose their tuple context. This is why the relational model became the default: when you don't know your access patterns, don't know your future use cases, and need a single schema to serve multiple applications, normalization into relations is the safest bet.

But "safest bet" is not the same as "best fit." When you do know your domain, your access patterns, and your consistency boundaries, that middle-ground positioning becomes over-engineering for your transactional workload and under-optimization for your analytical one.

Beyond the aggregates

Everything above describes referential integrity within a domain aggregate—the consistency boundary where entities and value objects are tightly coupled under a single root. But there are relationships between aggregates too, and this is where your software architecture determines whether the database should enforce them:

  • Domain-Driven Design decouples aggregates so that each team can operate independently within a bounded context. You may reference another aggregate by its root identity, but enforcement is asynchronous—through application events between services—always routing consistency through the aggregate root. In this architecture, cross-aggregate foreign keys in the database would create the tight coupling that DDD explicitly avoids.
  • Monolithic data models keep all entities tightly coupled. Foreign keys can be defined across the entire schema, and any application module can update any entity. In this world, in-database integrity constraints are the sole guarantor of business invariants, because no architectural boundary prevents inconsistent writes.

The choice is not purely technical—it reflects whether you trust your architecture or your database to enforce boundaries between aggregates.

When the application evolves

This architectural distinction becomes critical when new use cases emerge—and they always do.

In a relational model, evolution means adding columns, adding tables, and adding joins. The normalized schema was over-engineered for the first use case, but that over-engineering pays off as flexibility: a new query path is just a new join condition. Schema changes against existing data require migration scripts—ALTER TABLE ADD COLUMN, backfilling values, updating constraints—but the work is well-understood and the tooling is mature. New use cases accumulate within the same schema without requiring organizational change: no new services to deploy, no event streams to operate, no eventual consistency to reason about. A single team can absorb new requirements by extending the model, and each evolution is a small, incremental step.
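
As a small, hedged illustration of that migration path (the gift_message column is hypothetical, not taken from the schema above):

-- Add the column, backfill existing rows, then tighten the constraint
ALTER TABLE orders ADD COLUMN gift_message text;
UPDATE orders SET gift_message = '' WHERE gift_message IS NULL;
ALTER TABLE orders ALTER COLUMN gift_message SET NOT NULL;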

An ORM can lower the initial barrier significantly. Tools like Hibernate, ActiveRecord, or SQLAlchemy let developers define entities in application code and generate the normalized schema automatically—making prototyping and MVPs feel almost as frictionless as a schema-flexible database. But the ORM is an abstraction over decomposition, not an elimination of it. The joins, the transactions, the N+1 query problems are still there—just hidden until load reveals them. An ORM that isn't tuned (lazy loading in loops, missing fetch strategies, implicit transaction boundaries) will generate SQL that defeats the relational model's strengths. The ease of starting must be followed by the discipline of tuning.
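
For instance, lazy loading in a loop can turn one logical read into the classic N+1 pattern (hypothetical ORM-generated SQL; the literal ids are illustrative):

-- One query for the parent rows...
SELECT order_id, customer, status FROM orders WHERE customer = 'Alice';
-- ...then one query per returned order for its items
SELECT sku, name, qty, price FROM order_items WHERE order_id = 101;
SELECT sku, name, qty, price FROM order_items WHERE order_id = 102;
SELECT sku, name, qty, price FROM order_items WHERE order_id = 103;
-- A tuned fetch strategy collapses this into the single join shown earlier.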

However, as the schema grows—more tables, more joins, more constraints, more cross-cutting migrations—the monolithic model becomes harder to reason about, harder to change safely, and harder to split later when organizational growth demands it. Every new use case widens the blast radius of a schema change.

In a document model, evolution follows the bounded context. When a new use case aligns with an existing aggregate, you add fields to the document—and because documents are schema-flexible, existing data doesn't require migration. Old documents simply lack the new field, which the application handles gracefully. But when a new use case crosses aggregate boundaries or introduces a different access pattern—say, analytics across all orders, or a recommendation engine that needs to correlate items across customers—you don't contort your existing domain model to accommodate it. Instead, you create a new bounded context with its own collection, its own schema optimized for its own access pattern, and you feed it through event streaming (change streams, CDC) from the source aggregates.

This comes at a cost too: you need the infrastructure for event streaming, you must reason about eventual consistency between contexts, and you need organizational maturity to define clean boundaries. For a small team building an MVP, spinning up event pipelines or managing change stream consumers may be premature architecture.

This is a fundamental difference in how complexity grows:

  • The monolithic model absorbs complexity within a single schema. New use cases add joins and constraints. The schema becomes harder to reason about, harder to migrate, and harder to split later—but each evolution is a small step that doesn't require organizational change.
  • The domain model distributes complexity across bounded contexts. Each context remains simple—optimized for its specific use case—and the integration between them is through events. You need to recognize when a new use case doesn't fit the existing bounded context, and spawn a new collection with its own stream, rather than adding multi-document transactions or cross-collection lookups that would compromise the aggregate's integrity.

In practice, this means a document-oriented architecture scales with organizational complexity. Each team owns its bounded context, its schema, and its data. A new analytical use case doesn't require the order service to change—it subscribes to order events and materializes its own read-optimized view. The order aggregate stays clean, stays fast, and stays owned by a single team. Meanwhile, the relational model scales with functional complexity within a single team—letting a small group serve many use cases from one schema, provided they can manage its growing weight.

Dimension              Relational model                 Document model
---------              ----------------                 --------------
Getting started        More upfront schema work         Fast to prototype
Adding use cases       Extend the schema, add joins     Add fields or spawn new bounded contexts
Complexity growth      Accumulates within one schema    Distributes across contexts
Organizational impact  Larger team synchronization      Requires team boundaries
Risk at scale          Monolithic schema becomes rigid  Must commit to event-driven design

The takeaway

Relationships exist in your domain model regardless of your database choice. A document database preserves them—a document maps to an aggregate where entities and value objects are embedded, with transactions and locks scoped to that boundary. A relational database decomposes them through normalization and reconstructs them through joins, foreign keys, and transaction isolation.

The relational model earns its place as a general-purpose middle ground that scales with functional complexity—letting a single team serve many use cases from one schema without organizational change. The document model scales with organizational complexity—each team owns its bounded context, its aggregate stays clean, and new use cases spawn new contexts connected by events rather than widening a monolithic schema.

The question is not "Does my data have relationships?" (it always does). The question is: "Do I want my database to preserve my aggregate boundaries or decompose them? And when my application evolves, do I want complexity to accumulate in my schema, or to distribute across my architecture?"

May 08, 2026

The Two Abstractions of System Design: Hide or Reduce

When talking about TLA+, I keep referring to "abstraction" as the most important thing to learn. And it is about the hardest to learn as well.

But a contradiction has been bugging me. Aren't CS people already supposed to be good at abstraction? Isn't abstraction supposed to be at the root of OS, networking, software engineering? Abstract Data Types (ADTs) are a staple of every CS curriculum. So why do I (and every other formal methods/modeling person) see such a large skill gap in abstraction, and flag it as the core, make-or-break skill for modeling?

I think I've finally gotten to the root of this cognitive dissonance. There are two kinds of "abstraction" conflated under the same umbrella term.

  • Modularity abstraction: This is the traditional abstraction taught in CS curricula as ADTs, APIs, layered design, etc. It is all about encapsulation, drawing boundaries, and hiding internals.
  • Modeling abstraction: This is what I talk about when I talk about abstraction in the context of modeling. This is the same sense of abstraction mathematicians and physicists use when building models for thinking and reasoning. The goal is to find the minimal and most elegant description that preserves the property you care about. It is all about cutting away everything orthogonal to the essence of that property.

These two couldn't be further apart in terms of their goal! Let me try to explain in the next two sections.


Modularity abstraction hides. Modeling abstraction reduces.

Modularity abstraction is about interfaces that hide internals. Modeling abstraction is about behaviors, and about reducing a system to its minimal behavioral skeleton for the property you care about.

Modularity abstraction encapsulates, draws a vertical boundary, and hides the layer below. Modeling abstraction is crosscutting: it slices the system along a behavioral plane and keeps only what is absolutely relevant to the property under investigation, and even then in the form of "what", not "how". This slice usually looks nothing like the system's organization.


Modularity abstraction hides concurrency. Modeling abstraction exposes it.

Modularity abstraction is all about sealing the leaks, hence Joel Spolsky's famous post lamenting that "all abstractions are leaky". [ Note that his list is all about modularity abstraction: TCP (hide IP), string libraries (hide character arrays), file systems (hide spinning disks), virtual memory / flat address space (hide MMU and paging), SQL (hide query plans), NFS / SMB (hide the network), C++ string classes (hide char*). ] Modularity abstraction aspires to hide the interleavings and present operations as if they were atomic. Its goal is to make the module easy to use, but in doing so it forgoes exposing concurrency or efficiency opportunities.

In stark contrast, the modeling abstraction is about identifying what should leak and leveraging it! It exposes the fine-grained actions and orderings, and proves that invariants hold despite the interleavings. The payoff for this work is to harvest the maximum safe concurrency from the system.


Examples of modeling abstraction 

There is an abundance of modeling abstraction in the distributed systems field. It feels like almost all protocols are designed this way.

  • Lamport logical clocks: throw away wall-clock time, keep happens-before
  • Hybrid logical clocks: keep wall-clock and causality, throw away the rest
  • TrueTime: time as a bounded-uncertainty interval 
  • Consensus: agree on a single decision. The way Lamport designed Paxos is a masterclass in abstraction: from Consensus, to Voting, to the final protocol.
  • Linearizability (and really all consistency models): throw away replication, caching, retries
  • The log-is-the-database idea: throw away materialized state as the source of truth; keep only the ordered, append-only sequence of events.
  • MapReduce/Spark: throw away orchestration, parallelism, scheduling, and fault tolerance. Keep a DAG of deterministic transforms over partitioned data—and let the framework reconstitute the rest from that skeleton.

The Kyle Kingsbury Podcast Podcast - Episode 2 - Dave Rossi

This video is brought to you by PodPitch, the AI-powered platform for PR experts. PodPitch is the first and only software that finds, writes, and sends pitches to podcast hosts from your actual email address. Whether or not you’re already pitching podcasts, PodPitch will make sure you get podcast bookings – guaranteed. And so long as PodPitch’s AI keeps confusing me with Kyle Kingsbury, MMA fighter and famous podcast host, some of those bookings… just might be with me.

My name is Kyle Kingsbury, I am not as popular nor ruggedly handsome as Kyle Kingsbury, and I have never hosted a podcast in my life. Now I'm struggling to read books and write interesting questions, and generally making a fool of myself.

Today’s guest is Dave Rossi, Silicon Valley entrepreneur and author of the book “Alphas Die Early”.

One of the PR firms figured out what’s happening and we’ve all had a good laugh about it, but they were contractually obligated to cancel my third guest’s interview, which I was very much looking forward to. Dr. Deborah Rozman of the HeartMath Institute, if you’re out there… I’d be delighted to interview you.

Bringing pt-query-digest-Style Slow Query Analysis to PostgreSQL with pg_enhanced_query_logging

In this blog post, we are going to briefly discuss pg_enhanced_query_logging (PEQL for short), a PostgreSQL extension that produces slow query logs in the same format MySQL and Percona Server users have been feeding into pt-query-digest for years. The idea is simple: reuse the tried-and-true tools and concepts we have been using for performing full … Continued

The post Bringing pt-query-digest-Style Slow Query Analysis to PostgreSQL with pg_enhanced_query_logging appeared first on Percona.

May 07, 2026

The Kyle Kingsbury Podcast Podcast - Episode 1 - Alex Dripchak

For the last few weeks publicists have bombarded me with emails asking if I’d interview their clients on the Kyle Kingsbury podcast. While my name is Kyle Kingsbury, and I have been a guest on several podcasts, I’ve never hosted one myself. I called one of the PR firms to ask if they might have me confused with the MMA fighter of the same name, but the woman I reached was so friendly and eager to tell me about all the potential guests I might interview and, well… things just got out of hand. Now I’m doing background research, writing questions, learning Riverside, and struggling mightily to avoid swallowing my tongue.

Joining me today and being an extraordinarily good sport about it is Alex Dripchak, sales director, productivity expert, and author of the new book “Maximize”–full of time-management habits that I really should get around to trying one of these days.

Query billion-scale vectors with SQL: Integrating Amazon S3 Vectors and Aurora PostgreSQL

In this post, you’ll learn how to query Amazon S3 Vectors from Amazon Aurora PostgreSQL-Compatible Edition using standard SQL, and how to combine vector similarity results with relational filters in a single query, for example, finding the most semantically similar products and then filtering by price, stock status, or tenant in one SQL statement.

May 05, 2026

Amazon Aurora DSQL for global-scale financial transactions

In this post, we first examine why traditional approaches to distributed consistency fall short for financial workloads. We then walk through how the Amazon Aurora DSQL architecture addresses these challenges, and apply it to three production use cases: core banking, global spend management, and digital currency infrastructure. We close with implementation considerations and how to get started with the Amazon Aurora DSQL Free Tier.

PSMDB Sandbox: A Browser-Based UI for Deploying MongoDB with Terraform and Ansible

If you’ve ever wrestled with .tfvars files, juggled Ansible inventory paths, or tried to remember the exact command sequence for a MongoDB setup — this post is for you. PSMDB Sandbox is a lightweight web frontend built in Go that ships inside the Percona MongoDB Automation repository. It puts a clean browser interface on top … Continued

The post PSMDB Sandbox: A Browser-Based UI for Deploying MongoDB with Terraform and Ansible appeared first on Percona.

Realtime or ETL? How to choose the right tool

Both Supabase Realtime and Supabase ETL read changes from your Postgres database using logical replication. But they solve very different problems. Here is how to pick the right one.

On benchmarking

Benchmarking is hard. Done wrong it is very misleading, and unfortunately it is frequently done wrong. Let's explore how not to make silly mistakes.

Transparency in benchmarking

Transparent database benchmarks help customers make better decisions and push vendors to build better products.

May 04, 2026

JStack by Command String

This has been bugging me for years: you often run a JVM by a shell script wrapper, then want to jstack it to figure out what it’s doing, but can’t figure out what PID to ask for. Running jps gives remarkably unhelpful output, especially for tools like Leiningen. I wrote a hacky little Ruby script to dig into the process tree of everything matching a given pattern, find any JVMs those processes spawned, and hit the highest numbered one (presumably the last one started) with jstack. This is definitely wrong (PID rollover!) but it’s been surprisingly useful for debugging.

$ jstack+ lein test
jstack 1044647 (java -classpath /home/aphyr/hegel/test ...)

2026-05-04 10:23:07
Full thread dump OpenJDK 64-Bit Server VM (21.0.10+7-Ubuntu-124.04 mixed mode, emulated-client, sharing):

Threads class SMR info:
_java_thread_list=0x0000767f88002030, length=12, elements={
0x000076803801b140, 0x0000768038277460, 0x00007680382789e0, 0x000076803827a2a0,
0x000076803827b8f0, 0x000076803827cea0, 0x000076803827e6b0, 0x000076803828abe0,
0x0000768038296220, 0x000076803938d470, 0x0000768039392cd0, 0x0000767f88000fe0
}

"main" #1 [1044652] prio=5 os_prio=0 cpu=2403.65ms elapsed=2386.66s tid=0x000076803801b140 nid=1044652 waiting on condition  [0x000076803e7fa000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21.0.10/Native Method)
...

Here’s the script:

#!/usr/bin/env ruby

require 'set' # Enumerable#to_set (used below) isn't loaded by default on older Rubies

# Runs `jstack` on the highest-numbered java process invoked by the command
# matching the given pattern. This is really helpful when you want to find out
# why `lein test` is stuck, because lein is a shell script, which launches
# java, which launches *a new* java process.

# Parses a list of newline-separated pids as integers
def parse(pid_string)
  pid_string.split(/\n/).map { |p| p.to_i }
end

# Recursively expands pid into [pid, child1, child2, ...]
def expand(pid)
  children = parse `pgrep -P #{pid}`
  children.reduce([pid]) do |all, child|
    all.concat expand(child)
  end
end

roots = parse %x[pgrep -f '#{ARGV.join(' ')}']
expanded = roots.flat_map { |p| expand p }

# Just JVMs please
jvms = parse(`jps`).to_set
fav = expanded.filter do |pid|
  jvms.include? pid
end.max

unless fav
  puts "No JVMs with that command line in their ancestry"
  exit 1
end

# Show the command line so we can be sure of what process it was
cmd = `cat /proc/#{fav}/cmdline | tr '\\000' ' '`
puts "jstack #{fav} (#{cmd})"
puts
exec "jstack #{fav}"

I Know Kung Fu

You might find this hard to believe, but AI has become kind of a thing around here. Bennie published a post on our Build with AI competition last week, in which he shared that I was lucky enough to land the second place prize. Genuinely flattered, and a real thank you to Peter F, PZ, … Continued

The post I Know Kung Fu appeared first on Percona.

Curious case of PXC node that refused to start due to SSL

In this blog, I am going to share a real-world debugging case study where a routine Percona XtraDB Cluster node restart led to an unexpected failure. I will walk through what we observed, what we checked, and how we ultimately identified the root cause. Let’s see how the maintenance goes. It was supposed to be … Continued

The post Curious case of PXC node that refused to start due to SSL appeared first on Percona.

Building Query Analysis and Insights Dashboard in PMM

Percona Monitoring and Management is a great open source database monitoring, observability, and management tool. Query Analytics is one of its prominent features, which DBAs actively use to trace incidents and identify query performance issues. We all know and love the Query Analytics (QAN) dashboard… It’s the first place we look when an incident alert fires … Continued

The post Building Query Analysis and Insights Dashboard in PMM appeared first on Percona.