A curated list of database news from authoritative sources

April 04, 2025

April 03, 2025

Foreign Keys: A must in SQL, but not in a Document Database?

In relational SQL databases, foreign keys are essential for guaranteeing referential integrity. There are two main reasons to declare them, particularly regarding their cascading behavior:

  1. Enforcing Strong Relationships: SQL databases require entities to be stored in separate tables, even when they share the same lifecycle. This separation is mandated by normalization rules, particularly the first normal form for One-To-Many relationships. Foreign keys establish a dependency between these tables and ensure that changes in one table (such as deletions) are either reflected in the child table (using "ON DELETE CASCADE") or prohibited (with "ON DELETE RESTRICT"). For instance, in a blog application, you want to avoid orphaned comments that are not linked to a blog post. If the blog post is deleted, all associated comments should also be deleted, or the deletion should be declined when comments exist.
    In addition to consistency, the declaration of the foreign key is necessary for the query planner to be aware of the strong relationship between the two tables, so that it can estimate the result of the join more accurately.

  2. Consistency in Join Operations: SQL tables are often joined with inner joins, the default, where unmatched rows are discarded. This can lead to inconsistent query results depending on whether a join is involved. Foreign keys mitigate these inconsistencies by guaranteeing that a valid foreign key reference always matches a corresponding row. For example, an author may have to be removed under the GDPR "Right to Erasure", but not if that would make some blog posts disappear from inner-join results. The application must handle it differently, reassigning the posts to an anonymous author entry before the deletion. Another option is to always use an outer join in queries, but this limits the possible join optimizations.
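As a minimal sketch of the two cascading behaviors for the blog example (table and column names are illustrative, not from the article):

```sql
-- Comments share the post's lifecycle: deleting a post deletes them.
CREATE TABLE comments (
    comment_id INT PRIMARY KEY,
    post_id    INT NOT NULL,
    FOREIGN KEY (post_id) REFERENCES posts(post_id)
        ON DELETE CASCADE
);

-- Posts must not lose their author: the deletion is declined instead.
ALTER TABLE posts
    ADD FOREIGN KEY (author_id) REFERENCES authors(author_id)
        ON DELETE RESTRICT;
```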

There are other reasons to enforce referential integrity, but they involve application logic and cannot be reduced to a simple database declaration, unless you accept business logic in triggers. For example, if you rename a blog category, do you want all blog posts to reflect the new name? Should the old name remain for existing posts, and the new category only apply to new posts? You might want to prevent the removal of a category that is still in use, or allow it while removing the category from all posts. If it was the main category of a post, you may need to designate another category as the main one, but not pick a random one. These decisions depend on the business reason for the renaming: was it a simple typo, or is the category being replaced or split into two?

Such complex scenarios cannot be simply managed by the database through generic ON DELETE or ON UPDATE commands. The business transaction that removes a category would first analyze the consequences and make the necessary changes. Typically, the category is logically deleted, meaning it can no longer be used. However, joins will still show results with the old name until the posts in that category are reassigned. The category can later be physically deleted, when there are no orphaned records.
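Such a two-phase removal could be sketched as follows, under the assumption of a soft-delete column and illustrative table names:

```sql
-- Phase 1: logical deletion, so the category can no longer be used.
UPDATE categories
   SET deleted_at = now()
 WHERE category_id = 42;

-- Phase 2 (later): physical deletion, once no post references it.
DELETE FROM categories c
 WHERE c.category_id = 42
   AND NOT EXISTS (SELECT 1 FROM posts p
                    WHERE p.category_id = c.category_id);
```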

What About MongoDB?

The primary difference between a relational data model and a document data model is that the former is designed for a central database without knowing the future use cases, while the latter is tailored to specific domains with known application access patterns.
A document model is application-centric, meaning the application itself implements a logic that maintains the integrity, including validating lists of values or existing references before insertion, and safe preparation before deletion or update.

In the two cases mentioned above, when referential integrity does not involve additional application logic but is part of the data model, MongoDB does not require foreign keys for enforcement:

  1. Enforcing Strong Relationships is inherent in the document model, as entities sharing the same lifecycle are typically embedded within a single document. Since MongoDB can store and index sub-documents without restrictions, the separation of a One-to-Many relationship into different documents is unnecessary. Sub-documents cannot become orphaned because they are owned by their parent document.

  2. Consistency in Join Operations is guaranteed through the $lookup operation, which functions similarly to a left outer join. This ensures that records from the driving collection are not removed even if their lookup references do not match.
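A minimal sketch of such a lookup, behaving like a left outer join: documents from the driving collection with no match are kept, with an empty array. Collection and field names ("posts", "comments", "post_id") are illustrative.

```javascript
// A $lookup stage joining posts with their comments.
const pipeline = [
  {
    $lookup: {
      from: "comments",        // the looked-up collection
      localField: "_id",       // field on the driving collection
      foreignField: "post_id", // field on the looked-up collection
      as: "comments"           // output array (empty when nothing matches)
    }
  }
];
// Usage in mongosh: db.posts.aggregate(pipeline)
// A post with no comments still appears, with comments: []
```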

To illustrate other cases, consider a Many-to-One relationship, such as a product catalog in an order-entry system. A key advantage of normalization is that changing a product name requires only a single row update in the reference table, reflecting across all queries that join the product table. However, this can lead to inconsistencies outside the database. For example, a customer may receive an email at order completion or a printed bill displaying the old product name, which could mislead them if a different product name is displayed on the website.

Instead of deleting or renaming a product, a relational database typically creates a new version, while previous orders keep referencing the earlier one. In a document model, the product information associated with the order, as seen in confirmation emails or on bills, is stored directly in the document. While this may look like denormalization, the duplication is driven by business logic (what was sent or printed must not change) rather than solely by performance.
Typically, additional fields are included as an extended reference, rather than embedding the entire product document. A separate collection of products still exists, allowing updates without cascading effects, since necessary details are copied to the order document itself. Conceptually, this resembles having two entities for products, because of the temporality: the current catalog for new orders and a snapshot of the catalog at the time of order, embedded within the order.
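Such an extended reference could look like this sketch, where the order copies the product fields that appear on bills at order time; names and values are illustrative, not from the article:

```javascript
// An order embedding a snapshot of the product fields it needs,
// plus a reference back to the product catalog.
const order = {
  _id: "order-1001",
  lines: [
    {
      product_id: "prod-42",     // reference to the products collection
      name: "Espresso Machine",  // snapshot: the name that was billed
      unit_price: 249.99,        // snapshot: the price that was billed
      quantity: 1
    }
  ]
};
// A later catalog update, e.g. renaming "prod-42", leaves
// order.lines[0].name unchanged: the bill stays as printed.
```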

I did not cover Many-to-Many relationships, as they are transformed into two One-to-Many relationships in SQL with an additional association table. With document data modeling, this works the same way as a reference or extended reference, since MongoDB accommodates an array of references. For example, a new blog post with multiple authors requires a complex migration in SQL but fits seamlessly into the same model with the same indexes in MongoDB. Another example can be found in The World Is Too Messy for SQL to Work.

Comparison

In conclusion, MongoDB does not lack foreign keys. It simply manages relationships differently than relational SQL databases. While SQL relies on foreign keys for referential integrity, MongoDB employs an application-centric approach: data embedding, which is not limited by the normal forms, and update logic that cascades changes through an aggregation pipeline. This strategy effectively maintains document relationships, ensuring consistency without the need for foreign key constraints.
Although it may require more code, avoiding unnecessary foreign keys facilitates sharding and horizontal scaling with minimal performance impact. Ultimately, with MongoDB, maintaining database integrity during reference table updates is a shared responsibility between the database and the application, contrasting SQL's independent in-database declarations that are use-case agnostic.

Migration

When migrating from PostgreSQL to MongoDB, provided that the relational data modeling was done correctly, consider the following:

  • ON DELETE CASCADE indicates a shared lifecycle and should be transformed into an embedded sub-document.
  • ON DELETE SET NULL/DEFAULT represents an independent lifecycle. Use an extended reference to copy relevant fields to the child at insert time instead of using joins during queries.
  • ON DELETE RESTRICT/NO ACTION requires application logic before deletion. In MongoDB, you can manage consistency and performance with logical deletions prior to physical ones, and utilize an aggregation pipeline to execute them.
  • ON UPDATE actions indicate a natural key was used, and any key exposed to the user might be updated, but it's preferable to utilize an immutable _id for references.
  • Primary Keys that include the Foreign Keys suggest a Many-to-Many association. Convert this into an array of references on the side where the size is predictable. This approach eliminates the need for an additional array on the opposite side because the multi-key index on the array serves for the navigation from the other side.
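The last point could be sketched like this, with illustrative collection and field names:

```javascript
// A Many-to-Many relationship stored as an array of references,
// on the side where the size is predictable.
const post = {
  _id: "post-7",
  title: "Multi-author post",
  author_ids: ["alice", "bob"]  // references to an authors collection
};
// A multi-key index on the array serves navigation from the other side.
// In mongosh: db.posts.createIndex({ author_ids: 1 })
// then db.posts.find({ author_ids: "alice" }) finds this post.
const indexSpec = { author_ids: 1 };
```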

We made an open source LLM Performance Tracker

You can't just build AI features; you have to operate them in production, and that means observability. Here's an open source tool to watch your LLMs in real time.

April 02, 2025

Fast Compilation or Fast Execution: Just Have Both!

At CedarDB, we believe in high-performance computing. And one of the easiest tricks to high performance is to do only what you really need to do. So you want to reduce data movement or reduce unnecessary branches. But how can you be sure you are only doing what you need to do? Especially when you want to be flexible and configurable at runtime.

Basically, you want to tune the code based on as much information as you can get beforehand. If you’re a programmer tasked with optimizing a snippet, you’d probably start by gathering as much information as you can about the data set. Then you would look at data alignment and loops to remove any unnecessary checks and pre-calculate or cache as much as possible. Finally, the code is tailored as much as possible to your challenge so that it does only one thing, but does it extremely well. Unfortunately, as developers, we cannot just write code that does one thing because there are users. Users demand interactivity, and in our case, they want to decide how they access the data. So isn’t there a way to have your cake and eat it too?

April 01, 2025

Choosing the Right Storage for PostgreSQL on Kubernetes: A Benchmark Analysis

As more organizations move their PostgreSQL databases onto Kubernetes, a common question arises: Which storage solution best handles its demands? Picking the right option is critical, directly impacting performance, reliability, and scalability. For stateful workloads like PostgreSQL, storage must offer high availability and safeguard data integrity, even under intense, high-volume conditions. To address these concerns, […]

Kubernetes Sidecar Containers Explained: Benefits, Use Cases, and What’s New

Kubernetes is becoming a popular choice for running containerized applications. While the core idea is to have a single container running the application in a Pod, there are many cases where one or more containers need to run alongside the application container, such as containers for capturing logs, metrics, etc. This approach is typically referred […]

March 31, 2025

The World Is Too Messy for SQL to Work

Satya Nadella, Microsoft's CEO, discusses AI agents and quantum technology in an interview on the South Park Commons YouTube channel. At 29:36, he says the world is too messy for SQL to work, which might be surprising, as SQL serves as an API for relational databases, built to store various types of data, including the most complex.

This reminds me of some challenges I encountered with the relational model. I will share a real-life example from many years ago when I worked as a data architect at an airline company before the rise of document databases to resolve such issues.

Flights have an airport of origin and an airport of destination. This is a typical case where one table has multiple aliases in a query because of the two different roles. In a relational database, the table of flights has two columns with a foreign key that references a table of airports: origin airport and destination airport. There are already many questions that can start a debate with relational data modelers:

  • Should we use singular ("flight") or plural ("flights") for table names?

  • Should we name the columns with the role first or last: destination airport ("DST_ARP") or airport of destination ("ARP_DST")?

  • What are the keys? The foreign key can use the IATA code to avoid joins in many queries or a surrogate key in case the IATA code changes, but with two additional joins for the queries

  • What are the tables? How many tables are needed to store this information, normalized for easier updates or denormalized for simpler queries?

In a document model, we can store all relevant information together. An "airport" collection will contain one document per airport, detailing its IATA code, name, city, latitude, longitude, country code, and country name, all neatly organized in sub-documents.

Here is an example of a document that describes Amsterdam's airport:

{
  "_id": "AMS",
  "name": "Amsterdam Airport Schiphol",
  "location": {
    "city": {
      "name": "Amsterdam",
      "country": {
        "name": "Netherlands",
        "code": "NL"
      }
    },
    "coordinates": {
      "latitude": 52.308613,
      "longitude": 4.763889
    }
  },
  "iata": "AMS",
  "icao": "EHAM",
  "is_hub_for_airline": ["KLM", "Air France"],
  "timezone": "Europe/Amsterdam",
  "type": "international",
  "runways": [
    {
      "name": "18R/36L",
      "length_meters": 3800
    },
    {
      "name": "06/24",
      "length_meters": 3500
    }
  ],
  "website": "https://www.schiphol.nl/"
}

In a relational database, because of the One-To-Many relationships and to obey the first normal form, the same data is scattered across multiple tables:

INSERT INTO Countries (country_id, name, code)
 VALUES (1, 'Netherlands', 'NL');

INSERT INTO Cities (city_id, name, country_id) 
 VALUES (1, 'Amsterdam', 1);

INSERT INTO Airports (airport_id, name, city_id, iata, icao, timezone, type, website)
 VALUES ('AMS', 'Amsterdam Airport Schiphol', 1, 'AMS', 'EHAM', 'Europe/Amsterdam', 'international', 'https://www.schiphol.nl/');

INSERT INTO Coordinates (airport_id, latitude, longitude)
 VALUES ('AMS', 52.308613, 4.763889);

INSERT INTO Runways (airport_id, name, length_meters)
VALUES ('AMS', '18R/36L', 3800), ('AMS', '06/24', 3500);

A relational table has a fixed structure that must be declared before inserting data:

CREATE TABLE Countries (
    country_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    code CHAR(2) NOT NULL UNIQUE
);

CREATE TABLE Cities (
    city_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    country_id INT,
    FOREIGN KEY (country_id) REFERENCES Countries(country_id)
);

CREATE TABLE Airports (
    airport_id CHAR(3) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    city_id INT,
    iata CHAR(3) NOT NULL UNIQUE,
    icao CHAR(4) NOT NULL UNIQUE,
    timezone VARCHAR(50) NOT NULL,
    type VARCHAR(50),
    website VARCHAR(100),
    FOREIGN KEY (city_id) REFERENCES Cities(city_id)
);

CREATE TABLE Coordinates (
    airport_id CHAR(3),
    latitude DECIMAL(9,6) NOT NULL,
    longitude DECIMAL(9,6) NOT NULL,
    PRIMARY KEY (airport_id),
    FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);

CREATE TABLE Runways (
    runway_id INT PRIMARY KEY AUTO_INCREMENT,
    airport_id CHAR(3),
    name VARCHAR(10) NOT NULL,
    length_meters INT NOT NULL,
    FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);

Declaring foreign keys is essential to ensure referential integrity, which isn't required when modeling in a single document since the relationship is inherent in the data structure.

With numerous tables, an entity-relationship diagram (ERD) helps illustrate the structure:

Welcome to Our Messy World

This data model works well as long as each airport belongs to a single city, which is usually the case. You can develop numerous applications based on this model until a significant issue emerges: your flight schedule features a new international airport, EuroAirport Basel Mulhouse Freiburg, which this model cannot accommodate. This airport is physically in France, but you can drive to Basel without crossing any border, so it is also in Switzerland.

This airport is unique, bi-national, serving Basel in Switzerland and Mulhouse in France. It has multiple IATA codes: EAP, as well as MLH and BSL for each country. This necessitates a structural change to the relational data model because the relationship between airports and cities must be a Many-to-Many:

CREATE TABLE Countries (
    country_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    code CHAR(2) NOT NULL UNIQUE
);

CREATE TABLE Cities (
    city_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    country_id INT,
    FOREIGN KEY (country_id) REFERENCES Countries(country_id)
);

CREATE TABLE Airports (
    airport_id CHAR(3) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    timezone VARCHAR(50) NOT NULL,
    type VARCHAR(50),
    website VARCHAR(100)
);

CREATE TABLE Airport_Codes (
    airport_code_id INT PRIMARY KEY AUTO_INCREMENT,
    airport_id CHAR(3),
    code_type VARCHAR(3) CHECK (code_type IN ('EAP', 'BSL', 'MLH')),
    code CHAR(3),
    FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);

CREATE TABLE City_Airport (
    city_airport_id INT PRIMARY KEY AUTO_INCREMENT,
    city_id INT,
    airport_id CHAR(3),
    FOREIGN KEY (city_id) REFERENCES Cities(city_id),
    FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);

CREATE TABLE Coordinates (
    airport_id CHAR(3),
    latitude DECIMAL(9,6) NOT NULL,
    longitude DECIMAL(9,6) NOT NULL,
    PRIMARY KEY (airport_id),
    FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);

Inserting a single airport now requires more statements per transaction:

-- Insert into Countries
INSERT INTO Countries (country_id, name, code) VALUES 
(1, 'Switzerland', 'CH'),
(2, 'France', 'FR');

-- Insert into Cities
INSERT INTO Cities (city_id, name, country_id) VALUES 
(1, 'Basel', 1),
(2, 'Mulhouse', 2);

-- Insert into Airports
INSERT INTO Airports (airport_id, name, timezone, type, website) VALUES 
('EAP', 'EuroAirport Basel Mulhouse Freiburg', 'Europe/Zurich', 'international', 'https://www.euroairport.com/');

-- Insert into Airport_Codes
INSERT INTO Airport_Codes (airport_id, code_type, code) VALUES 
('EAP', 'EAP', 'EAP'),
('EAP', 'BSL', 'BSL'),
('EAP', 'MLH', 'MLH');

-- Insert into City_Airport
INSERT INTO City_Airport (city_id, airport_id) VALUES 
(1, 'EAP'),
(2, 'EAP');

-- Insert into Coordinates
INSERT INTO Coordinates (airport_id, latitude, longitude) VALUES 
('EAP', 47.59, 7.529167);

The Entity-Relationship Diagram (ERD) shows an additional table for the Many-to-Many relationship between airports and cities:

All other airports must use the new model, even if they have only one city per airport. Having a model that distinguishes the two possibilities of One-to-Many and Many-to-Many would be even more complex, with inheritance leading to more tables.

Modifying such a data model affects all existing code, requiring all queries to be rewritten with a new join. In practice, this modification is too risky, leading to the implementation of some workarounds for legacy code. For instance, I've seen an application with a dummy city for Basel+Mulhouse and a dummy country for Switzerland+France. Other applications ignored the multi-city property and added a BSL-MLH dummy flight with a zero-kilometer distance and a zero-minute flight duration, which had unexpected side effects. These workarounds leverage the existing code for single-city airports, ensuring that only the new code is impacted by the new airport. However, the data model no longer reflects the real world, and the ideal normalized model that is supposed to avoid anomalies is broken.

This is where the world is too messy for SQL: a normalized data model must include all possibilities, with one model shared by all objects, including the nominal case and the diversity of edge cases. Everything can be represented with normalized tables, but in practice, applications break this model for more flexibility. Many applications that adapt to different domains opted for an Entity-Attribute-Value model to get more agility, but at the price of the worst performance because the SQL query planner of a relational database is not built to optimize such key-value data access. If you have never encountered such schema or wonder what the problem is, I recommend this old story by Tim Gorman about a single "DATA" table model: bad carma.

Document Data Modeling Flexibility

With a document model, two documents of the same collection do not strictly need to adhere to the same structure. I can keep the existing structure and code for the single-city airports and use an array instead of a single value when inserting the bi-national airport:

{
  "_id": "EAP",
  "name": "EuroAirport Basel Mulhouse Freiburg",
  "location": {
    "city": [
      {
        "name": "Basel",
        "country": {
          "name": "Switzerland",
          "code": "CH"
        }
      },
      {
        "name": "Mulhouse",
        "country": {
          "name": "France",
          "code": "FR"
        }
      }
    ],
    "coordinates": {
      "latitude": 47.59,
      "longitude": 7.529167
    }
  },
  "iata": [ "EAP", "BSL", "MLH" ],
  "timezone": "Europe/Zurich",
  "type": "international",
  "website": "https://www.euroairport.com/"
}

Such a format may require additional code when the business logic differs between single and multi-city airports. Still, there's no need to change the existing code and reformat the existing data. The additional complexity due to the "messy" world has a scope limited to its new context and does not undermine the entirety of the existing data model.
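When the code does need to treat both shapes uniformly, a small normalization keeps the change local. A sketch, using the field names from the documents above:

```javascript
// One helper that reads "location.city" whether it is a single
// sub-document (existing airports) or an array (the bi-national airport).
function cityNames(airport) {
  // [].concat(x) wraps a plain value and leaves an array as-is
  return [].concat(airport.location.city).map(city => city.name);
}

const ams = { location: { city: { name: "Amsterdam" } } };
const eap = { location: { city: [{ name: "Basel" }, { name: "Mulhouse" }] } };

console.log(cityNames(ams)); // [ 'Amsterdam' ]
console.log(cityNames(eap)); // [ 'Basel', 'Mulhouse' ]
```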

Examples like this can create the impression that the relational model is too rigid. Although normalization was designed to facilitate model evolution, and SQL databases have reduced the downtime for some ALTER statements, it still restricts all data to a single model. Minor changes in business can break existing code and significantly increase the complexity of the data model. A normalized relational model makes adding a column to a table easy. However, when the keys or the association cardinality changes, the relational model requires impactful modification with lots of code to modify and test, and there is downtime during the schema migration.

Document Data Modeling and MongoDB API

Document data modeling allows you to add documents even if their structure diverges from the collection's existing documents. The structure can still be constrained with schema validation, but with more agility than SQL relational table definitions provide.

Here is an example of schema validation that accepts the two document types - note the "oneOf" sections:

db.runCommand({
  collMod: "airports",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["_id", "name", "location", "iata", "timezone", "type", "website"],
      properties: { 
...       
        "location": {
          bsonType: "object",
          required: ["city", "coordinates"],
          properties: {
            "city": {
              oneOf: [
                { 
                  bsonType: "object",
                  required: ["name", "country"],
...               
                },  
                {   
                  bsonType: "array",
                  items: {
                    bsonType: "object",
                    required: ["name", "country"],
...         
                  }
                }
              ]
            },
...       
          } 
        },    
        "iata": {
          oneOf: [
            {
              bsonType: "string",
              description: "must be a string or an array of strings and is required"
            },  
            { 
              bsonType: "array",
              items: {
                bsonType: "string"
              }
            }
          ]
        },
...
      }
    }
  }
});

With schema validation, you can combine schema flexibility with integrity constraints. It is recommended to add schema validation so that the MongoDB database can guarantee that documents have the fields expected by the application code.

Additionally, the MongoDB API prioritizes developer experience during application evolution. Such a polymorphic schema may not even need different code. I have inserted the two documents described above into an "airports" collection, and the code to find a country's airports is the same whether the city attribute is a single sub-document or an array:

mdb> // find airports in Netherlands

mdb> db.airports.find({
  "location.city.country.name": "Netherlands"
},{
  "name": 1
});

[ { _id: 'AMS', name: 'Amsterdam Airport Schiphol' } ]


mdb> // find airports in France

db.airports.find({
  "location.city.country.name": "France"
},{
  "name": 1
});

[ { _id: 'EAP', name: 'EuroAirport Basel Mulhouse Freiburg' } ]


mdb> // find airports in Switzerland

db.airports.find({
  "location.city.country.name": "Switzerland"
},{
  "name": 1
});

[ { _id: 'EAP', name: 'EuroAirport Basel Mulhouse Freiburg' } ]
mdb> 

With MongoDB, you don't have to change the query. You also do not have to change the index definition. I have created an index on the country name:

mdb> db.airports.createIndex(
           { "location.city.country.name": 1 }
);

location.city.country.name_1

When the collection contains only single-city airports, the execution plan for the find query shows:

          stage: 'IXSCAN',
          keyPattern: { 'location.city.country.name': 1 },
          indexName: 'location.city.country.name_1',
          isMultiKey: false,
          multiKeyPaths: { 'location.city.country.name': [] },
          indexBounds: {
            'location.city.country.name': [ '["Netherlands", "Netherlands"]' ]
          }

Once a multi-city airport has been added, it switches to a multi-key index:

          stage: 'IXSCAN',
          keyPattern: { 'location.city.country.name': 1 },
          indexName: 'location.city.country.name_1',
          isMultiKey: true,
          multiKeyPaths: { 'location.city.country.name': [ 'location.city' ] },
          indexBounds: {
            'location.city.country.name': [ '["Netherlands", "Netherlands"]' ]
          }

MongoDB multi-key indexes simply index more than one value per document when the cardinality changes from One-to-One to One-to-Many.

You may think that using a JSON data type in an SQL database gives you the same advantages. However, query planning and indexing are significant differences between document databases like MongoDB and storing documents in a relational database like PostgreSQL.

PostgreSQL's JSONB is a data type that enables schema-less data storage. Using an expression index, you can create an index for a specific key in the document, provided it is known in advance (schema-on-write) and is a single key without an array in the path. However, you must use an inverted index when the path includes an array. GIN indexes, optimized for indexing unknown sub-paths (schema-on-read), are inverted indexes and can be used with arrays (see Indexing JSON in PostgreSQL). Still, they do not provide all the features of regular indexes. For example, the result from a GIN index in PostgreSQL is not sorted and cannot be used to optimize an ORDER BY ... LIMIT. When the schema-on-write involves a one-to-many relationship, storing it in normalized columns in a separate table and joining it at the time of the query is often recommended.
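For example, in PostgreSQL, assuming an airports table with a doc JSONB column (illustrative names), the two index kinds look like this:

```sql
-- Expression index: one known scalar path, valid only while "city"
-- is a single object (no array on the path).
CREATE INDEX airports_country_idx
    ON airports ((doc #>> '{location,city,country,name}'));

-- GIN inverted index: handles arrays and unknown sub-paths, but its
-- results are unsorted, so it cannot serve ORDER BY ... LIMIT.
CREATE INDEX airports_doc_gin
    ON airports USING GIN (doc jsonb_path_ops);
```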

MongoDB is a database that natively indexes document paths, so you don't need this double data modeling approach. Regular indexes can be used on the schema-on-write part of the document, even when the path includes an array, and provide the full Equality, Sort, Range experience. Those indexes follow the flexibility of the document model, where not only new fields can be added, but a One-to-One can become a One-to-Many without adding more complexity. For the schema-on-read part, where the structure is unknown before the query, you can use the same indexes with wildcards to index all keys.
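Side by side, the article's exact-path index and a wildcard variant for the schema-on-read part of the document (mongosh createIndex specs, collection name from the article):

```javascript
// Regular index on a known path: multi-key capable, supports
// Equality, Sort, and Range.
const exactSpec    = { "location.city.country.name": 1 };
// Wildcard index: covers every key under "location", for paths
// unknown before the query.
const wildcardSpec = { "location.$**": 1 };
// In mongosh: db.airports.createIndex(exactSpec)
//             db.airports.createIndex(wildcardSpec)
```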

Satya Nadella's interview was about AI foundation models. MongoDB has search indexes for schemaless text or embeddings, but regular indexes are still used for exact searches on structured schemas. When the world is too messy for SQL, MongoDB excels by avoiding a one-size-fits-all data model. Instead, it provides a domain-driven model tailored to specific access patterns, accommodating new complex outlier business rules without complicating the existing structure, code, and index definition.

March 29, 2025

March 28, 2025

Transparent Data Encryption for PostgreSQL Release Candidate is Here!

PostgreSQL is, without a doubt, one of the most popular open source databases in the world. Why? Well, there are many reasons, but if I had to pick just one, I’d say it’s extensibility. PostgreSQL isn’t just a database; it’s an ecosystem of extensions that can transform it to tackle any upcoming challenges. By enabling […]