April 02, 2025
Essential MongoDB Backup Best Practices for Data Protection
Hype v. Reality: 5 AI features that actually work in production
Realtime: Broadcast from Database
Keeping Tabs on What's New in Supabase Studio
Fast Compilation or Fast Execution: Just Have Both!
At CedarDB, we believe in high-performance computing. And one of the easiest tricks to high performance is to do only what you really need to do. So you want to reduce data movement or reduce unnecessary branches. But how can you be sure you are only doing what you need to do? Especially when you want to be flexible and configurable at runtime.
Basically, you want to tune the code based on as much information as you can get beforehand. If you're a programmer tasked with optimizing a snippet, you'd probably start by gathering as much information as you can about the data set. Then you would look at data alignment and loops to remove any unnecessary checks and pre-calculate or cache as much as possible. Finally, you would tailor the code as much as possible to your challenge so that it does only one thing, but does it extremely well. Unfortunately, as developers, we cannot just write code that does one thing, because there are users. Users demand interactivity, and in our case, they want to decide how they access the data. So isn't there a way to have your cake and eat it too?
April 01, 2025
Choosing the Right Storage for PostgreSQL on Kubernetes: A Benchmark Analysis
Kubernetes Sidecar Containers Explained: Benefits, Use Cases, and What’s New
Is MongoDB Truly Open Source? A Critical Look at SSPL
I've operated petabyte-scale ClickHouse® clusters for 5 years
PostgreSQL Security: A Comprehensive Guide to Hardening Your Database
How to Run MongoDB on Kubernetes: Solutions, Pros and Cons
Edge Functions: Deploy from the Dashboard + Deno 2.1
Automatic Embeddings in Postgres
March 31, 2025
Optimizing cost savings: The advantage of Amazon Aurora over self-managed open source databases
The World Is Too Messy for SQL to Work
Satya Nadella, Microsoft's CEO, discusses AI agents and quantum technology in an interview on the South Park Commons YouTube channel. At 29:36, he says the world is too messy for SQL to work, which might be surprising, as SQL serves as an API for relational databases, built to store various types of data, including the most complex.
This reminds me of some challenges I encountered with the relational model. I will share a real-life example from many years ago when I worked as a data architect at an airline company before the rise of document databases to resolve such issues.
Flights have an airport of origin and an airport of destination. This is a typical case where one table has multiple aliases in a query because of the two different roles. In a relational database, the table of flights has two columns with a foreign key that references a table of airports: origin airport and destination airport. There are already many questions that can start a debate with relational data modelers:
Should we use singular ("flight") or plural ("flights") for table names?
Should we name the columns with the role first or last: destination airport ("DST_ARP") or airport of destination ("ARP_DST")?
What are the keys? The foreign key can use the IATA code, avoiding joins in many queries, or a surrogate key in case the IATA code changes, at the cost of two additional joins in those queries.
What are the tables? How many tables are needed to store this information, normalized for easier updates or denormalized for simpler queries?
In a document model, we can store all relevant information together. An "airport" collection will contain one document per airport, detailing its IATA code, name, city, latitude, longitude, country code, and country name, all neatly organized in sub-documents.
Here is an example of a document that describes Amsterdam's airport:
{
"_id": "AMS",
"name": "Amsterdam Airport Schiphol",
"location": {
"city": {
"name": "Amsterdam",
"country": {
"name": "Netherlands",
"code": "NL"
}
},
"coordinates": {
"latitude": 52.308613,
"longitude": 4.763889
}
},
"iata": "AMS",
"icao": "EHAM",
"is_hub_for_airline": ["KLM", '"Air France"],
"timezone": "Europe/Amsterdam",
"type": "international",
"runways": [
{
"name": "18R/36L",
"length_meters": 3800
},
{
"name": "06/24",
"length_meters": 3500
}
],
"website": "https://www.schiphol.nl/"
}
In a relational database, because of the One-to-Many relationships and to obey first normal form, the same data is scattered across multiple tables:
INSERT INTO Countries (country_id, name, code)
VALUES (1, 'Netherlands', 'NL');
INSERT INTO Cities (city_id, name, country_id)
VALUES (1, 'Amsterdam', 1);
INSERT INTO Airports (airport_id, name, city_id, iata, icao, timezone, type, website)
VALUES ('AMS', 'Amsterdam Airport Schiphol', 1, 'AMS', 'EHAM', 'Europe/Amsterdam', 'international', 'https://www.schiphol.nl/');
INSERT INTO Coordinates (airport_id, latitude, longitude)
VALUES ('AMS', 52.308613, 4.763889);
INSERT INTO Runways (airport_id, name, length_meters)
VALUES ('AMS', '18R/36L', 3800), ('AMS', '06/24', 3500);
A relational table has a fixed structure that must be declared before inserting data:
CREATE TABLE Countries (
country_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
code CHAR(2) NOT NULL UNIQUE
);
CREATE TABLE Cities (
city_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
country_id INT,
FOREIGN KEY (country_id) REFERENCES Countries(country_id)
);
CREATE TABLE Airports (
airport_id CHAR(3) PRIMARY KEY,
name VARCHAR(100) NOT NULL,
city_id INT,
iata CHAR(3) NOT NULL UNIQUE,
icao CHAR(4) NOT NULL UNIQUE,
timezone VARCHAR(50) NOT NULL,
type VARCHAR(50),
website VARCHAR(100),
FOREIGN KEY (city_id) REFERENCES Cities(city_id)
);
CREATE TABLE Coordinates (
airport_id CHAR(3),
latitude DECIMAL(9,6) NOT NULL,
longitude DECIMAL(9,6) NOT NULL,
PRIMARY KEY (airport_id),
FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);
CREATE TABLE Runways (
runway_id INT PRIMARY KEY AUTO_INCREMENT,
airport_id CHAR(3),
name VARCHAR(10) NOT NULL,
length_meters INT NOT NULL,
FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);
Declaring foreign keys is essential to ensure referential integrity, which isn't required when modeling in a single document since the relationship is inherent in the data structure.
With numerous tables, an ERD diagram helps illustrate the structure.
Welcome to Our Messy World
This data model works well as long as each airport is assigned to a single city, which is usually true. You can develop numerous applications based on this model until a significant issue emerges: your flight schedule features a new international airport, EuroAirport Basel Mulhouse Freiburg, which this model cannot accommodate. This airport is physically in France, but you can drive from it to Basel without crossing any border, so it is effectively also in Switzerland.
This airport is unique: bi-national, serving Basel in Switzerland and Mulhouse in France. It has multiple IATA codes: EAP for the whole airport, plus BSL and MLH for the Swiss and French sides. This necessitates a structural change to the relational data model because the relationship between airports and cities must become Many-to-Many:
CREATE TABLE Countries (
country_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
code CHAR(2) NOT NULL UNIQUE
);
CREATE TABLE Cities (
city_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
country_id INT,
FOREIGN KEY (country_id) REFERENCES Countries(country_id)
);
CREATE TABLE Airports (
airport_id CHAR(3) PRIMARY KEY,
name VARCHAR(100) NOT NULL,
timezone VARCHAR(50) NOT NULL,
type VARCHAR(50),
website VARCHAR(100)
);
CREATE TABLE Airport_Codes (
airport_code_id INT PRIMARY KEY AUTO_INCREMENT,
airport_id CHAR(3),
code_type VARCHAR(3) CHECK (code_type IN ('EAP', 'BSL', 'MLH')),
code CHAR(3),
FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);
CREATE TABLE City_Airport (
city_airport_id INT PRIMARY KEY AUTO_INCREMENT,
city_id INT,
airport_id CHAR(3),
FOREIGN KEY (city_id) REFERENCES Cities(city_id),
FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);
CREATE TABLE Coordinates (
airport_id CHAR(3),
latitude DECIMAL(9,6) NOT NULL,
longitude DECIMAL(9,6) NOT NULL,
PRIMARY KEY (airport_id),
FOREIGN KEY (airport_id) REFERENCES Airports(airport_id)
);
Inserting a single airport's data now requires more statements per transaction:
-- Insert into Countries
INSERT INTO Countries (country_id, name, code) VALUES
(1, 'Switzerland', 'CH'),
(2, 'France', 'FR');
-- Insert into Cities
INSERT INTO Cities (city_id, name, country_id) VALUES
(1, 'Basel', 1),
(2, 'Mulhouse', 2);
-- Insert into Airports
INSERT INTO Airports (airport_id, name, timezone, type, website) VALUES
('EAP', 'EuroAirport Basel Mulhouse Freiburg', 'Europe/Zurich', 'international', 'https://www.euroairport.com/');
-- Insert into Airport_Codes
INSERT INTO Airport_Codes (airport_id, code_type, code) VALUES
('EAP', 'EAP', 'EAP'),
('EAP', 'BSL', 'BSL'),
('EAP', 'MLH', 'MLH');
-- Insert into City_Airport
INSERT INTO City_Airport (city_id, airport_id) VALUES
(1, 'EAP'),
(2, 'EAP');
-- Insert into Coordinates
INSERT INTO Coordinates (airport_id, latitude, longitude) VALUES
('EAP', 47.59, 7.529167);
The Entity-Relationship Diagram (ERD) now shows an additional table for the Many-to-Many relationship between airports and cities.
All other airports must use the new model, even if they have only one city per airport. Having a model that distinguishes the two possibilities of One-to-Many and Many-to-Many would be even more complex, with inheritance leading to more tables.
Modifying such a data model affects all existing code, requiring all queries to be rewritten with a new join. In practice, this modification is too risky, leading to the implementation of some workarounds for legacy code. For instance, I've seen an application with a dummy city for Basel+Mulhouse and a dummy country for Switzerland+France. Other applications ignored the multi-city property and added a BSL-MLH dummy flight with a zero-kilometer distance and a zero-minute flight duration, which had unexpected side effects. These workarounds leverage the existing code for single-city airports, ensuring that only the new code is impacted by the new airport. However, the data model no longer reflects the real world, and the ideal normalized model that is supposed to avoid anomalies is broken.
This is where the world is too messy for SQL: a normalized data model must include all possibilities, with one model shared by all objects, covering the nominal case as well as the diversity of edge cases. Everything can be represented with normalized tables, but in practice, applications break this model for more flexibility. Many applications that must adapt to different domains opted for an Entity-Attribute-Value model to gain agility, but at the price of much worse performance, because the SQL query planner of a relational database is not built to optimize such key-value data access. If you have never encountered such a schema, or wonder what the problem is, I recommend this old story by Tim Gorman about a single "DATA" table model: bad carma.
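If you have never seen an Entity-Attribute-Value schema, here is a minimal sketch of the generic model and of why reading it back is painful for the optimizer (table, column, and value names are illustrative, not taken from the original story):
-- Minimal EAV sketch: every attribute of every entity becomes a row
CREATE TABLE entity_attribute_value (
    entity_id  VARCHAR(50)   NOT NULL,  -- e.g. 'airport:AMS'
    attribute  VARCHAR(100)  NOT NULL,  -- e.g. 'name', 'iata', 'latitude'
    value      VARCHAR(4000),           -- everything is stored as text
    PRIMARY KEY (entity_id, attribute)
);
-- Reconstructing one logical row needs one self-join per attribute,
-- which the query planner cannot optimize as well as typed, dedicated columns
SELECT n.value AS name, i.value AS iata
FROM entity_attribute_value n
JOIN entity_attribute_value i
  ON i.entity_id = n.entity_id
 AND i.attribute = 'iata'
WHERE n.attribute = 'name'
  AND n.entity_id = 'airport:AMS';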
Document Data Modeling Flexibility
With a document model, two documents of the same collection do not strictly need to adhere to the same structure. I can keep the existing structure and code for the single-city airports and use an array instead of a single value when inserting the bi-national airport:
{
"_id": "EAP",
"name": "EuroAirport Basel Mulhouse Freiburg",
"location": {
"city": [
{
"name": "Basel",
"country": {
"name": "Switzerland",
"code": "CH"
}
},
{
"name": "Mulhouse",
"country": {
"name": "France",
"code": "FR"
}
}
],
"coordinates": {
"latitude": 47.59,
"longitude": 7.529167
}
},
"iata": [ "EAP", "BSL", "MLH" ],
"timezone": "Europe/Zurich",
"type": "international",
"website": "https://www.euroairport.com/"
}
Such a format may require additional code when the business logic differs between single and multi-city airports. Still, there's no need to change the existing code or reformat the existing data. The additional complexity due to the "messy" world has a scope limited to its new context and does not undermine the entirety of the existing data model.
Examples like this can create the impression that the relational model is too rigid. Although normalization was designed to facilitate model evolution, and SQL databases have reduced the downtime for some ALTER statements, it still restricts all data to a single model. Minor changes in business can break existing code and significantly increase the complexity of the data model. A normalized relational model makes adding a column to a table easy. However, when the keys or the association cardinality changes, the relational model requires impactful modification with lots of code to modify and test, and there is downtime during the schema migration.
Document Data Modeling and MongoDB API
Document data modeling allows you to add documents even if their structure diverges from the collection's existing documents. The structure can still be constrained with schema validation, but with more agility than SQL relational table definitions provide.
Here is an example of schema validation that accepts the two document types - note the "oneOf" sections:
db.runCommand({
collMod: "airports",
validator: {
$jsonSchema: {
bsonType: "object",
required: ["_id", "name", "location", "iata", "timezone", "type", "website"],
properties: {
...
"location": {
bsonType: "object",
required: ["city", "coordinates"],
properties: {
"city": {
oneOf: [
{
bsonType: "object",
required: ["name", "country"],
...
},
{
bsonType: "array",
items: {
bsonType: "object",
required: ["name", "country"],
...
}
}
]
},
...
}
},
"iata": {
oneOf: [
{
bsonType: "string",
description: "must be a string or an array of strings and is required"
},
{
bsonType: "array",
items: {
bsonType: "string"
}
}
]
},
...
}
}
}
});
With schema validation, you can combine the schema flexibility with schema integrity constraints. It is recommended that schema validation be added so that the MongoDB database can guarantee that the documents have the fields expected by the application code.
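Assuming the collection already has the validator above, its strictness can also be tuned with the standard validationLevel and validationAction options; a minimal sketch:
// Sketch: keep the validator, choose how strictly it is enforced
db.runCommand({
  collMod: "airports",
  validationLevel: "moderate",  // validate inserts and updates of already-valid documents ("strict" checks all updates)
  validationAction: "error"     // reject non-conforming writes ("warn" would only log them)
});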
Additionally, the MongoDB API prioritizes developer experience as the application evolves. Such a polymorphic schema may not need different code. I have inserted the two documents described above into an "airports" collection, and the code to find a country's airports is the same whether the city attribute is a single sub-document or an array:
mdb> // find airports in Netherlands
mdb> db.airports.find({
"location.city.country.name": "Netherlands"
},{
"name": 1
});
[ { _id: 'AMS', name: 'Amsterdam Airport Schiphol' } ]
mdb> // find airports in France
db.airports.find({
"location.city.country.name": "France"
},{
"name": 1
});
[ { _id: 'EAP', name: 'EuroAirport Basel Mulhouse Freiburg' } ]
mdb> // find airports in Switzerland
db.airports.find({
"location.city.country.name": "Switzerland"
},{
"name": 1
});
[ { _id: 'EAP', name: 'EuroAirport Basel Mulhouse Freiburg' } ]
mdb>
With MongoDB, you don't have to change the query. You also do not have to change the index definition. I have created an index on the country name:
mdb> db.airports.createIndex(
{ "location.city.country.name": 1 }
);
location.city.country.name_1
When the collection contains only single-city airports, the execution plan for the find query shows:
stage: 'IXSCAN',
keyPattern: { 'location.city.country.name': 1 },
indexName: 'location.city.country.name_1',
isMultiKey: false,
multiKeyPaths: { 'location.city.country.name': [] },
indexBounds: {
'location.city.country.name': [ '["Netherlands", "Netherlands"]' ]
}
Once a multi-city airport has been added, it switches to a multi-key index:
stage: 'IXSCAN',
keyPattern: { 'location.city.country.name': 1 },
indexName: 'location.city.country.name_1',
isMultiKey: true,
multiKeyPaths: { 'location.city.country.name': [ 'location.city' ] },
indexBounds: {
'location.city.country.name': [ '["Netherlands", "Netherlands"]' ]
}
MongoDB multi-key indexes simply index more than one value per document when the cardinality changes from One-to-One to One-to-Many.
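For reference, the plan fragments shown above can be reproduced with explain(); a minimal sketch against the same collection and query:
// Sketch: inspect the winning plan (including isMultiKey) for the country query
db.airports.find(
  { "location.city.country.name": "Netherlands" },
  { "name": 1 }
).explain("executionStats").queryPlanner.winningPlan;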
You may think that using a JSON data type in an SQL database gives you the same advantages. However, query planning and indexing differ significantly between a document database like MongoDB and a relational database like PostgreSQL storing documents in a JSON column.
PostgreSQL's JSONB is a data type that enables schema-less data storage. Using an expression index, you can create an index for a specific key in the document, provided it is known in advance (schema-on-write) and is a single key without an array in the path. However, you must use an inverted index when the path includes an array. GIN indexes, optimized for indexing unknown sub-paths (schema-on-read), are inverted indexes and can be used with arrays (see Indexing JSON in PostgreSQL). Still, they do not provide all the features of regular indexes. For example, the result from a GIN index in PostgreSQL is not sorted and cannot be used to optimize an ORDER BY ... LIMIT. When the schema-on-write involves a one-to-many relationship, storing it in normalized columns in a separate table and joining it at the time of the query is often recommended.
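To make the comparison concrete, here is a sketch of both PostgreSQL options on a hypothetical airports_json table (names are illustrative):
-- Hypothetical JSONB storage of the same airport documents
CREATE TABLE airports_json (
    id  CHAR(3) PRIMARY KEY,
    doc JSONB   NOT NULL
);
-- Expression index: only for a known, single-valued path (schema-on-write)
CREATE INDEX airports_country_idx
    ON airports_json ((doc #>> '{location,city,country,name}'));
-- GIN index: handles arrays and unknown sub-paths (schema-on-read),
-- but returns unordered results and cannot serve an ORDER BY ... LIMIT
CREATE INDEX airports_doc_gin
    ON airports_json USING GIN (doc jsonb_path_ops);
-- Containment query that can use the GIN index (matches the array form of "city")
SELECT id
FROM airports_json
WHERE doc @> '{"location": {"city": [{"country": {"name": "France"}}]}}';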
MongoDB is a database that natively indexes document paths, so you don't need this double data modeling approach. Regular indexes can be used on the schema-on-write part of the document, even when the path includes an array, and provide the full Equality, Sort, Range experience. Those indexes follow the flexibility of the document model, where not only new fields can be added, but a One-to-One can become a One-to-Many without adding more complexity. For the schema-on-read part, where the structure is unknown before the query, you can use the same indexes with wildcards to index all keys.
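For that schema-on-read part, a wildcard index is a one-liner; a sketch on the same collection:
// Sketch: index every current and future field under location
db.airports.createIndex({ "location.$**": 1 });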
Satya Nadella's interview was about AI foundation models. MongoDB has search indexes for schemaless text or embeddings, but regular indexes are still used for exact searches on structured schemas. When the world is too messy for SQL, MongoDB excels by avoiding a one-size-fits-all data model. Instead, it provides a domain-driven model tailored to specific access patterns, accommodating new complex outlier business rules without complicating the existing structure, code, and index definition.
Introducing the Supabase UI Library
Supabase Auth: Bring Your Own Clerk
March 29, 2025
Postgres Language Server: Initial Release
March 28, 2025
Transparent Data Encryption for PostgreSQL Release Candidate is Here!
Real-time inventory management with lambda architecture
March 27, 2025
Postgres 17.4 vs sysbench on a large server, revisited part 2
I recently shared two posts (here and here) with results for sysbench on a large server using Postgres versions 10 through 17. In general there were several large improvements over time, but also one small regression that arrived in Postgres 11.x. This blog post provides more details on the problem using results from Postgres 10.23, 11.22 and 17.4.
The regression occurs starting in Postgres 11.22 because Postgres is more likely to use a bitmap index scan starting in 11.x. I have yet to learn why or how to prevent that.
Index scan vs bitmap index scan
Experts gave me great advice based on a few flamegraphs that I shared. It looked like Postgres started to use bitmap index scans more often starting in Postgres 11. Upstream sysbench does not collect query plans for the SQL that it uses, so I modified my fork to do that.
While the explain output helps, it would help even more were there a feature in Postgres to provide optimizer traces, similar to what MySQL has, to understand why some query plans are selected and others are rejected. Another feature request is to improve the official Postgres docs to provide more detail on 1) the difference between index scan and bitmap index scan and 2) the difference between lossy and non-lossy bitmap index scans (AFAIK, one needs recheck).
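In the meantime, the planner's choice can at least be observed and toggled per session for experimentation; a sketch on a sysbench-style table (table name and range bounds are illustrative):
-- Show the chosen plan for a sum range query
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1000 AND 9000;
-- Disable bitmap scans for this session to compare against a plain index scan
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1000 AND 9000;
RESET enable_bitmapscan;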
The problem microbenchmark
The microbenchmarks in question do a range scan with aggregation and have a name like read-only_range=$X where X has been 10, 100 and 1000 but here I use X in 10, 100, 1000, 2000, 4000, 8000, 16000 and 32000. The value of X is the length of the range scan. These are run via oltp_read_only.lua and use these SQL statements.
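For context, those statements look roughly like the following (paraphrased from sysbench's oltp Lua templates; the exact table and parameter names may differ):
-- Approximate shapes of the sysbench read-only range queries (the range spans X ids)
SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ?;                       -- simple range
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN ? AND ?;                  -- sum range
SELECT c FROM sbtest1 WHERE id BETWEEN ? AND ? ORDER BY c;            -- order range
SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN ? AND ? ORDER BY c;   -- distinct range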
Build, configuration, hardware, benchmark
These are described in my previous post. But a few things have changed for this report:
- I only tested Postgres versions 10.23, 11.22 and 17.4
- I ran the read-only_range=$X microbenchmark for X in 10, 100, 1000, 2000, 4000, 8000, 16000 and 32000. Previously I ran it for X in 10, 100, 10000
- I added new tests that each run only one of the SQL statements used by oltp_read_only.lua. All of the Lua scripts are here.
The numbers below are relative QPS, computed as (QPS for $version) / (QPS for Postgres 10.23), with one column each for Postgres 11.22 and 17.4.
A summary of the results is:
- the regression always occurs for the range=8000 microbenchmark and is less likely for other values for range=X. Note that range=X means the queries scan X rows.
- from the four tests that each run only one of the SQL aggregation queries used by the standard read-only microbenchmark (read-only-distinct, read-only-order, read-only-simple and read-only-sum) the regression occurs in read-only-simple and read-only-sum but not in the others and the regression is the largest in read-only-sum. The SQL statements are here for read-only-distinct, read-only-order, read-only-simple and read-only-sum.
1.03 1.06 read-only_range=10
1.02 1.04 read-only_range=100
1.00 1.00 read-only_range=1000
1.00 1.02 read-only_range=2000
1.00 0.99 read-only_range=4000
0.95 0.95 read-only_range=8000
0.99 1.02 read-only_range=16000
1.00 1.04 read-only_range=32000
0.98 0.97 read-only-distinct_range=8000
0.98 0.99 read-only-order_range=8000
0.94 0.90 read-only-simple_range=8000
0.89 0.83 read-only-sum_range=8000
work_mem 2 MB
1.02 1.06 read-only_range=10
1.01 1.02 read-only_range=100
1.00 0.99 read-only_range=1000
0.99 1.01 read-only_range=2000
0.98 0.96 read-only_range=4000
0.94 0.93 read-only_range=8000
0.99 1.00 read-only_range=16000
0.98 1.02 read-only_range=32000
0.97 0.96 read-only-distinct_range=8000
0.96 0.98 read-only-order_range=8000
0.92 0.89 read-only-simple_range=8000
0.86 0.80 read-only-sum_range=8000
1.02 1.06 read-only_range=10
1.02 1.03 read-only_range=100
1.01 1.01 read-only_range=1000
1.00 1.02 read-only_range=2000
1.00 1.00 read-only_range=4000
0.96 0.94 read-only_range=8000
1.13 1.24 read-only_range=16000
1.06 1.11 read-only_range=32000
0.98 0.97 read-only-distinct_range=8000
0.98 0.99 read-only-order_range=8000
0.94 0.90 read-only-simple_range=8000
0.89 0.82 read-only-sum_range=8000
1.03 1.07 read-only_range=10
1.02 1.03 read-only_range=100
1.00 0.99 read-only_range=1000
1.00 1.01 read-only_range=2000
0.99 0.97 read-only_range=4000
0.95 0.94 read-only_range=8000
0.99 1.00 read-only_range=16000
0.99 1.03 read-only_range=32000
0.97 0.96 read-only-distinct_range=8000
0.97 0.98 read-only-order_range=8000
0.92 0.89 read-only-simple_range=8000
0.87 0.81 read-only-sum_range=8000
1.04 1.08 read-only_range=10
1.03 1.05 read-only_range=100
1.00 1.00 read-only_range=1000
1.00 1.02 read-only_range=2000
0.99 0.97 read-only_range=4000
0.94 0.94 read-only_range=8000
0.98 1.00 read-only_range=16000
0.99 1.03 read-only_range=32000
0.97 0.96 read-only-distinct_range=8000
0.97 0.99 read-only-order_range=8000
0.92 0.90 read-only-simple_range=8000
0.86 0.80 read-only-sum_range=8000
1.02 1.06 read-only_range=10
1.01 1.03 read-only_range=100
1.00 1.00 read-only_range=1000
0.99 1.02 read-only_range=2000
1.00 0.97 read-only_range=4000
0.95 0.94 read-only_range=8000
0.99 1.01 read-only_range=16000
0.99 1.04 read-only_range=32000
0.97 0.96 read-only-distinct_range=8000
0.97 0.99 read-only-order_range=8000
0.94 0.90 read-only-simple_range=8000
0.89 0.83 read-only-sum_range=8000
Instrument your LLM calls to analyze AI costs and usage
Things that go wrong with disk IO
There are a few interesting scenarios to keep in mind when writing applications (not just databases!) that read and write files, particularly in transactional contexts where you actually care about the integrity of the data and when you are editing data in place (versus copy-on-write for example).
We'll go into a few scenarios where the following can happen:
- Data you write never actually makes it to disk
- Data you write gets sent to the wrong location on disk
- Data you read is read from the wrong location on disk
- Data gets corrupted on disk
And how real-world data systems think about these scenarios. (They don't always think of them at all!)
If I don't say otherwise I'm talking about behavior on Linux.
The post is largely a review of two papers: Parity Lost and Parity Regained and Characteristics, Impact, and Tolerance of Partial Disk Failures. These two papers also go into the frequency of some of the issues discussed here. These behaviors actually happen in real life!
Thank you to Alex Miller and George Xanthakis for reviewing a draft of this post.
Terminology
Some of these terms are reused in different contexts, and sometimes they are reused because they effectively mean the same thing in a certain configuration. But I'll try to be explicit to avoid confusion.
Sector
The smallest amount of data that can be read and written atomically by hardware. It used to be 512 bytes, but on modern disks it is often 4KiB. There doesn't seem to be any safe assumption you can make about sector size, despite file system defaults (see below). You must check your disks to know.
Block (filesystem/kernel view)
Typically set to the sector size since only this block size is atomic. The default in ext4 is 4KiB.
Page (kernel view)
A disk block that is in memory. Any reads/writes less than the size of a block will read the entire block into kernel memory even if less than that amount is sent back to userland.
Page (database/application view)
The smallest amount of data the system (database, application, etc.) chooses to act on, when it's read or written or held in memory. The page size is some multiple of the filesystem/kernel block size (including the multiple being 1). SQLite's default page size is 4KiB. MySQL's default page size is 16KiB. Postgres's default page size is 8KiB.
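If you want to check these sizes on a particular machine instead of trusting defaults, something like the following works (device, partition, and database names are placeholders):
blockdev --getss /dev/sda                  # logical sector size reported by the disk
blockdev --getpbsz /dev/sda                # physical block size reported by the disk
tune2fs -l /dev/sda1 | grep 'Block size'   # ext4 filesystem block size
sqlite3 my.db 'PRAGMA page_size;'          # SQLite page size
psql -c 'SHOW block_size;'                 # Postgres page size
mysql -e 'SELECT @@innodb_page_size;'      # InnoDB page size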
Things that go wrong
The data didn't reach disk
By default, file writes succeed when the data is copied into kernel memory (buffered IO). The man page for write(2) says:
A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.
If you don't call fsync on Linux the data isn't necessarily durably on disk, and if the system crashes or restarts before the disk writes the data to non-volatile storage, you may lose data.
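As a minimal C sketch of "call fsync(2) after you are done writing all your data" (durable_write is an illustrative helper; real code would also retry short writes and handle EINTR):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int durable_write(const char *path, const char *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return -1; }
    ssize_t n = write(fd, buf, len);            /* only copies into the page cache */
    if (n < 0 || (size_t)n != len) { perror("write"); close(fd); return -1; }
    if (fsync(fd) != 0) {                       /* not durable until this succeeds */
        perror("fsync");
        close(fd);
        return -1;
    }
    return close(fd);
}

int main(void) {
    const char *msg = "hello, durable world\n";
    return durable_write("demo.txt", msg, strlen(msg)) == 0 ? 0 : 1;
}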
With O_DIRECT, file writes succeed when the data is copied to at least the disk cache. Alternatively you could open the file with O_DIRECT|O_SYNC (or O_DIRECT|O_DSYNC) and forgo fsync calls.
fsync on macOS is a no-op.
If you're confused, read Userland Disk I/O.
Postgres, SQLite, MongoDB, MySQL fsync data before considering a transaction successful by default. RocksDB does not.
The data was fsynced but fsync failed
fsync isn't guaranteed to succeed. And when it fails you can't tell which write failed. It may not even be a failure of a write to a file that your process opened:
Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible.
Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).
Don't be 2018-era Postgres.
The only way to have known which exact write failed would be to open the file with O_DIRECT|O_SYNC (or O_DIRECT|O_DSYNC), though this is not the only way to handle fsync failures.
The data was corrupted
If you don't checksum your data on write and check the checksum on read (as well as periodic scrubbing a la ZFS) you will never be aware if and when the data gets corrupted and you will have to restore (who knows how far back in time) from backups if and when you notice.
ZFS, MongoDB (WiredTiger), MySQL (InnoDB), and RocksDB checksum data by default. Postgres and SQLite do not (though databases created from Postgres 18+ will).
You should probably turn on checksums on any system that supports it, regardless of the default.
The data was partially written
Only when the page size you write = block size of your filesystem = sector size of your disk is a write guaranteed to be atomic. If you need to write multiple sectors of data atomically there is the risk that some sectors are written and then the system crashes or restarts. This behavior is called torn writes or torn pages.
Postgres, SQLite, and MySQL (InnoDB) handle torn writes. Torn writes are by definition not relevant to immutable storage systems like RocksDB (and other LSM Tree or Copy-on-Write systems like MongoDB (WiredTiger)) unless writes (that update metadata) span sectors.
If your system duplicates all writes, as MySQL (InnoDB) does with its doublewrite buffer (or as you can do with data=journal in ext4), you may also not have to worry about torn writes. On the other hand, this amplifies writes 2x.
The data didn't reach disk, part 2
Sometimes fsync succeeds but the data isn't actually on disk because the disk is lying. This behavior is called lost writes or phantom writes. You can be resilient to phantom writes by always reading back what you wrote (expensive) or versioning what you wrote.
Databases and file systems generally do not seem to handle this situation.
The data was written to the wrong place, read from the wrong place
If you aren't including where data is supposed to be on disk as part of the checksum or page itself, you risk being unaware that you wrote data to the wrong place or that you read from the wrong place. This is called misdirected writes/reads.
Databases and file systems generally do not seem to handle this situation.
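A sketch of what handling it could look like: give each page a header carrying the block number it was written to plus a checksum, and verify both on read. The struct and toy checksum below are illustrative; real systems use CRC32C or similar and record the expected location when the page is written.
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

struct page {
    uint64_t block_no;   /* where this page believes it lives on disk */
    uint32_t checksum;   /* covers the payload, mixed with block_no */
    uint8_t  payload[PAGE_SIZE - sizeof(uint64_t) - sizeof(uint32_t)];
};

/* toy FNV-1a over the payload, mixed with the block number (illustration only) */
uint32_t page_checksum(const struct page *p) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(p->payload); i++)
        h = (h ^ p->payload[i]) * 16777619u;
    return h ^ (uint32_t)p->block_no ^ (uint32_t)(p->block_no >> 32);
}

/* 0 = ok, -1 = misdirected read/write, -2 = corrupted page */
int verify_page(const struct page *p, uint64_t expected_block_no) {
    if (p->block_no != expected_block_no) return -1;
    if (page_checksum(p) != p->checksum)  return -2;
    return 0;
}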
Further reading
In increasing levels of paranoia (laudatory) follow ZFS, Andrea and Remzi Arpaci-Dusseau, and TigerBeetle.
March 26, 2025
Up and running with Apache OFBiz and Amazon Aurora DSQL
MongoDB Vector Search Index, with local Atlas and Ollama
For this demo, I'm using the Eurovision songs available on Kaggle, which contain lyrics in their original language and translated into English.
Sample data
I downloaded and uncompressed the files:
wget -c -O eurovision-song-lyrics.zip https://www.kaggle.com/api/v1/datasets/download/minitree/eurovision-song-lyrics
unzip -o eurovision-song-lyrics.zip
rm -f eurovision-song-lyrics.zip
MongoDB Atlas and Shell
I install MongoDB Atlas CLI and start a local instance:
curl https://fastdl.mongodb.org/mongocli/mongodb-atlas-cli_1.41.1_linux_arm64.tar.gz |
tar -xzvf - &&
alias atlas=$PWD/mongodb-atlas-cli_1.41.1_linux_arm64/bin/atlas
atlas deployments setup atlas --type local --port 27017 --force
This runs MongoDB Atlas in a Docker container:
I also installed MongoDB Shell to connect and run JavaScript and Node.js:
curl https://downloads.mongodb.com/compass/mongosh-2.4.2-linux-arm64.tgz |
tar -xzvf - &&
alias mongosh=$PWD/mongosh-2.4.2-linux-arm64/bin/mongosh
Ollama Large Language Model
I need a local LLM model to generate embeddings from lyrics. To avoid relying on external services, I will use Ollama with the nomic-embed-text model, which I install locally:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text
npm install ollama
mongosh
I've installed the Ollama module for Node.js and started MongoDB Shell.
Load data into MongoDB
I load the files into a eurovision collection:
const fs = require('fs/promises');
async function loadJsonToMongoDB() {
const fileContent = await fs.readFile('eurovision-lyrics-2023.json', 'utf8');
const jsonData = JSON.parse(fileContent);
const documents = Object.values(jsonData);
const result = await db.eurovision.insertMany(documents);
}
db.eurovision.drop();
loadJsonToMongoDB();
db.eurovision.countDocuments();
Generate embeddings
I update the MongoDB collection to add embeddings generated from the lyrics with Ollama:
const ollama = require("ollama"); // Ollama Node.js client
// Calculate embeddings
async function calculateEmbeddings(collection) {
try {
// Process each document and update the embedding
const cursor = collection.find();
let counter = 0;
for await (const doc of cursor) {
// Call the embedding API
const data = {
model: 'nomic-embed-text',
prompt: doc["Lyrics translation"]
};
const { embedding } = await ollama.default.embeddings(data);
// Update the document with the new embedding
await collection.updateOne(
{ _id: doc._id },
{ $set: { embedding: embedding } }
);
counter++;
console.log(`Added embeddings for ${doc.Year} ${doc.Country}`);
}
} catch (error) {
console.error('Error:', error);
}
}
calculateEmbeddings(db.eurovision);
MongoDB Vector Search Index
I create a vector search index (Ollama's nomic-embed-text, like BERT, has 768 dimensions):
db.eurovision.createSearchIndex(
"vectorSearchOnLyrics",
"vectorSearch",
{
fields: [
{
type: "vector",
numDimensions: 768,
path: "embedding",
similarity: "cosine"
}
]
}
)
db.eurovision.getSearchIndexes()
Don't forget the name of the index: it is needed when querying, and a wrong index name simply results in no results.
Aggregation pipeline with vector search
Here is the function I'll use to query with a prompt, converting the prompt to a vector embedding with the same model:
const ollama = require("ollama"); // Ollama Node.js client
async function vectorSearch(collection, prompt) {
try {
// Get the embedding for the prompt
const data = {
model: 'nomic-embed-text',
prompt: prompt,
};
const { embedding } = await ollama.default.embeddings(data);
// Perform a vector search in aggregation pipeline
const results=collection.aggregate([
{
"$vectorSearch": {
"index": "vectorSearchOnLyrics",
"path": "embedding",
"queryVector": embedding,
"numCandidates": 10,
"limit": 5
}
},{
"$project": {
"Year": 1,
"Country": 1,
"Artist": 1,
"Song": 1,
"Language": 1,
"score": { "$meta": "vectorSearchScore" }
}
}
]);
// Display the result
results.forEach(doc => {
console.log(`${doc.score.toFixed(2)} ${doc.Year} ${doc.Country} ${doc.Artist} ${doc.Song} (${doc.Language})`);
});
} catch (error) {
console.error('Error during vector search:', error);
}
}
Here are a few prompts that I tried, in different languages, to find songs of which I remember only some bits:
vectorSearch(db.eurovision,
"un nino y un pájaro"
);
vectorSearch(db.eurovision,
"a wax doll singing"
);
vectorSearch(db.eurovision,
"Un chico llamado Mercy"
);
vectorSearch(db.eurovision,
"lyrics were about being unique and not like the others"
);
Here are my results; the songs I had in mind appeared in the first or second position:
Conclusion
This article discusses implementing vector search in a local MongoDB setup using Ollama and the Eurovision song lyrics dataset. We populated the database with vectorized data embedded in the documents and created a vector search index for retrieving songs based on semantic similarity.
Storing embeddings with the data, rather than in a separate database, is advantageous because embeddings are generated from document fields and can be indexed like any other field. MongoDB stores arrays natively with the document model. The next step will involve generating embeddings without moving data out of the database, by integrating Voyage AI into MongoDB.