From Relational Algebra to Document Semantics
The relational model was designed not to mirror application structures, but to optimize reasoning about data dependencies independently of access patterns. By restricting data to relations in First Normal Form (1NF) and enforcing closure, relational algebra makes query behavior mathematically derivable, so correctness, equivalence, and optimization follow from the model itself rather than implementation-specific rules.
Document databases take the opposite approach: they optimize representation for applications, network, and storage. Domain concepts such as ownership, lifecycle, and cardinality are embedded directly in the data’s shape. Without assuming value atomicity, algebra no longer applies, so semantics must be defined explicitly. This is what MongoDB did when defining query behavior in its document database.
Applications rarely start from relations
Most applications do not model data as relations. They model aggregates. It's obvious in Domain-Driven Design and Object-Oriented analysis, but it was true long before object‑oriented languages. Data structures in applications have always been hierarchical, with sub-structures and repeating groups:
- Record-based models include nested tables, like in the COBOL data division where relationships are expressed through containment and repetition:
01 EMPLOYEE.
05 NAME PIC X(20).
05 SKILLS.
10 SKILL PIC X(10) OCCURS 5 TIMES.
- Object‑oriented models are more flexible but follow the same pattern with embedded objects, arrays, or lists:
@Entity
class Employee {
@Id Long id;
String name;
@ElementCollection
List<String> skills;
}
These models express semantics that do not exist in relational algebra:
- ownership: the skills belong to the employee
- lifecycle: the root aggregate is deleted with all its elements
- cardinality: this model supposes a short and bounded list of skills per employee, but a large and growing number of employees
- locality: employee's skills are queried with the employee, and stored together on disk
Those models are optimized for application behavior, not for algebraic reasoning. This is not new. It is how applications have always been written. Why do relational systems require data to be represented as flat relations at the logical level?
Why the relational model exists
When E. F. Codd published “Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks” in 1969, his objective was not to help developers write applications, but to establish a formal mathematical foundation for data management.
He based the relational model and relational algebra on mathematics:
-
Set theory for relations, with operations such as
∪(union),∩(intersection),−(difference), and×(Cartesian product) -
First‑order predicate logic for constraints and querying: selection (
σ) corresponds to logical predicates, and joins correspond to conjunction with implicit existential quantification (∃) over the join attributes - A closed algebraic query language at the logical level, where every operation produces another relation
Within this framework, a relation is defined as:
- a set of tuples (unordered, with no duplicates)
- where all tuples share the same attributes
- and every attribute value is drawn from a simple (atomic) domain
These properties are not modeling advice. They are the definition.
First normal form
First Normal Form (1NF) is often presented as a design guideline. In Codd’s original work, it is not. It is a mandatory constraint to apply the first‑order predicate logic.
Without atomic attribute values, relations cease to be relations as defined by Codd, and the standard relational algebra no longer applies. Comparisons become ambiguous, predicates are no longer boolean, and algebraic equivalence rules break down.
Relational algebra is a closed system where inputs and outputs are relations. Its operators—selection (σ), projection (π), join (⨝)—all assume that:
- attributes can be compared using equality
- predicates evaluate to true or false
- each comparison involves exactly one value
This is what enables equivalence rules, join reordering, and provable optimizations, and the maths is defined for atomic values only.
Let's see some examples with SQL, which is inspired by the relational model (but is not itself a pure relational language).
1NF to apply mathematical operations to relations
Here is an example with a table of employees' skills:
CREATE TABLE employee_skill (
employee_name TEXT,
skill TEXT
);
INSERT INTO employee_skill VALUES
('Ann', 'SQL'),
('Ann', 'Java'),
('Ann', 'Python'),
('Bob', 'Java')
;
A simple selection involves a predicate comparing a column to a value:
SELECT DISTINCT *
FROM employee_skill
WHERE skill = 'Java'
;
It returns another relation with only the facts that verify the predicate:
employee_name | skill
---------------+-------
Ann | Java
Bob | Java
This works because skill is atomic, equality has one meaning, and the result is still a relation.
This collapses without 1NF
Here is another model, simple, but that violates 1NF:
CREATE TABLE employee_array (
name TEXT,
skills TEXT[]
);
INSERT INTO employee_array VALUES
('Ann', ARRAY['SQL', 'Java', 'Python']),
('Bob', ARRAY['Java'])
;
The same predicate no longer applies:
SELECT DISTINCT *
FROM employee_array
WHERE skills = 'Java'
;
ERROR: malformed array literal: "Java"
LINE 3: WHERE skills = 'Java';
^
DETAIL: Array value must start with "{" or dimension information.
PostgreSQL requires new operators:
SELECT DISTINCT *
FROM employee_array
WHERE 'Java' = ANY(skills)
;
name | skills
------+-------------------
Ann | {SQL,Java,Python}
Bob | {Java}
(2 rows)
SELECT DISTINCT *
FROM employee_array
WHERE skills @> ARRAY['Java']
;
name | skills
------+-------------------
Ann | {SQL,Java,Python}
Bob | {Java}
(2 rows)
These operators encode membership and containment. They are not part of relational algebra, and their exact semantics and syntax are vendor‑specific in SQL systems.
SQL/JSON does not restore relational semantics
JSON and JSONB datatypes do not change this:
CREATE TABLE employee_json (
doc JSONB
);
INSERT INTO employee_json VALUES
('{"name": "Ann", "skills": ["SQL", "Java", "Python"]}'),
('{"name": "Bob", "skills": ["Java"]}')
;
Selections rely on path navigation and containment, with special operators specific to the datatype:
SELECT DISTINCT *
FROM employee_json
WHERE doc->'skills' ? 'Java'
;
doc
------------------------------------------------------
{"name": "Ann", "skills": ["SQL", "Java", "Python"]}
{"name": "Bob", "skills": ["Java"]}
(2 rows)
SELECT DISTINCT *
FROM employee_json
WHERE doc->'skills' @> '["Java"]'
;
doc
------------------------------------------------------
{"name": "Ann", "skills": ["SQL", "Java", "Python"]}
{"name": "Bob", "skills": ["Java"]}
(2 rows)
This involves again different datatypes, operators, semantic, and also indexing. A different type of index is required to serve those predicates, a GIN index in PostgreSQL, with different syntax and different capabilities.
JSON has been added to the SQL standard as SQL/JSON but this doesn't unify the semantics. For example, an SQL array starts at 1 and a JSON array starts at 0:
SELECT
skills[0] "0",
skills[1] "1",
skills[2] "2",
skills[3] "3"
FROM employee_array
;
0 | 1 | 2 | 3
---+------+------+--------
| SQL | Java | Python
| Java | |
(2 rows)
SELECT
doc->'skills'->0 "0",
doc->'skills'->1 "1",
doc->'skills'->2 "2",
doc->'skills'->3 "3"
FROM employee_json
;
0 | 1 | 2 | 3
--------+--------+----------+---
"SQL" | "Java" | "Python" |
"Java" | | |
(2 rows)
JSON support in RDBMS extends SQL beyond relational algebra and introduces datatype‑specific semantics that are not algebraically closed. This is expected and was foreseen when enforcing the first normal form. Codd’s insight was that once attributes stop being atomic, mathematics no longer dictates behavior. Meaning must be defined explicitly.
MongoDB’s added semantics
MongoDB embraces the document model directly to match the data representation in the domain model and application structures:
db.employees.insertMany([
{ name: "Bob", skills: "Java" },
{ name: "Ann", skills: ["SQL", "Java", "Python"] }
]);
This is intentionally not 1NF because multiple entities and values may belong to the same aggregate. The relational operations cannot simply use the mathematical definition.
Selection resembles the relational operation, but when applied to a non‑1NF collection, MongoDB defines an explicit, extended semantics:
db.employees.find({ skills: "Java" })
;
[
{
_id: ObjectId('69a0ccaece22bf6640d4b0c2'),
name: 'Bob',
skills: 'Java'
},
{
_id: ObjectId('69a0ccaece22bf6640d4b0c3'),
name: 'Ann',
skills: [ 'SQL', 'Java', 'Python' ]
}
]
The same predicate applies to scalars and arrays. The document matches if the value or any array element satisfies the filtering condition. The same in SQL would require a union between a query using the SQL selection and another using the JSON containment, and casting the final result to the same datatype.
With MongoDB, indexes, comparisons, and sorting follow the same rule, as confirmed by execution plans. I create an index on skills that can index scalar values as well as array items, with one index type, and a generic syntax:
db.employees.createIndex({ skills: 1 })
;
It is easy to verify that the index is used to find the documents on a multi-key path (which means including an array in the path):
db.employees.find(
{ skills: "Java" }
).explain().queryPlanner.winningPlan
{
isCached: false,
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { skills: 1 },
indexName: 'skills_1',
isMultiKey: true,
multiKeyPaths: { skills: [ 'skills' ] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { skills: [ '["Java", "Java"]' ] }
}
}
The multi-key index has one index entry for each item when it is an array. When a comparison operator is applied to an array field, MongoDB compares the query value to each array element individually.
I used an equality predicate but the indexBounds in the execution plan show that the same can apply to a range. The same index is used for non-equality predicates:
db.employees.find({ skills: { $gt: "M"} })
;
[
{
_id: ObjectId('69a0ccaece22bf6640d4b0c3'),
name: 'Ann',
skills: [ 'SQL', 'Java', 'Python' ]
}
]
db.employees.find({ skills: { $lt: "M"} })
;
[
{
_id: ObjectId('69a0ccaece22bf6640d4b0c2'),
name: 'Bob',
skills: 'Java'
},
{
_id: ObjectId('69a0ccaece22bf6640d4b0c3'),
name: 'Ann',
skills: [ 'SQL', 'Java', 'Python' ]
}
]
Only Ann has a skill with a name after 'M' in the alphabet. Both Ann and Bob have a skill with a name before 'M' in the alphabet.
When sorting on an array field, MongoDB uses the minimum array element for ascending sort and the maximum array element for descending sort, according to BSON comparison order:
db.employees.find().sort({ skills: 1 })
;
[
{
_id: ObjectId('69a0ccaece22bf6640d4b0c2'),
name: 'Bob',
skills: 'Java'
},
{
_id: ObjectId('69a0ccaece22bf6640d4b0c3'),
name: 'Ann',
skills: [ 'SQL', 'Java', 'Python' ]
}
]
db.employees.find().sort({ skills: -1 })
;
[
{
_id: ObjectId('69a0ccaece22bf6640d4b0c3'),
name: 'Ann',
skills: [ 'SQL', 'Java', 'Python' ]
},
{
_id: ObjectId('69a0ccaece22bf6640d4b0c2'),
name: 'Bob',
skills: 'Java'
}
]
Here, 'Java' is the first in alphabetical order, so both employees are at the same rank in an ascending sort, but 'SQL' is the last in alphabetical order so 'Ann' appears first in a descending sort.
Again, the index is used:
db.employees.find().sort({ skills: 1 }).explain().queryPlanner.winningPlan
;
{
isCached: false,
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { skills: 1 },
indexName: 'skills_1',
isMultiKey: true,
multiKeyPaths: { skills: [ 'skills' ] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { skills: [ '[MinKey, MaxKey]' ] }
}
}
MongoDB is optimized for scalable OLTP, index access is a must for equality and range predicates as well as sorting for pagination. In SQL databases, inverted indexes such as GIN are typically specialized for containment and equality predicates and offer more limited support for range ordering and pagination than B‑tree indexes.
Not forcing First Normal Form allows storage and indexing to remain efficient:
- compound index may include fields from multiple entities within one aggregate
- storage involves a single disk I/O per aggregate
By deviating from 1NF, closure is not guaranteed—by design. An explicit $unwind operation in an aggregation pipeline can normalize the result to a relation, in its mathematical sense, if needed.
With MongoDB, we can list the skills of employees who have a 'Java' skills, with all their skill, as a relational result:
db.employees.aggregate([
{ $match: { skills: "Java" } },
{ $unwind: "$skills" },
{ $project: { _id: 0, name: "$name", skill: "$skills" } }
])
;
[
{ name: 'Bob', skill: 'Java' },
{ name: 'Ann', skill: 'SQL' },
{ name: 'Ann', skill: 'Java' },
{ name: 'Ann', skill: 'Python' }
]
This simple query in MongoDB would be much more complex with the PostgreSQL examples above:
-- Correlated semi-join (EXISTS) over a normalized relation
SELECT DISTINCT es.employee_name AS name, es.skill
FROM employee_skill es
WHERE EXISTS (
SELECT 1
FROM employee_skill j
WHERE j.employee_name = es.employee_name
AND j.skill = 'Java'
;
-- Existential quantification (ANY) over a non-1NF attribute (ARRAY) with explicit normalization (UNNEST)
SELECT DISTINCT ea.name, s.skill
FROM employee_array ea
CROSS JOIN LATERAL unnest(ea.skills) AS s(skill)
WHERE 'Java' = ANY (ea.skills)
;
-- JSON containment predicate (@>) with explicit normalization (jsonb_array_elements)
SELECT DISTINCT doc->>'name' AS name, skill
FROM employee_json
CROSS JOIN LATERAL jsonb_array_elements_text(doc->'skills') AS s(skill)
WHERE doc->'skills' @> '["Java"]'
;
Those queries must return all employees's skill, not only the one used for the filter, because they are part of the same aggregate. With those SQL queries, the object-relational mapping (ORM) must regroup those rows to build the aggregate.
In practice, the MongoDB query will not even $unwind to mimic a relation as it gets directly the aggregate:
db.employees.aggregate([
{ $match: { skills: "Java" } },
{ $project: { _id: 0, name: 1, skills: 1 } }
])
;
[
{ name: 'Bob', skills: 'Java' },
{ name: 'Ann', skills: [ 'SQL', 'Java', 'Python' ] }
]
With this query, MongoDB returns the binary BSON object directly to the application driver, instead of converting it into records or JSON like most SQL databases.
Conclusion
We exposed the enhanced semantics for selection over a non-1NF collection, as an example. MongoDB does more than enhance selection. All relational operations are extended with a document semantics:
- Selection works over scalars and arrays
- Projection reshapes documents
- Sort semantics are defined over arrays
- Indexes apply uniformly to scalars and array elements
- Joins exist ($lookup), and the semantics is defined even if the key is an array.
Relational theory is independent of physical implementation, but most RDBMS map each relation to a single table, and an index can cover the columns from a single table. Relational databases stem from mathematics and use normalization to structure storage. In contrast, applications center on aggregates, and MongoDB preserves these aggregates down to the storage layer.
First Normal Form (1NF) is required by the relational model—and therefore by relational SQL databases—because relational algebra assumes atomic attributes. In contrast, document databases relax these algebraic constraints in favor of explicit, document-oriented semantics, enabling similar operations to be performed directly on documents. So, when you hear that “data is relational” or feel you must apply normal forms to your domain model, recognize that this perspective is tied to a specific relational database implementation, not the nature of your data. The same data can support the same application with a document model.
Both models are valid but optimize for different abstraction layers: relational algebra offers a logical model for reasoning about data independently of the domain, while document databases model data as applications already do.