In a relational database, a one-to-many relationship is typically implemented with two tables. The "one" side is the parent and has a primary key that is indexed to guarantee uniqueness. The "many" side is the child and references the parent’s primary key with a foreign key. An index is often added to the foreign key as well, so that operations on the parent can efficiently locate its child rows.
While this design seems straightforward and ideal from a database administrator’s point of view—where data integrity is the primary concern—it can surprise developers. This normalized data model does not account for access patterns or cardinalities: the same structure is used whether the "many" side contains millions of rows and keeps growing or only a few items.
MongoDB takes a different approach: its data model is optimized for a specific application, based on known access patterns and cardinalities. In a document model, a one-to-many relationship can be implemented either as multiple documents linked by references or as an embedded array or subdocument within a single document. In both cases, you can choose whether to embed or reference from the parent (the "one" side) or from the child (the "many" side).
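As a minimal sketch in plain JavaScript object notation (hypothetical field names and data, not the dataset used below), the two options for the one-to-many relationship look like this:

```javascript
// One department, many employees.

// Option 1: embed the "one" side into each "many"-side document.
const employeeEmbedded = {
  name: "Ada Lovelace",
  department: { name: "IT", description: "Maintains technology infrastructure." },
};

// Option 2: keep departments separate and store only a reference.
const departments = {
  42: { name: "IT", description: "Maintains technology infrastructure." },
};
const employeeReferenced = { name: "Ada Lovelace", departmentId: 42 };

// Embedding answers "which department?" in a single read;
// referencing requires resolving the id first.
const viaEmbed = employeeEmbedded.department.name;
const viaRef = departments[employeeReferenced.departmentId].name;
```

Both reads return the same answer; the difference is how many documents must be touched to get it, and how many places must change when department data changes.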
An example
I use an HR dataset of employees, which I load into an "employees" collection. It contains two million documents:
cd /var/tmp
## Download HR_Dataset from Kaggle (https://www.kaggle.com/datasets/kadirduran/hr-dataset),
## then unzip and import
curl -L -o hr-data-mnc.zip https://www.kaggle.com/api/v1/datasets/download/rohitgrewal/hr-data-mnc &&
unzip -o hr-data-mnc.zip &&
mongoimport -d mdb -c employees --type=csv --headerline --drop 'HR_Data_MNC_Data Science Lovers.csv'
Once imported, I connect with mongosh and add random decimal digits to the performance ratings, to make individual documents easier to distinguish when comparing results:
use mdb;
db.employees.updateMany({}, [
  {
    $set: {
      Performance_Rating: {
        $add: [ { $toDouble: "$Performance_Rating" }, { $rand: {} } ]
      }
    }
  }
]);
This collection contains employees associated with a department name. I’ll add additional details about each department, such as a description, and then explore alternative models for this one-to-many relationship, where one department has many employees and each employee belongs to one department.
For each model, I'll look at the performance of the following query:
Let's identify the top 10 most outstanding active employees in the IT department and list their names along with their performance ratings.
Embed in the "many" side
The many-to-one relationship here is employee-to-department, as each employee has a department name:
{
  _id: ObjectId('693f1d61e235ef0960ae2b53'),
  'Unnamed: 0': 22,
  Employee_ID: 'EMP0000023',
  Full_Name: 'James Valdez',
  Department: 'R&D',
  Job_Title: 'Research Scientist',
  Hire_Date: '2017-10-30',
  Location: 'East Scott, Mauritius',
  Performance_Rating: 5.882713091486423,
  Experience_Years: 7,
  Status: 'Retired',
  Work_Mode: 'On-site',
  Salary_INR: 789283
}
Since we generally embed more than one field, I add a description and restructure the department as a sub-object:
db.employees.aggregate([
  {
    $addFields: {
      Department: {
        $switch: {
          branches: [
            { case: { $eq: ["$Department", "Finance"] }, then: { Name: "Finance", Description: "Manages the company’s budgets, expenses, and financial planning to ensure fiscal health." } },
            { case: { $eq: ["$Department", "HR"] }, then: { Name: "HR", Description: "Handles recruitment, employee relations, and organizational development initiatives." } },
            { case: { $eq: ["$Department", "IT"] }, then: { Name: "IT", Description: "Maintains technology infrastructure, software systems, and cybersecurity protections." } },
            { case: { $eq: ["$Department", "Marketing"] }, then: { Name: "Marketing", Description: "Promotes the company’s products and services through strategic campaigns and market research." } },
            { case: { $eq: ["$Department", "Operations"] }, then: { Name: "Operations", Description: "Oversees daily business activities, logistics, and process optimization for efficiency." } },
            { case: { $eq: ["$Department", "R&D"] }, then: { Name: "R&D", Description: "Researches and develops innovative products and services to support future growth." } },
            { case: { $eq: ["$Department", "Sales"] }, then: { Name: "Sales", Description: "Builds customer relationships and drives revenue through product and service sales." } }
          ],
          default: { Name: "$Department", Description: "No description available" }
        }
      }
    }
  },
  {
    $merge: {
      into: "employees",       // same collection
      whenMatched: "merge",    // update existing docs
      whenNotMatched: "fail"
    }
  }
])
The result for the same employee is:
{
  _id: ObjectId('693f1d61e235ef0960ae2b53'),
  'Unnamed: 0': 22,
  Employee_ID: 'EMP0000023',
  Full_Name: 'James Valdez',
  Department: {
    Name: 'R&D',
    Description: 'Researches and develops innovative products and services to support future growth.'
  },
  Job_Title: 'Research Scientist',
  Hire_Date: '2017-10-30',
  Location: 'East Scott, Mauritius',
  Performance_Rating: 5.882713091486423,
  Experience_Years: 7,
  Status: 'Retired',
  Work_Mode: 'On-site',
  Salary_INR: 789283
}
I want to retrieve the top 10 best-performing active employees from the IT department and display their names and performance ratings:
db.employees.find(
  { "Status": "Active", "Department.Name": "IT" },
  { _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10)
// result:
[
  { Full_Name: 'Stuart Lopez', Performance_Rating: 5.999973276392604 },
  { Full_Name: 'Mr. Ethan Morton', Performance_Rating: 5.9999561502903065 },
  { Full_Name: 'Lee White', Performance_Rating: 5.999935393136708 },
  { Full_Name: 'Amber Coleman', Performance_Rating: 5.999919949194189 },
  { Full_Name: 'Eugene Brown', Performance_Rating: 5.999917240114123 },
  { Full_Name: 'Nicole Edwards', Performance_Rating: 5.999914413630196 },
  { Full_Name: 'Erika Stewart', Performance_Rating: 5.999902351452448 },
  { Full_Name: 'Jenna King', Performance_Rating: 5.999896490219257 },
  { Full_Name: 'Douglas Hill', Performance_Rating: 5.999886177014563 },
  { Full_Name: 'Richard Gonzalez', Performance_Rating: 5.999879794558417 }
]
Since there is no index, the query scans all documents, which takes 1.3 seconds:
x = db.employees.find(
  { "Status": "Active", "Department.Name": "IT" },
  { _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats");
printjson({
  nReturned: x.executionStats.nReturned,
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined: x.executionStats.totalKeysExamined,
  totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
  nReturned: 10,
  executionTimeMillis: 1367,
  totalKeysExamined: 0,
  totalDocsExamined: 2000000
}
One benefit of embedding on the "many" side is that you can use all fields to create a compound index. For instance, I can build an index that supports my filter, sort, and projection needs:
db.employees.createIndex({
"Status":1, // for equality predicate on employee
"Department.Name":1, // for equality predicate on department
"Performance_Rating": 1, // for sort and limit (pagination)
"Full_Name": 1, // for projection (covering index)
})
The query now instantly retrieves the top 10 documents from the index:
x = db.employees.find(
  { "Status": "Active", "Department.Name": "IT" },
  { _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats");
printjson({
  nReturned: x.executionStats.nReturned,
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined: x.executionStats.totalKeysExamined,
  totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 0
}
Embedding data on the "many" side helps create optimized indexes and improves response times but involves duplicating data from the "one" side into the "many" side. In our example, each employee records the department name and description. This leads to two main effects:
- Increased storage usage, which can be reduced through compression. You might also opt not to embed all fields—such as storing only the department name, which is often queried with the employee, and keeping the description in a separate "departments" collection.
- Any update to the department information must be reflected across all associated employee records. This is generally manageable for infrequent changes, like name updates. Often, a department name change coincides with broader reorganizations, requiring employee record updates anyway.
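The second effect can be sketched in plain JavaScript with in-memory stand-ins (hypothetical data, not the actual collections): with embedding, a department rename requires one write per employee in that department, whereas a referenced model would need only a single write to the department document.

```javascript
// In-memory stand-in for the "employees" collection with embedded departments.
const employees = [
  { name: "Ada",   department: { name: "R&D", description: "Research..." } },
  { name: "Bob",   department: { name: "R&D", description: "Research..." } },
  { name: "Carol", department: { name: "IT",  description: "Tech..." } },
];

// Renaming "R&D" must touch every employee that embeds it.
let touched = 0;
for (const e of employees) {
  if (e.department.name === "R&D") {
    e.department.name = "Research";
    touched++;
  }
}
// touched counts one write per employee in the renamed department (2 here);
// with a reference, the same rename would be a single write.
```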
Reference and lookup from the "many" side
To minimize duplication, I create a separate "departments" collection using the unique department names and descriptions I embedded, ensuring each department's information is stored only once:
db.employees.aggregate([
  {
    $group: {
      _id: "$Department.Name",
      Name: { $first: "$Department.Name" },
      Description: { $first: "$Department.Description" }
    }
  },
  { $project: { _id: 0, Name: 1, Description: 1 } },
  {
    $merge: {
      into: "departments",
      whenMatched: "keepExisting",
      whenNotMatched: "insert"
    }
  }
]);
You might be surprised by the speed of this aggregation pipeline. Instead of scanning all documents, MongoDB efficiently retrieved the unique departments by searching for distinct values in the index (a loose index scan).
Then, I can substitute the "Department" sub-object with a reference to the "_id" from the "departments" collection:
db.employees.aggregate([
  {
    $lookup: {
      from: "departments",
      localField: "Department.Name",
      foreignField: "Name",
      as: "deptInfo"
    }
  },
  {
    $addFields: {
      Department: { $arrayElemAt: ["$deptInfo._id", 0] }
    }
  },
  { $project: { deptInfo: 0 } },
  {
    $merge: {
      into: "employees",
      whenMatched: "merge",
      whenNotMatched: "fail"
    }
  }
]);
Here is the shape of an employee document with a single field for the department:
{
  _id: ObjectId('693f1d61e235ef0960ae2b53'),
  'Unnamed: 0': 22,
  Employee_ID: 'EMP0000023',
  Full_Name: 'James Valdez',
  Department: ObjectId('693f2e38c2dd5ab4fbfd73b8'),
  Job_Title: 'Research Scientist',
  Hire_Date: '2017-10-30',
  Location: 'East Scott, Mauritius',
  Performance_Rating: 5.882713091486423,
  Experience_Years: 7,
  Status: 'Retired',
  Work_Mode: 'On-site',
  Salary_INR: 789283
}
To find the top 10 highest-performing active employees in the IT department and display their names and ratings, I join the "departments" collection using $lookup. Since $lookup incurs a cost, it's more efficient to filter the data beforehand. Therefore, I first apply $match to filter employees by status, then perform the $lookup against the "departments" collection through the reference, and finally filter on the department name fetched from the foreign collection:
x = db.employees.aggregate([
  { $match: { Status: "Active" } },
  {
    $lookup: {
      from: "departments",
      localField: "Department",
      foreignField: "_id",
      as: "deptInfo"
    }
  },
  { $unwind: "$deptInfo" },
  { $match: { "deptInfo.Name": "IT" } },
  { $project: { _id: 0, Full_Name: 1, Performance_Rating: 1 } },
  { $sort: { Performance_Rating: -1 } },
  { $limit: 10 }
]).explain("executionStats")
print(x.stages[1])
print(x.stages[1])
// Execution plan for the lookup stage:
{
  '$lookup': {
    from: 'departments',
    as: 'deptInfo',
    localField: 'Department',
    foreignField: '_id',
    let: {},
    pipeline: [
      { '$match': { Name: { '$eq': 'IT' } } }
    ],
    unwinding: { preserveNullAndEmptyArrays: false }
  },
  totalDocsExamined: Long('1401558'),
  totalKeysExamined: Long('1401558'),
  collectionScans: Long('0'),
  indexesUsed: [ '_id_' ],
  nReturned: Long('421333'),
  executionTimeMillisEstimate: Long('94596')
}
I printed the statistics for the $lookup stage, which examined 1,401,558 index keys and documents, one per active employee. Each access was fast thanks to the index on "_id", but performing it 1.4 million times took over a minute. The stage returned only 421,333 documents because the query planner pushed the department-name filter down from the subsequent $unwind and $match stages into the $lookup pipeline. The main issue remains: the same handful of department documents is read over a million times.
If you have a many-to-one relationship and numerous documents on the many side, it's better to join from the application rather than using an aggregation pipeline lookup.
I retrieved the "_id" of the departments I am interested in:
var itDeptId = db.departments.findOne({ Name: "IT" })._id;
If the department names are unlikely to change, this can be run once and stored in the application's cache.
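That two-step pattern, resolving the department "_id" once and then querying employees directly, can be sketched in plain JavaScript with in-memory stand-ins for the two collections (hypothetical data):

```javascript
// In-memory stand-ins for the "departments" and "employees" collections.
const departments = [
  { _id: "d1", Name: "IT" },
  { _id: "d2", Name: "HR" },
];
const employees = [
  { Full_Name: "A", Status: "Active",  Department: "d1", Performance_Rating: 5.9 },
  { Full_Name: "B", Status: "Active",  Department: "d2", Performance_Rating: 5.8 },
  { Full_Name: "C", Status: "Retired", Department: "d1", Performance_Rating: 5.7 },
  { Full_Name: "D", Status: "Active",  Department: "d1", Performance_Rating: 5.6 },
];

// Step 1: resolve the department _id once (cacheable in the application).
const itDeptId = departments.find((d) => d.Name === "IT")._id;

// Step 2: query employees using only that reference; no join needed.
const top = employees
  .filter((e) => e.Status === "Active" && e.Department === itDeptId)
  .sort((a, b) => b.Performance_Rating - a.Performance_Rating)
  .slice(0, 10)
  .map((e) => ({ Full_Name: e.Full_Name, Performance_Rating: e.Performance_Rating }));
```

Step 2 maps directly to a single indexed `find()` on the server, which is why the application-side join avoids the per-document lookup cost.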
Then I can access employees filtered by the reference '_id', so I will create an index for it:
db.employees.createIndex(
{ Status: 1, Department: 1, Performance_Rating: -1, Full_Name: 1 }
)
Since I'm querying just a single collection, I don't require an aggregation pipeline:
x = db.employees.find(
  {
    Status: "Active",
    Department: itDeptId
  },
  {
    _id: 0,
    Full_Name: 1,
    Performance_Rating: 1
  }
).sort({ Performance_Rating: -1 }).limit(10).explain("executionStats");
printjson({
  nReturned: x.executionStats.nReturned,
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined: x.executionStats.totalKeysExamined,
  totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 0
}
Finally, when the application accesses the lookup table first, I observe the same performance with a many-to-one reference as with embedding the single item into the many.
Reference from the "many" side, but lookup from the "one" side
Instead of coordinating the two steps in the application, I can perform the same operation with a single aggregation pipeline that starts from the departments and joins the employees via a lookup. I use a lookup pipeline to add the filter on active employees:
x = db.departments.aggregate([
  { $match: { Name: "IT" } },
  {
    $lookup: {
      from: "employees",
      let: { deptId: "$_id" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$Department", "$$deptId"] },
                { $eq: ["$Status", "Active"] }
              ]
            }
          }
        },
        { $sort: { Performance_Rating: -1 } },
        { $limit: 10 },
        { $project: { _id: 0, Full_Name: 1, Performance_Rating: 1 } }
      ],
      as: "employees"
    }
  },
  { $project: { _id: 0, Name: 1, employees: 1 } }
]).explain("executionStats")
print(x.stages[1])
// Execution plan for the lookup stage:
{
  '$lookup': {
    from: 'employees',
    as: 'employees',
    let: { deptId: '$_id' },
    pipeline: [
      {
        '$match': { '$expr': { '$and': [ [Object], [Object] ] } <... (truncated)
by Franck Pachot
Murat Demirbas
Model minimalistically
Start from a tiny core, and always keep a working model as you extend. Your default should be omission. Add a component only when you can explain why leaving it out would not work. Most models are about a slice of behavior, not the whole system in full glory: E.g., Leader election, repair, reconfiguration. Cut entire layers and components if they do not affect that slice. Abstraction is the art of knowing what to cut. Deleting should spark joy.
Model specification, not implementation
Write declaratively. State what must hold, not how it is achieved. If your spec mirrors control flow, loops, or helper functions, you are simulating code. Cut it out. Every variable must earn its keep. Extra variables multiply the state space (model checking time) and hide bugs. Ask yourself repeatedly: can I derive this instead of storing it? For example, you do not need to maintain a WholeSet variable if you can define it as a state function of existing variables: WholeSet == provisionalItems \union nonProvisionalItems.
Review the model for illegal knowledge
Do a full read-through of your model and check what each process can really see. TLA+ makes it easy to read global state (or another process's state) that no real distributed process could ever observe atomically. This is one of the most common modeling errors. Make a dedicated pass to eliminate illegal global knowledge.
Check atomicity granularity
Push actions to be as fine-grained as correctness allows. Overly large atomic actions hide races and invalidate concurrency arguments. Fine-grained actions expose the real interleavings your protocol must tolerate.
Think in guarded commands, not procedures
Each action should express one logical step in guarded-command style. The guard should ideally define the meaning of the action. Put all enablement conditions in the guard. If the guard holds, the action may fire at any time in true event-driven style. This is why I now prefer writing TLA+ directly over PlusCal: TLA+ forces you to think in guarded-command actions, which is how distributed algorithms are meant to be designed. Yes, PlusCal is easier for developers to read, but it also nudges you toward sequential implementation-shaped thinking. And recently, with tools like Spectacle, sharing and visually exploring TLA+ specs got much easier.
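As a sketch, a lock-granting action in guarded-command style might look like this (hypothetical variable names: `lock` holds the current owner or "free", `waiting` is the set of processes requesting it):

```tla
Grant(n) ==
  /\ lock = "free"              \* guard: every enablement condition lives here
  /\ n \in waiting              \* guard: n must actually be waiting
  /\ lock' = n                  \* effect: n acquires the lock
  /\ waiting' = waiting \ {n}   \* effect: n stops waiting
```

If the conjunction of the guards holds, the action may fire at any time; there is no sequential control flow to reason about.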
Step back and ask what you forgot to model
There is no substitute for thinking hard about your system. TLA+ modeling is only there to help you think hard about your system, and cannot substitute thinking about it. Check that you incorporated all relevant aspects: failures, message reordering, repair, reconfiguration.
Write TypeOK invariants
TLA+ is not typed, so you should state types explicitly and early by writing TypeOK invariants. A good TypeOK invariant provides an executable documentation for your model. Writing this in seconds can save you many minutes of hunting runtime bugs through TLA+ counterexample logs.
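For example, with hypothetical variables `lock` and `waiting` ranging over a set of processes `Node`, a TypeOK invariant is just a conjunction of type assertions:

```tla
TypeOK ==
  /\ lock \in Node \cup {"free"}   \* the current owner, or "free"
  /\ waiting \subseteq Node        \* the set of processes waiting for the lock
```

TLC checks this on every reachable state, so a type error surfaces immediately instead of as a confusing counterexample several steps later.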
Write as many invariants as you can
If a property matters, make it explicit as an invariant. Write them early. Expand them over time. Try to keep your invariants as tight as possible. Document your learnings about invariants and non-invariants. A TLA+ spec is a communication artifact. Write it for readers, not for the TLC model checker. Be explicit and boring for the sake of clarity.
Write progress properties
Safety invariants alone are not enough. Check that things eventually happen: requests complete, leaders emerge, and goals are accomplished. Many "correct" models may quietly do nothing forever. Checking progress properties catches paths that stall.
Be suspicious of success
A successful TLC run proves nothing unless the model explores meaningful behavior. Low coverage or tiny state spaces usually mean the model is over-constrained or wrong. Break the spec on purpose to check that your spec is actually doing some real work, and not giving up in a vacuous/trivial way. Inject bugs on purpose. If your invariants do not fail, they are too weak. Test the spec by sabotaging it.
Optimize model checking efficiency last
Separate the model from the model checker. The spec should stand on its own. Using the cfg file, you can optimize for model checking by using appropriate configuration, constraints, bounds for counters, and symmetry terms.
You can find many examples and walkthroughs of TLA+ specifications on my blog.
There are many more in the TLA+ repo as well.
by Murat (noreply@blogger.com)
Kyle Kingsbury (Aphyr / Jepsen)
Here’s a page from AEG Test (archive), a company which sells radiation detectors, talking about the safety of uranium glass. Right from the get-go it feels like LLM slop. “As a passionate collector of uranium glass,” the unattributed author begins, “I’ve often been asked: ‘Does handling these glowing antiques pose a health risk?’” It continues into SEO-friendly short paragraphs, each with a big header and bullet points. Here’s one:
Uranium glass emits low levels of alpha and beta radiation, detectable with a Geiger counter. However, most pieces register less than 10 microsieverts per hour (μSv/h), which is:
- Far below the 1,000 μSv annual limit recommended for public exposure.
- Comparable to natural background radiation from rocks, soil, or even bananas (which contain potassium-40, a mildly radioactive isotope).
First, uranium glass emits gamma rays too, not just alpha and beta particles. More importantly, these numbers are hot nonsense.
First, the Sievert is a measure of dose, not source intensity; saying a piece emits 10 µSv/hour is like saying a microwave oven emits sixty degrees of warming per minute. It depends on whether the food is in the microwave or outside it, how much food there is, whether it was shielded, and so on. The dose from a uranium glass cup depends on whether you’re chipping off bits and eating them, cuddling it every night, or keeping it in a display case.
10 μSv/hour is 87,600 μSv/year. How is that “far below” 1,000 μSv/year? If you’ve got a uranium glass candy dish on your desk that delivers 10 µSv/hour to your body, and you keep that up for eight hours a day, you’re looking at 29,200 µSv (29.2 mSv) per year. That’s over the DHS emergency guidelines for public relocation, and about half of the NRC dose limit for radiation workers.
The other comparisons are also bonkers: 10 μSv/hour is not comparable to typical background radiation: in the US, that’s roughly 3,100 μSv/year, or 0.35 μSv/hour. Nor is it on par with a banana: the banana equivalent dose is very roughly 0.1 μSv. Nobody is eating 100 bananas an hour.
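The arithmetic behind those comparisons is easy to check (a quick sanity check, units in µSv):

```javascript
// Dose arithmetic from the figures above (all values in µSv).
const hoursPerYear = 24 * 365;                 // 8,760 hours in a year
const continuousYear = 10 * hoursPerYear;      // 10 µSv/h around the clock → 87,600 µSv/year
const workdayYear = 10 * 8 * 365;              // 10 µSv/h for 8 h/day → 29,200 µSv/year
const backgroundPerHour = 3100 / hoursPerYear; // US background ≈ 0.35 µSv/h
```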
The best source I know of for uranium glass safety (and which, now that we’re drowning in LLM slop, is surprisingly hard to actually track down) is the Nuclear Regulatory Commission’s NUREG-1717. The section on glassware begins on page 499 of the PDF, labeled 3-217. You should read their methods for estimating dosages, as exposure is highly dependent on uranium density, exposure vector, acidity, distance, etc. The NRC estimated a negligible 1.8 × 10⁻⁵ mSv/year (0.018 μSv/year) from drinking glass leachate, up to 0.02 mSv/year (20 μSv/year) from external exposure to drinking glasses (e.g. washing dishes, being in the same room, etc.), and 0.002 mSv/year (2 μSv/year) from occasional handling, admiring, and generally being around four pieces of decorative glassware scattered through a home. These exposures are almost certainly fine.
Please stop asking large language models to tell you or others about radiation safety. Ask a physicist or regulator instead.
by Aphyr
Percona Database Performance Blog
This article explains how you can perform Point-in-Time Recovery (PITR) in Valkey/Redis. Requirements: to perform PITR, you need to have append-only logging enabled. By default, AOF in Valkey/Redis only records the operations that have been executed against the instance, not when they were executed. For that, we need to enable the aof-timestamp-enabled parameter. So your Valkey/Redis […]
by Hieu Nguyen
PlanetScale Blog
We've lowered the entry price for using PlanetScale Metal to $50 and added more flexibility in storage-to-compute ratios.
ParadeDB Blog
How ParadeDB achieved 10x improved write throughput through searchable buffers, background merging, and handling Postgres HOT chains.
by Ming Ying