Paths of MySQL, vector search edition
This is an external post of mine. Click here if you are not redirected.
You may have learned about normal forms when databases were designed before the applications that used them. At that time, relational data models focused on enterprise-wide entities, defined before access patterns were known, so future applications could share a stable, normalized schema.
Today, we design databases for specific applications or bounded domains. Instead of defining a full model up front, we add features incrementally, gather feedback, and let the schema evolve with the application.
Normal forms aren't just relational theory—they describe real data dependencies. MongoDB's document model doesn't exempt you from thinking about normalization—it gives you more flexibility in how you apply it.
We're starting a new business: a large network of pizzerias across many areas with a wide variety of pizzas. But let's start small.
As a minimal viable product (MVP), each pizzeria has one manager, sells only one variety, and delivers to one area. You can choose any database for this: key-value, relational, document, or even a spreadsheet. The choice will matter only when your product evolves.
Here is our first pizzeria:
{
name: "A1 Pizza",
manager: "Bob",
variety: "Thick Crust",
area: "Springfield"
}
With no repeating groups or multi-valued attributes, the model is already in First Normal Form (1NF). Because the MVP data model is simple—one value per attribute and a single key—there are no dependencies that would violate higher normal forms.
Many database designs start out fully normalized, not because the designer worked through every normal form, but because the initial dataset is too simple for complex dependencies to exist.
Normalization becomes necessary later, as business rules evolve and new varieties, areas, and independent attributes introduce dependencies that higher normal forms address.
The business has started quite well and keeps evolving. A pizzeria can now offer several varieties.
The following, adding multiple varieties in a single field, would violate 1NF:
{
name: "A1 Pizza",
manager: "Bob",
varieties: "Thick Crust, Stuffed Crust",
area: "Springfield"
}
1NF requires atomic values—each field should hold one indivisible piece of data. A comma-separated string breaks this rule: you can manipulate it as text, but you can't treat each entry as a distinct pizza variety, so you can't easily query, index, or update individual varieties.
SQL and NoSQL databases avoid this pattern for different reasons. In a relational database, the logical model must be independent of cardinalities and access patterns. Because the relational model doesn't know whether there are two or one million pizza varieties, it treats every one-to-many relationship as unbounded and stores it in a separate table as a set of pizzeria–variety relationships rather than embedding varieties within the pizzeria entity.
Once we understand the application domain, we can set realistic bounds. Thousands of pizza varieties in the menu would be impractical from a business perspective well before hitting database limits, so storing the varieties together can be acceptable. When object-oriented applications use richer structures than two-dimensional tables, it's better to represent such lists as arrays rather than comma-separated strings:
{
name: "A1 Pizza",
manager: "Bob",
email: "bob@a1-pizza.it",
varieties: ["Thick Crust", "Stuffed Crust"]
}
Arrays of atomic values satisfy a document-oriented equivalent of 1NF—each element is atomic and independently addressable—even though the document model isn't bound by the relational requirement of flat tuples. While SQL databases provide abstraction and logical-physical data independence, MongoDB keeps data colocated down to the storage and CPU caches for more predictable performance.
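As a minimal mongosh sketch (assuming the pizzerias collection above, with "Thin Crust" as a hypothetical new variety), each array element can be indexed and matched on its own—exactly what the comma-separated string prevented:

db.pizzerias.createIndex({ varieties: 1 })           // becomes a multikey index over the array elements
db.pizzerias.find({ varieties: "Stuffed Crust" })    // matches any single element of the array
db.pizzerias.updateOne(
  { name: "A1 Pizza" },
  { $addToSet: { varieties: "Thin Crust" } }         // add one variety without string surgery
)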
Normal form definitions assume keys for each 1NF relation. In a document model, multiple relations can appear as embedded sub-documents or arrays. Treating the parent key and the array element together as a composite key lets us apply higher normal forms to analyze partial and transitive dependencies within a single document.
We want to add the price of the pizzas to our database. If each pizzeria defines its own base price, it can be added to the items in the varieties array:
{
name: "A1 Pizza",
manager: "Bob",
email: "bob@a1-pizza.it",
varieties: [
{ name: "Thick Crust", basePrice: 10 },
{ name: "Stuffed Crust", basePrice: 12 }
]
}
Second Normal Form (2NF) builds on 1NF by requiring that every non-key attribute depends on the entire primary key, not just part of it. This only becomes relevant when dealing with composite keys.
In our embedded model, consider the composite key ("pizzeria", "variety") for each item in the varieties array. If the price depends on the pizzeria and variety together—meaning different pizzerias can set different prices for the same variety—then "basePrice" depends on the full composite key, and we satisfy 2NF.
However, if prices are standardized across all pizzerias—the same variety costs the same everywhere—then a partial dependency exists: "basePrice" depends only on "variety", not on the full ("pizzeria", "variety") key. This violates 2NF.
To resolve this, we define pricing in a separate collection where the base price depends only on the pizza variety:
{ variety: "Thick Crust", basePrice: 10 }
{ variety: "Stuffed Crust", basePrice: 12 }
We can remove the base price from the pizzeria's varieties array and retrieve it from the pricing collection at query time:
db.createView(
"pizzeriasWithPrices",
"pizzerias",
[
{ $unwind: "$varieties" },
{
$lookup: {
from: "pricing",
localField: "varieties.name",
foreignField: "variety",
as: "priceInfo"
}
},
{ $unwind: "$priceInfo" },
{ $addFields: { "varieties.basePrice": "$priceInfo.basePrice" } },
{ $project: { priceInfo: 0 } }
]
);
Alternatively, we can use the pricing collection as a reference, where the application retrieves the price and stores it in the pizzeria document for faster reads.
To avoid update anomalies, the application updates all affected documents when a variety's price changes:
const session = db.getMongo().startSession();
const sessionDB = session.getDatabase(db.getName());
session.startTransaction();
sessionDB.getCollection("pricing").updateOne(
{ variety: "Thick Crust" },
{ $set: { basePrice: 11 } }
);
sessionDB.getCollection("pizzerias").updateMany(
{ "varieties.name": "Thick Crust" },
{ $set: { "varieties.$[v].basePrice": 11 } },
{ arrayFilters: [{ "v.name": "Thick Crust" }] }
);
session.commitTransaction();
SQL databases avoid such multiple updates because they're designed for direct end-user access, sometimes bypassing the application layer. Without applying normal forms to break dependencies into multiple tables, there's a risk of overlooking replicated data. A document database is updated by an application service responsible for maintaining consistency.
While normalizing to 2NF is possible, it may not always be the best choice in a domain-driven design. Keeping the price embedded in each pizzeria allows asynchronous updates and supports future requirements where some pizzerias may offer different prices—without breaking integrity, as the application enforces updates atomically.
In practice, many applications accept this controlled duplication when price changes are infrequent and prefer fast single-document reads over perfectly normalized writes.
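As a minimal sketch of that asynchronous propagation—assuming a replica set (change streams require one) and the pricing and pizzerias collections above—a small worker could watch pricing and fan new base prices out to the embedded copies:

// Hypothetical worker: propagate base-price changes from the pricing
// collection to the embedded copies in pizzerias (eventually consistent).
const cursor = db.pricing.watch(
  [ { $match: { operationType: { $in: ["update", "replace"] } } } ],
  { fullDocument: "updateLookup" }
);
while (cursor.hasNext()) {
  const change = cursor.next();
  const { variety, basePrice } = change.fullDocument;
  db.pizzerias.updateMany(
    { "varieties.name": variety },
    { $set: { "varieties.$[v].basePrice": basePrice } },
    { arrayFilters: [ { "v.name": variety } ] }
  );
}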
When we started, each pizzeria had a single email contact:
{
name: "A1 Pizza",
manager: "Bob",
email: "bob@a1-pizza.it",
varieties: [
{ name: "Thick Crust", basePrice: 10 },
{ name: "Stuffed Crust", basePrice: 12 }
]
}
Third Normal Form (3NF) builds on 2NF by requiring that non-key attributes depend only on the primary key, not on other non-key attributes. When a non-key attribute depends on another non-key attribute, we have a transitive dependency.
Here, the email actually belongs to the manager, not the pizzeria directly. This creates a transitive dependency: "pizzeria" → "manager" → "email". Since "email" depends on "manager" (a non-key attribute) rather than directly on the pizzeria, this violates 3NF.
We can normalize this by grouping the manager's attributes into an embedded subdocument:
{
name: "A1 Pizza",
manager: { name: "Bob", email: "bob@a1-pizza.it" },
varieties: [
{ name: "Thick Crust", basePrice: 10 },
{ name: "Stuffed Crust", basePrice: 12 }
]
}
Now the email is clearly an attribute of the manager entity embedded within the pizzeria. If a pizzeria has multiple managers, we can simply use an array of subdocuments without creating new collections or changing index definitions.
A generic relational model would probably split this into multiple tables, with manager being a foreign key to a "contacts" table. However, in our business domain, we don't manage contacts outside of pizzerias. Even if the same person manages multiple pizzerias, they're recorded as separate manager entries. Bob may have multiple emails and use different ones for each of his pizzerias.
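A minimal sketch of that flexibility (Carol and her address are hypothetical): the same index definition and query work whether manager is a single subdocument or an array of subdocuments, because MongoDB turns the index into a multikey index automatically when the field becomes an array.

db.pizzerias.createIndex({ "manager.email": 1 })

// Promote the single manager to an array of managers; no migration or new collection needed.
db.pizzerias.updateOne(
  { name: "A1 Pizza" },
  { $set: { manager: [
      { name: "Bob",   email: "bob@a1-pizza.it" },
      { name: "Carol", email: "carol@a1-pizza.it" }   // hypothetical second manager
  ] } }
)

// Dot notation matches the field in both shapes.
db.pizzerias.find({ "manager.email": "bob@a1-pizza.it" })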
We want to record the areas where a pizzeria can deliver its pizza varieties:
{
name: "A1 Pizza",
manager: { name: "Bob", email: "bob@a1-pizza.it" },
offerings: [
{ variety: { name: "Thick Crust", basePrice: 10 }, area: "Springfield" },
{ variety: { name: "Thick Crust", basePrice: 10 }, area: "Shelbyville" }
]
}
Fourth Normal Form (4NF) addresses multi-valued dependencies. A multi-valued dependency exists when one attribute determines a set of values for another attribute, independent of all other attributes. 4NF requires that a relation have no non-trivial multi-valued dependencies except on superkeys.
If varieties and areas were dependent—for example, if certain varieties were only available in certain areas—then storing ("variety", "area") combinations would represent a single multi-valued fact, and there would be no 4NF violation.
However, since our pizzerias deliver all varieties to all areas, these are independent multi-valued dependencies: "pizzeria" →→ "variety" and "pizzeria" →→ "area". Storing all combinations creates redundancy—if we add a new area, we must add entries for every variety.
We normalize by storing each independent fact in a separate array:
{
name: "A1 Pizza",
manager: { name: "Bob", email: "bob@a1-pizza.it" },
varieties: [
{ name: "Thick Crust", basePrice: 10 },
{ name: "Stuffed Crust", basePrice: 12 }
],
deliveryAreas: ["Springfield", "Shelbyville"]
}
With this schema, we avoid violating 4NF because delivery areas and varieties are stored independently—even though the document model allows us to embed them together.
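Storing the two facts independently doesn't make combinations harder to query; a minimal sketch, assuming the document above:

// Which pizzerias deliver Thick Crust to Springfield?
db.pizzerias.find({
  "varieties.name": "Thick Crust",
  deliveryAreas: "Springfield"
})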
Our network grows further. Some pizzerias now charge different prices depending on the delivery area—distant areas cost more:
{
name: "A1 Pizza",
manager: { name: "Bob", email: "bob@a1-pizza.it" },
offerings: [
{ variety: "Thick Crust", area: "Springfield", price: 10 },
{ variety: "Thick Crust", area: "Shelbyville", price: 11 },
{ variety: "Stuffed Crust", area: "Springfield", price: 12 },
{ variety: "Stuffed Crust", area: "Shelbyville", price: 13 }
]
}
The composite key for each offering is ("pizzeria", "variety", "area"). The price depends on the full key, satisfying 2NF and 3NF.
Now our franchise assigns an area manager to each area—one manager per area, regardless of pizzeria. We add it to our offerings:
offerings: [
{ variety: "Thick Crust", area: "Springfield", price: 10, areaManager: "Alice" },
{ variety: "Stuffed Crust", area: "Springfield", price: 12, areaManager: "Alice" },
{ variety: "Thick Crust", area: "Shelbyville", price: 11, areaManager: "Eve" },
{ variety: "Stuffed Crust", area: "Shelbyville", price: 13, areaManager: "Eve" }
]
Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF. It requires that for every non-trivial functional dependency X → Y, the determinant X must be a superkey. Unlike 3NF, BCNF doesn't make an exception for dependencies where the dependent attribute is part of a candidate key.
This model passes 3NF but fails BCNF: the dependency "area" → "areaManager" has a determinant ("area") that is not a superkey of the offerings relation. The area alone doesn't uniquely identify an offering—you need the full ("pizzeria", "variety", "area") key for that.
The practical problem: if Alice is replaced by Carol for Springfield, we must update every offering for that area across every pizzeria. The relational solution is to extract area managers to a separate table.
In MongoDB, we can keep the embedded structure and handle updates explicitly:
db.pizzerias.updateMany(
{ "offerings.area": "Springfield" },
{ $set: { "offerings.$[o].areaManager": "Carol" } },
{ arrayFilters: [{ "o.area": "Springfield" }] }
)
This trades strict BCNF compliance for simpler queries and faster reads. The application ensures consistency during updates.
We now offer multiple sizes (Small, Medium, Large). Sizes, varieties, and delivery areas are all independent—any combination is valid.
Storing every combination explodes quickly:
offerings: [
{ variety: "Thick Crust", size: "Large", area: "Springfield" },
{ variety: "Thick Crust", size: "Large", area: "Shelbyville" },
{ variety: "Thick Crust", size: "Medium", area: "Springfield" },
// ... 150 entries for 5 varieties × 3 sizes × 10 areas
]
Fifth Normal Form (5NF), also called Project-Join Normal Form, addresses join dependencies. A relation is in 5NF if it cannot be decomposed into smaller relations that, when joined, reconstruct the original—without losing information or introducing spurious tuples.
When valid combinations can be reconstructed from independent sets (the Cartesian product of varieties, sizes, and areas), storing all combinations explicitly creates redundancy and risks inconsistency. This violates 5NF.
The fix stores each independent fact separately:
{
name: "A1 Pizza",
varieties: ["Thick Crust", "Stuffed Crust"],
sizes: ["Large", "Medium"],
deliveryAreas: ["Springfield", "Shelbyville"]
}
Adding a new size requires updating one array—not hundreds of entries. The application or query logic reconstructs valid combinations when needed.
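As a minimal sketch under this schema, adding a size is a single array update, and the full combination list can still be materialized on demand with an aggregation:

// Add a new size once, instead of inserting dozens of offering entries.
db.pizzerias.updateOne(
  { name: "A1 Pizza" },
  { $addToSet: { sizes: "Small" } }
)

// Reconstruct the valid (variety, size, area) combinations when needed.
db.pizzerias.aggregate([
  { $match: { name: "A1 Pizza" } },
  { $unwind: "$varieties" },
  { $unwind: "$sizes" },
  { $unwind: "$deliveryAreas" },
  { $project: { _id: 0, variety: "$varieties", size: "$sizes", area: "$deliveryAreas" } }
])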
Our finance team needs to track price changes over time. We could embed the history:
offerings: [
{
variety: "Thick Crust",
area: "Springfield",
currentPrice: 12,
priceHistory: [
{ price: 10, effectiveDate: ISODate("2024-01-01") },
{ price: 11, effectiveDate: ISODate("2024-03-15") },
{ price: 12, effectiveDate: ISODate("2024-06-01") }
]
}
]
This works for moderate history but grows unboundedly over time.
Sixth Normal Form (6NF) decomposes relations so that each stores a single non-key attribute along with its time dimension. Every row represents one fact at one point in time:
// price_history collection
{ pizzeria: "A1 Pizza", variety: "Thick Crust", area: "Springfield", price: 10, effectiveDate: ISODate("2024-01-01") }
{ pizzeria: "A1 Pizza", variety: "Thick Crust", area: "Springfield", price: 11, effectiveDate: ISODate("2024-03-15") }
{ pizzeria: "A1 Pizza", variety: "Thick Crust", area: "Springfield", price: 12, effectiveDate: ISODate("2024-06-01") }
6NF is rarely used for operational data because it requires extensive joins for common queries. However, for auditing, analytics, and temporal queries—where you need to answer "what was the price on March 10th?"—it provides a clean model for tracking changes over time.
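A point-in-time lookup against this collection is then a single query (a sketch, assuming an index on the key fields plus effectiveDate):

// What was the Thick Crust price in Springfield on March 10th?
db.price_history.find({
  pizzeria: "A1 Pizza",
  variety: "Thick Crust",
  area: "Springfield",
  effectiveDate: { $lte: ISODate("2024-03-10") }
}).sort({ effectiveDate: -1 }).limit(1)
// Returns the 2024-01-01 entry: price 10 was still in effect.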
Normal forms are not a relic of relational theory. They describe fundamental data dependencies present in any system, regardless of storage technology. MongoDB’s document model does not remove the need to consider normalization. Instead, it lets you decide where, when, and how strictly to apply it, based on domain boundaries and access patterns.
In relational/SQL databases, schemas are usually designed as enterprise-wide information models. Many applications and users share the same database, accepting ad hoc SQL. To avoid update, insertion, and deletion anomalies in this shared environment, the schema must enforce functional dependencies, making higher normal forms essential. Because the database is the system of record, normalization centralizes integrity rules in the data model.
Modern architectures, by contrast, often follow Domain-Driven Design (DDD). Each bounded context owns its data model, which evolves with the application. With CQRS and microservices, each aggregate is updated only through a single application service that encapsulates business rules. Here, the database is not a shared integration point but a private persistence detail of the service.
MongoDB fits this style well:
Because one service owns all updates, violating higher normal forms can be acceptable—and sometimes beneficial—provided the service preserves its invariants. Normalization becomes a design tool, not a rigid checklist.
In short:
Normal forms still matter—but in MongoDB, they guide your choices instead of dictating your schema.
In this series, we explored several ways to solve the "Doctor's On-Call Shift" problem, which demonstrates write skew anomalies and the need for serializable transactions in SQL. Beyond using a serializable isolation level, we also addressed it with normalization, explicit parent locking, and SQL assertions. I applied document modeling in Postgres with SELECT FOR UPDATE as an alternative to a parent-child relationship, so it is natural to consider MongoDB. Since MongoDB lacks explicit locking and a serializable isolation level, we can instead use a simple update that atomically reads and writes in an optimistic concurrency control style.
Here is a collection with one document per shift and a list of doctors with their on-call status for this shift:
db.shifts.insertOne({
_id: 1,
doctors: [
{ name: "Alice", on_call: true, updated: new Date() },
{ name: "Bob", on_call: true, updated: new Date() }
]
});
The following function encapsulates the business logic in a single update: for a shift with at least one other doctor on call, one doctor can be taken off on-call duty:
function goOffCall(shiftId, doctorName) {
const res = db.shifts.updateOne(
{
_id: shiftId,
$expr: {
$gte: [
{
$size: {
$filter: {
input: "$doctors",
as: "d",
cond: {
$and: [
{ $ne: [ "$$d.name", doctorName ] },
{ $eq: [ "$$d.on_call", true ] }
]
}
}
}
},
1
]
},
"doctors.name": doctorName
},
{
$set: { "doctors.$.on_call": false, updated: new Date() }
}
);
return res.modifiedCount > 0 ? "OFF_OK" : "OFF_FAIL";
}
MongoDB is a document database with many array operators. Here, the condition checks that there is another doctor ($ne: ["$$d.name", doctorName]) who is on call ($eq: ["$$d.on_call", true]). It counts these doctors with $size and keeps only shifts where the count is at least 1. Since there is only one document per shift, if none is returned either the shift doesn’t exist, or the doctor is not in this shift, or there aren’t enough on-call doctors to let one go off call. The following calls show the return code:
test> goOffCall(1,"Alice");
OFF_OK
test> goOffCall(1,"Bob");
OFF_FAIL
Alice was allowed to go off‑call, but Bob couldn’t, because he was the only doctor remaining on‑call.
I added a simpler function to set a doctor on call for a shift:
function goOnCall(shiftId, doctorName) {
const res = db.shifts.updateOne(
{
_id: shiftId,
"doctors.name": doctorName,
"doctors.on_call": false
},
{
$set: { "doctors.$.on_call": true, updated: new Date() }
}
);
return res.modifiedCount > 0 ? "ON_OK" : "ON_FAIL";
}
Here is Alice back to on-call again:
test> goOnCall(1,"Alice");
ON_OK
I define an assertion function to verify the business rule for a shift by counting the doctors on call:
function checkOnCalls(shiftId) {
const pipeline = [
{ $match: { _id: shiftId } },
{ $project: {
onCallCount: {
$size: {
$filter: {
input: "$doctors",
as: "d",
cond: "$$d.on_call"
}
}
}
}
}
];
const result = db.shifts.aggregate(pipeline).toArray();
if (result.length && result[0].onCallCount < 1) {
print(`❌ ERROR! No doctors on call for shift ${shiftId}`);
return false;
}
return true;
}
Now, I run a loop that randomly sets Alice or Bob on or off call, and checks the business-rule assertion after each attempt:
const shiftId = 1;
const doctors = ["Alice", "Bob"];
const actions = [goOnCall, goOffCall];
let iteration = 0;
while (true) {
iteration++;
const doctor = doctors[Math.floor(Math.random() * doctors.length)];
const action = actions[Math.floor(Math.random() * actions.length)];
const result = action(shiftId, doctor);
print(`Shift ${shiftId}, Iteration ${iteration}: ${doctor} -> ${result}`);
if (!checkOnCalls(shiftId)) {
print(`🚨 Stopping: assertion broken at iteration ${iteration}`);
break; // exit loop immediately
}
}
I've run this loop in multiple sessions and confirmed that the "Doctor's On-Call" assertion is never violated. The loop runs indefinitely because MongoDB guarantees data integrity—update operations on a single document are ACID.
If you want to stop it and check that the assertion works, you can simply bypass the conditional update and set all doctors to off call:
db.shifts.updateOne(
{ _id: 1 },
{ $set: { "doctors.$[doc].on_call": false, updated: new Date() } },
{ arrayFilters: [ { "doc.on_call": true } ] }
);
The loops stop as soon as they detect the violation.
You can, and should, define schema validations on the part of your schema the application relies on, to be sure that no update bypasses the application model and logic. It is possible to add an 'at least one on‑call' rule:
db.runCommand({
collMod: "shifts",
validator: {
$expr: {
$gte: [
{
$size: {
$filter: {
input: "$doctors",
as: "d",
cond: { $eq: ["$$d.on_call", true] }
}
}
},
1
]
}
},
validationLevel: "strict"
});
My manual update now fails immediately with a document validation error.
Schema validation is a helpful safeguard, but it does not fully protect against write skew under race conditions. It runs on inserts and updates and, with validationLevel: "strict", raises an error on invalid documents—but only after MongoDB has already matched and targeted the document for update.
Key differences between conditional updates and schema validation:
| Approach | When Check Occurs | Failure Mode |
|---|---|---|
| Conditional updateOne | Before write, atomically with document match | Returns modifiedCount: 0 (no document updated) |
| Schema validation | After document match but before write | Returns DocumentValidationFailure error |
To prevent write skew, you need the correct condition in the update itself. Use schema validation as an extra safeguard for other changes, such as inserts.
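For instance (a hypothetical insert, assuming the validator above is in place and that Carol is a made-up doctor), a new shift document with nobody on call is rejected outright:

// Fails with a DocumentValidationFailure error: no doctor is on call.
db.shifts.insertOne({
  _id: 2,
  doctors: [
    { name: "Carol", on_call: false, updated: new Date() }
  ]
})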
In PostgreSQL, using one row per shift with a JSON array of doctors makes updates atomic and eliminates race conditions but reduces indexing flexibility (for example, range scans on array fields), so the serializable isolation level or normalization to parent-child is preferable. In MongoDB, storing a one-to-many relationship in a single document is native, and full indexing remains available—for example, you can index the updated field for each doctor's on-call status:
db.shifts.createIndex({ "doctors.updated": 1 });
This index supports equality, sorting, and range queries, such as finding shifts where the on-call status changed in the last hour:
const oneHourAgo = new Date(Date.now() - 60 * 60 * 1000);
db.shifts.find({ "doctors.updated": { $gte: oneHourAgo } });
db.shifts.find(
{ "doctors.updated": { $gte: oneHourAgo } }
).explain("executionStats")
Here is the execution plan:
executionStats: {
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 0,
totalKeysExamined: 1,
totalDocsExamined: 1,
executionStages: {
stage: 'FETCH',
nReturned: 1,
works: 2,
advanced: 1,
isEOF: 1,
docsExamined: 1,
inputStage: {
stage: 'IXSCAN',
nReturned: 1,
works: 2,
advanced: 1,
isEOF: 1,
keyPattern: {
'doctors.updated': 1,
_id: 1
},
indexName: 'doctors.updated_1__id_1',
isMultiKey: true,
multiKeyPaths: {
'doctors.updated': [
'doctors'
],
_id: []
},
direction: 'forward',
indexBounds: {
'doctors.updated': [
'[new Date(1770384644918), new Date(9223372036854775807)]'
],
_id: [
'[MinKey, MaxKey]'
]
},
keysExamined: 1,
seeks: 1,
}
}
},
MongoDB's document model and optimistic concurrency control solve the "Doctor's On-Call Shift" problem without explicit locks, serializable isolation, or SQL assertions. By embedding business logic in conditional updateOne operations using $expr and array operators, you can prevent write-skew anomalies at the database level.
Atomic, document-level operations combined with a "First Updater Wins" rule ensure that concurrent updates to the same shift document yield exactly one success and one failure. This approach leverages MongoDB's strengths: a rich document model, expressive array operators, and atomic single-document updates.
Schema validation alone cannot prevent race conditions, but together with conditional updates it protects against concurrency anomalies and direct data corruption.
This pattern shows how MongoDB's document model can simplify concurrency problems that would otherwise require advanced transaction isolation or explicit locking in relational databases. By co-locating related data and using atomic operations, you can maintain integrity with simpler, faster code.
How does your computer create the illusion of running dozens of applications simultaneously when it only has a few physical cores?
Wait, I forgot the question because I am now checking my email. Ok, back to it...
The answer is CPU Virtualization. Chapters 6 and 7 of OSTEP explore the engine behind this illusion, and how to balance raw performance with absolute control.
The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.
The crux of the challenge is: how do we run programs efficiently without letting them take over the machine?
The solution is Limited Direct Execution (LDE) --the title spoils it. "Direct Execution" means the program runs natively on the CPU for maximum speed. "Limited" means the OS retains authority to stop the process and prevent restricted access. This requires some hardware support.
To prevent chaos, hardware provides two execution modes. Applications run in "User Mode", where they cannot perform privileged actions like I/O. The OS runs in "Kernel Mode" with full access to the machine. When a user program needs a privileged action, it initiates a "System Call". This triggers a 'trap' instruction that jumps into the kernel and raises the privilege level. To ensure security, the OS programs a "trap table" at boot time, telling the hardware exactly which code to run for each event.
If a process enters an infinite loop, how does the OS get the CPU back? The answer is the timer interrupt: the hardware interrupts the CPU at a fixed interval and jumps into an OS handler, so even a program that never makes a system call can be preempted.
Finally, when the OS regains control and decides to switch to a different process, it executes a "context switch". This low-level assembly routine saves the current process's registers to its kernel stack and restores the next process's registers. By switching the stack pointer, the OS tricks the hardware: the 'return-from-trap' instruction returns into the new process instead of the old one.
With the switching mechanism in place as discussed in Chapter 6, our next piece to attack is deciding which process to run next. Chapter 7 explores these policies, initially assuming all jobs arrive at once and have known run-times.
First, let's look at batch scheduling using "Turnaround Time". This metric is simply the time a job completes minus the time it arrived (T_completion - T_arrival). With this metric, FIFO suffers from the convoy effect when a long job arrives first, Shortest Job First (SJF) fixes that when all jobs arrive together, and Shortest Time-to-Completion First (STCF) adds preemption to handle jobs that arrive later.
Now, we consider interactivity using "Response Time". This is the time from when a job arrives to the first time it is scheduled (T_response = T_firstrun - T_arrival).
STCF is great for turnaround but terrible for response time; a user might wait seconds for their interactive job (say a terminal session) to start. Round Robin solves this by time-slicing: it runs a job for a set quantum (e.g., 10 ms) and then switches. This makes the system feel responsive.
However, Round Robin creates a trade-off. While it optimizes fairness and response time, it destroys turnaround time by stretching out the completion of every job. You cannot have your cake and eat it too. See Fig 7.6 & 7.7.
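To make the trade-off concrete, here is a small JavaScript sketch (mine, not from the book) that computes both metrics for SJF and for Round Robin with a quantum of 1, on three jobs of lengths 100, 10, and 10 that all arrive at t=0:

// Toy sketch, not from OSTEP: average turnaround vs. average response time.
function sjf(lengths) {
  const order = [...lengths].sort((a, b) => a - b);   // run shortest jobs first
  let t = 0, turnaround = 0, response = 0;
  for (const len of order) {
    response += t;        // T_firstrun - T_arrival, with T_arrival = 0
    t += len;
    turnaround += t;      // T_completion - T_arrival
  }
  return { avgTurnaround: turnaround / lengths.length, avgResponse: response / lengths.length };
}

function roundRobin(lengths, quantum = 1) {
  const remaining = [...lengths];
  const firstRun = new Array(lengths.length).fill(null);
  const completion = new Array(lengths.length).fill(0);
  let t = 0;
  while (remaining.some(r => r > 0)) {
    for (let i = 0; i < remaining.length; i++) {
      if (remaining[i] <= 0) continue;
      if (firstRun[i] === null) firstRun[i] = t;      // first time this job gets the CPU
      const slice = Math.min(quantum, remaining[i]);
      remaining[i] -= slice;
      t += slice;
      if (remaining[i] === 0) completion[i] = t;
    }
  }
  const n = lengths.length;
  return {
    avgTurnaround: completion.reduce((a, b) => a + b, 0) / n,
    avgResponse: firstRun.reduce((a, b) => a + b, 0) / n
  };
}

console.log(sjf([100, 10, 10]));        // { avgTurnaround: 50, avgResponse: 10 }
console.log(roundRobin([100, 10, 10])); // { avgTurnaround: ~59.7, avgResponse: 1 }

SJF gives an average turnaround of 50 but an average response of 10; Round Robin answers almost immediately (average response 1) while pushing average turnaround to roughly 60.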
Finally, real programs perform I/O. When a process blocks waiting for a disk, the scheduler treats the time before the I/O as a "sub-job". By running another process during this wait, the OS maximizes "overlap" and system utilization.
There is one last assumption we did not relax: the OS does not actually know how long a job will run. This "No Oracle" problem sets the stage for the next chapter on the "Multi-Level Feedback Queue", which predicts the future by observing the past.
To conclude Chapter 7, it is worth remembering that there is no silver bullet. The best policy depends entirely on the workload. The more you know about what you are running, the better you can schedule it.
Last week Anthropic released a report on disempowerment patterns in real-world AI usage which finds that roughly one in 1,000 to one in 10,000 conversations with their LLM, Claude, fundamentally compromises the user’s beliefs, values, or actions. They note that the prevalence of moderate to severe “disempowerment” is increasing over time, and conclude that the problem of LLMs distorting a user’s sense of reality is likely unfixable so long as users keep holding them wrong:
However, model-side interventions are unlikely to fully address the problem. User education is an important complement to help people recognize when they’re ceding judgment to an AI, and to understand the patterns that make that more likely to occur.
In unrelated news, some folks have asked me about Prothean Systems’ new paper. You might remember Prothean from October, when they claimed to have passed all 400 tests on ARC-AGI-2—a benchmark that only had 120 tasks. Unsurprisingly, Prothean has not claimed their prize money, and seems to have abandoned claims about ARC-AGI-2. They now claim to have solved the Navier-Stokes existence and smoothness problem.
The Clay Mathematics Institute offers a $1,000,000 Millennium Prize for proving either global existence and smoothness of solutions, or demonstrating finite-time blow-up for specific initial conditions.
This system achieves both.
At the risk of reifying XKCD 2501, this is a deeply silly answer to an either-or question. You cannot claim that all conditions have a smooth solution, and also that there is a condition for which no smooth solution exists. This is like being asked to figure out whether all apples are green, or at least one red one exists, and declaring that you’ve done both. Prothean Systems hasn’t just failed to solve the problem—they’ve failed to understand the question.
Prothean goes on to claim that the “demonstration at BeProthean.org provides immediate, verifiable evidence” of their proof. This too is obviously false. As the Clay paper explains, the velocity field must have zero divergence, which is a fancy way of saying that the fluid is incompressible; it can’t be squeezed down or spread out. One of the demo’s “solutions” squeezes everything down to a single point, and another shoves particles away from the center. Both clearly violate Navier-Stokes.
My background is in physics and software engineering, and I’ve written several numeric solvers for various physical systems. Prothean’s demo (initFluidSimulator) is a simple Euler’s method solver with four flavors of externally-applied acceleration, plus a linear drag term to compensate for all the energy they’re dumping into the system. There’s nothing remotely Navier-Stokes-shaped there.
The paper talks about a novel “multi-tier adaptive compression architecture” which “operates on semantic structure rather than raw binary patterns”, enabling “compression ratios exceding 800:1”. How can we tell? Because “the interactive demonstration platform at BeProthean.org provides hands-on capability verification for technical evaluation”.
Prothean’s compression demo wasn’t real in October, and it’s not real today. This time it’s just bog-standard DEFLATE, the same used in .zip files. There’s some fake log messages to make it look like it’s doing something fancy when it’s not.
document.getElementById('compress-status').textContent = `Identifying Global Knowledge Graph Patterns...`;
const stream = file.stream().pipeThrough(new CompressionStream('deflate-raw'));
There’s a fake “Predictive vehicle optimization” tool that has you enter a VIN, then makes up imaginary “expected power gain” and “efficiency improvement” numbers. These are based purely on a hash of the VIN characters, and have nothing to do with any kind of car. Prothean is full of false claims like this, and somehow they’re offering organizational licenses for it.
It’s not just Prothean. I feel like I’ve been trudging through a wave of LLM nonsense recently. In the last two weeks alone, I’ve watched software engineers use Claude to suggest fatuous changes to my software, like an “improvement” to an error message which deleted key guidance. Contractors proffering LLM-slop descriptions of appliances. Claude-generated documents which made bonkers claims, like saying a JVM program I wrote provided “faster iteration” thanks to “no JVM startup”. Cold emails asking me to analyze dreamlike, vaguely-described software systems—one of whom, in our introductory call, couldn’t even begin to explain what they’d built or what it was for. Someone who claimed to be an engineer wanting to help with fault-injection work on Jepsen, then turned out to be a scammer soliciting investment in their AI video chatbot project.
When people or companies intentionally make false claims about the work they’re doing or the products they’re selling, we call it fraud. What is it when one overlooks LLM mistakes? What do we call it when a person sincerely believes the lies an LLM has told them, and repeats those lies to others? Dedicates months of their life to a transformer model’s fever dream?
Anthropic’s paper argues reality distortion is rare in software domains, but I’m not so sure.
This stuff keeps me up at night. I wonder about my fellow engineers who work at Anthropic, at OpenAI, on Google’s Gemini. I wonder if they see as much slop as I do. How many of their friends or colleagues have been sucked into LLM rabbitholes. I wonder if they too lie awake at three AM, staring at the ceiling, wondering about the future and their role in making it.
Back in 2005, when I first joined the SUNY Buffalo CSE department, the department secretary was a wonderful lady named Joann, who was over 60. She explained that my travel reimbursement process was simple: I'd just hand her the receipts after my trip, she'd fill out the necessary forms, submit them to the university, and within a month, the reimbursement check would magically appear in my department mailbox.
She handled this for every single faculty member, all while managing her regular secretarial duties. Honestly, despite the 30-day turnaround, it was the most seamless reimbursement experience I've ever had.
But over time the department grew, and Joann moved on. The university partnered with Concur, as corporations do, forcing us to file our own travel reimbursements through this system. Fine, I thought, more work for me, but it can't be too bad. But, the department also appointed a staff member to audit our Concur submissions.
This person's job wasn't to help us file reimbursements, but to audit the forms to find errors. Slowly but surely, it became routine for every single travel submission to be returned (sometimes multiple times) for minor format irregularities or rule violations. These were petty violations no human would care about if the goal were simply to get people reimbursed. The experience degraded from effortless to what could be perceived as adversarial.
This was a massive downgrade from the Joann era.
This story (probably all too familiar to many) illustrates the danger of not setting the right intention regarding friction. If the goal isn't actively set to help and streamline the process (if the intention isn't "how do we solve this?"), the energy of the system inevitably shifts toward finding problems. Friction becomes the product.
This dynamic is not just true for organizations, it is also true for each of us.
We have to manage the stories we tell ourselves. These stories, whether we tell them knowingly or unknowingly, determine how we manage/conduct ourselves, which in turn determines our success. Just as organizations can start to manufacture friction, individuals can do the same internally. You can install an internal auditor in your own mind.
When intention shifts away from growth, things degrade. You stop asking how to move forward and start looking for violations. You nitpick and reject your own efforts before they have a chance to mature. You begin to find ways to grate against your own progress.
I wrote about this concept previously in my post "Your attitude determines your success". That post tends to get two very different reactions. It gets nitpicked to pieces by cynics (the auditors), and it gets a silent knowing nod from people in the know (the builders). Brooker recently wrote career advice along the same lines, reinforcing that high agency mindset. In a similar vein, I recently wrote about optimizing for momentum.
“When there is a will, there is a way,” as the saying goes. Get the intention right and friction dissolves. Get it wrong and you may end up weaponizing process, tooling, and auditing against your own goals.
I often enjoy vibe coding, but I think we’re still far away from AI writing all your code. Newer models improve your development speed, even for complex applications. However, writing a usable browser from scratch without the heavy involvement of an experienced engineer is certainly not something that’s currently possible.
For me, vibe coding works well most of the time if I have a very good understanding of what I need. For tasks that Claude Code solves well, it saves me a lot of time, but it’s still not “hands-off”. To converge to an acceptable result faster, I often need to give very specific instructions (e.g. “don’t manually create and delete temporary files in Python, just use the tempfile module”). Sometimes, I also just waste a lot of time and don’t get any working result at all.
I generally use Claude Code (currently with Opus 4.5) and regularly try it out for new tasks or older tasks that haven’t worked well in the past. This is a collection of my experience with specific tasks:
✅ Writing a Rust program to do ARP pings over a range of VLAN ids and IP subnets (to scan my local network):
Claude Code found suitable libraries to generate and send raw packets, understood how to generate ARP packets with VLAN tags and understood how to scan through IPs of a given subnet. It also wrote a raw packet receiver to process ARP responses and added sensible cmdline arguments. Because I already knew very specifically what I wanted, this worked super well. I didn’t need to write a single line of code myself.
✅ Looking at my network configuration (a bunch of config files and screenshots of switch/AP management interfaces) and translating this into a human-readable Markdown file describing the network:
Claude Code asked for missing context (e.g., physical layout of the switch ports) which I think is crucial: In my experience, missing context often leads to bad results with LLMs, so Anthropic does a good job there. It even generated an overview in an SVG file that was correct!
❌ Writing a custom clang-tidy matcher for our internal C++ code style:
It turned out that the matcher I wanted to write just isn’t possible with the current Clang AST API. Claude Code tried a lot of different things (only some of its attempts compiled); I had to write a lot of code manually to guide it toward a possible solution, and to dig through the Clang source code (with Claude Code’s help) to verify what it was claiming. Eventually, I (not Claude!) understood that the matcher I wanted to write just wasn’t possible and abandoned the project after several hours.
✅ Regularly asking very detailed questions about the core CedarDB database code base:
I know the code base very well, so I tend to ask specific questions such as “We push a data block to S3 once the number of buffered rows exceeds a threshold. What’s the threshold exactly, where do we set it, and where do we check if it’s exceeded?”. Claude Code manages to answer these questions precisely, giving me exact code locations, even if answering the question requires understanding 10+ different source files in detail. This also works well with other code bases I’m familiar with, such as PostgreSQL or LLVM.
❌ Writing a float parser for hexfloats in C++:
I had a specific algorithm in my head that I wanted to try out and implement. Claude Code got the boilerplate and parsing basic numbers correct. It even wrote several helpful test cases. Where I really wasted a lot of time was on the edge cases: overflows around the edges of representable numbers, subnormal numbers, NaN payloads, etc. Even for the tests, Claude Code really wanted to use standard library functions to verify correctness. But the standard library functions don’t handle these edge cases consistently (which is why I wrote a custom parser in the first place), and I couldn’t convince Claude otherwise. So I ended up writing the edge cases manually, having wasted an hour talking to Claude.
✅ Writing ansible modules in Python for different tasks that I hacked with ansible.builtin.shell before:
Claude Code processed my hacky shell scripts, understood what I wanted to do, and created equivalent Python modules. The modules also have support for check mode and display good error messages.
✅ Writing a Python script that generates a static HTML file from a list of backups (using borg backup):
I wanted a quick read-only overview of my backups. A static HTML page was the easiest solution for me, no monitoring stack required.
I have found that the more I know about a certain problem and programming language, the better the result I get from an LLM. I will definitely continue using LLMs to write boilerplate code and to solve (coding) problems that have been solved before. For really novel problems or algorithms, I think LLMs can assist very well, but I’m not yet satisfied with the quality of the code.
This is an external post of mine. Click here if you are not redirected.